On the Reliability of Clustering Stability in the Large Sample Regime - Supplementary Material

Ohad Shamir and Naftali Tishby
School of Computer Science and Engineering
Interdisciplinary Center for Neural Computation
The Hebrew University, Jerusalem 91904, Israel
{ohadsh,tishby}@cs.huji.ac.il

A Exact Formulation of the Sufficient Conditions

In this section, we give a mathematically rigorous formulation of the sufficient conditions discussed in the main paper. For that we will need some additional notation.

First of all, it will be convenient to define a scaled version of our distance measure $d_{\mathcal{D}}(A_k(S_1), A_k(S_2))$ between clusterings. Formally, define the random variable

$$d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2)) := \sqrt{m}\, d_{\mathcal{D}}(A_k(S_1), A_k(S_2)) = \sqrt{m}\, \Pr_{x\sim\mathcal{D}}\Big(\arg\max_i f_{\hat{\theta},i}(x) \neq \arg\max_i f_{\tilde{\theta},i}(x)\Big),$$

where $\hat{\theta}, \tilde{\theta}$ are the solutions returned by $A_k(S_1), A_k(S_2)$, and $S_1, S_2$ are random samples, each of size $m$, drawn i.i.d. from the underlying distribution $\mathcal{D}$. The scaling by the square root of the sample size will allow us to analyze the non-trivial asymptotic behavior of these distance measures, which without scaling simply converge to zero in probability as $m \to \infty$.

For some $\epsilon > 0$ and a set $S \subseteq \mathbb{R}^n$, let $B_\epsilon(S)$ be the $\epsilon$-neighborhood of $S$, namely

$$B_\epsilon(S) := \Big\{ x \in \mathcal{X} : \inf_{y\in S} \|x - y\|_2 \leq \epsilon \Big\}.$$

In this paper, when we talk about neighborhoods in general, we will always assume they are uniform (namely, contain an $\epsilon$-neighborhood for some positive $\epsilon$).

We will also need to define the following variant of $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2))$, where we restrict ourselves to the mass in some subset of $\mathbb{R}^n$. Formally, we define the restricted distance between two clusterings, with respect to a set $B \subseteq \mathbb{R}^n$, as

$$d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B) := \sqrt{m}\, \Pr_{x\sim\mathcal{D}}\Big(\arg\max_i f_{\hat{\theta},i}(x) \neq \arg\max_i f_{\tilde{\theta},i}(x),\; x \in B\Big). \qquad (1)$$

In particular, $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}}(\cup_{i<j} F_{\theta_0,i,j}))$ refers to the mass which switches clusters and is also inside an $r/\sqrt{m}$-neighborhood of the limit cluster boundaries (where the boundaries are defined with respect to $f_{\theta_0}(\cdot)$). Once again, when $S_1, S_2$ are random samples, we can think of it as a random variable with respect to drawing and clustering $S_1, S_2$.
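In practice $\mathcal{D}$ is unknown, so both distance measures above are approximated by the fraction of points in a fresh sample whose $\arg\max$ assignment differs between the two fitted clusterings. The following sketch is illustrative only: the function names, and the representation of a clustering by its association function, are our own choices rather than notation from the paper.

```python
import numpy as np

def scaled_clustering_distance(f_hat, f_tilde, eval_sample, m, mask=None):
    """Monte Carlo estimate of d^m_D(A_k(S1), A_k(S2)) (and, with a mask,
    of the restricted distance of Eq. (1)): sqrt(m) times the fraction of
    evaluation points whose argmax cluster differs between the two solutions.

    f_hat, f_tilde: callables mapping an (n_points, dim) array to an
        (n_points, k) array of cluster-association values f_{theta,i}(x).
    eval_sample: points drawn i.i.d. from D, standing in for Pr_D.
    m: size of the samples from which the two clusterings were computed.
    mask: optional boolean array marking which evaluation points lie in B.
    """
    switched = (np.argmax(f_hat(eval_sample), axis=1)
                != np.argmax(f_tilde(eval_sample), axis=1))
    if mask is not None:
        switched = switched & mask
    return np.sqrt(m) * switched.mean()
```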
Conditions. The following conditions shall be assumed to hold:

1. Consistency Condition: $\hat{\theta}$ converges in probability (over drawing and clustering a sample of size $m$, $m\to\infty$) to some $\theta_0 \in \Theta$. Furthermore, the association of clusters to indices $\{1,\dots,k\}$ is constant in some neighborhood of $\theta_0$.

2. Central Limit Condition: $\sqrt{m}(\hat{\theta} - \theta_0)$ converges in distribution to a multivariate zero mean Gaussian random variable $Z$.

3. Regularity Conditions:

(a) $f_\theta(x)$ is Sufficiently Smooth: For any $\theta$ in some neighborhood of $\theta_0$, and any $x$ in some neighborhood of the cluster boundaries $\cup_{i<j} F_{\theta_0,i,j}$, $f_\theta(x)$ is twice continuously differentiable with respect to $\theta$, with a non-zero first derivative and a uniformly bounded second derivative for any such $x$. Both $f_{\theta_0}(x)$ and $(\partial/\partial\theta) f_{\theta_0}(x)$ are twice differentiable with respect to any $x \in \mathcal{X}$, with a uniformly bounded second derivative.

(b) Limit Cluster Boundaries are Reasonably Nice: For any two clusters $i, j$, $F_{\theta_0,i,j}$ is either empty, or a compact, non-self-intersecting, orientable $(n-1)$-dimensional hypersurface in $\mathbb{R}^n$ with finite positive volume, a boundary (edge), and with a neighborhood contained in $\mathcal{X}$ in which the underlying density function $p(\cdot)$ is continuous. Moreover, the gradient $\nabla(f_{\theta_0,i}(\cdot) - f_{\theta_0,j}(\cdot))$ has positive magnitude everywhere on $F_{\theta_0,i,j}$.

(c) Intersections of Cluster Boundaries are Relatively Negligible: For any two distinct non-empty cluster boundaries $F_{\theta_0,i,j}, F_{\theta_0,i',j'}$, we have that

$$\frac{1}{\epsilon}\int_{B_\epsilon(F_{\theta_0,i,j}) \cap B_\gamma(F_{\theta_0,i',j'})} 1\, dx, \qquad \frac{1}{\epsilon}\int_{B_\epsilon(F_{\theta_0,i,j}) \cap B_\gamma(\partial F_{\theta_0,i,j})} 1\, dx$$

converge to $0$ as $\epsilon, \gamma \to 0$ (in any manner), where $\partial F_{\theta_0,i,j}$ is the edge of $F_{\theta_0,i,j}$.

(d) Minimal Parametric Stability: It holds for some $\epsilon > 0$ that

$$\Pr\Big( d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2)) \neq d^m_{\mathcal{D}}\big(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}}(\cup_{i<j} F_{\theta_0,i,j})\big) \Big) = O(r^{-3-\epsilon}) + o(1),$$

where $o(1)\to 0$ as $m\to\infty$. Namely, the mass of $\mathcal{D}$ which switches between clusters is with high probability inside thin strips around the limit cluster boundaries, and this high probability increases at least polynomially as the width of the strips increases (see below for a further discussion of this).

The regularity assumptions are relatively mild, and can usually be inferred from the consistency and central limit conditions, as well as the specific clustering framework under consideration. For example, condition 3c and the assumptions on $F_{\theta_0,i,j}$ in condition 3b are fulfilled in a clustering framework where the clusters are separated by hyperplanes. As to condition 3d, suppose our clustering framework is such that the cluster boundaries depend on $\hat{\theta}$ in a smooth manner. Then the asymptotic normality of $\hat{\theta}$, with variance $O(1/m)$, and the compactness of $\mathcal{X}$, will generally imply that the cluster boundaries obtained from clustering a sample are contained with high probability inside strips of width $O(1/\sqrt{m})$ around the limit cluster boundaries. More specifically, the asymptotic probability of this happening for strips of width $r/\sqrt{m}$ will be exponentially high in $r$, due to the asymptotic normality of $\hat{\theta}$. As a result, the mass which switches between clusters, when we compare two independent clusterings, will be in those strips with probability exponentially high in $r$. Therefore, condition 3d will hold by a large margin, since only polynomially high probability is required there.
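The $O(1/\sqrt{m})$ boundary fluctuation invoked above is easy to see numerically. The toy simulation below is our own illustration, not part of the formal argument: a one-dimensional, two-cluster setting with a deliberately simple clustering rule. It fits a boundary on repeated samples of size $m$ and checks that its standard deviation shrinks like $1/\sqrt{m}$, i.e. that $\sqrt{m}$ times the fluctuation stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitted_boundary(sample):
    """A deliberately simple 1-D clustering: split the sorted sample in half
    and place the cluster boundary at the midpoint of the two half-means."""
    s = np.sort(sample)
    left, right = s[: len(s) // 2], s[len(s) // 2:]
    return 0.5 * (left.mean() + right.mean())

def draw(m):
    """Sample m points from a balanced mixture of N(-2, 1) and N(2, 1)."""
    centers = np.where(rng.integers(0, 2, size=m) == 0, -2.0, 2.0)
    return rng.normal(centers, 1.0)

for m in (100, 400, 1600, 6400):
    boundaries = [fitted_boundary(draw(m)) for _ in range(200)]
    fluct = np.std(boundaries)
    print(f"m={m:5d}   std of boundary = {fluct:.4f}   sqrt(m)*std = {np.sqrt(m) * fluct:.2f}")
```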
B Proofs - General Remarks

The proofs will use the additional notation and the sufficient conditions, as presented in Sec. A. Throughout the proofs, we will sometimes use the stochastic order notation $O_p(\cdot)$ and $o_p(\cdot)$ (cf. [8]), defined as follows. Let $\{X_m\}$ and $\{Y_m\}$ be sequences of random vectors, defined on the same probability space. We write $X_m = O_p(Y_m)$ to mean that for each $\delta > 0$ there exists a real number $M$ such that $\Pr(\|X_m\| \geq M\|Y_m\|) < \delta$ if $m$ is large enough. We write $X_m = o_p(Y_m)$ to mean that $\Pr(\|X_m\| \geq \epsilon\|Y_m\|) \to 0$ for each $\epsilon > 0$. Notice that $\{Y_m\}$ may also be non-random. For example, $X_m = o_p(1)$ means that $X_m \to 0$ in probability. When we write for example $X_m = Y_m + o_p(1)$, we mean that $X_m - Y_m = o_p(1)$.

C Proof of Proposition 1

By condition 3a, $f_{\hat{\theta}}(x)$ has a first order Taylor expansion with respect to any $\hat{\theta}$ close enough to $\theta_0$, with a remainder term uniformly bounded for any $x$:

$$f_{\hat{\theta}}(x) = f_{\theta_0}(x) + \left(\frac{\partial}{\partial\theta} f_{\theta_0}(x)\right)^{\!\top}(\hat{\theta} - \theta_0) + o(\|\hat{\theta} - \theta_0\|). \qquad (2)$$

By the asymptotic normality assumption, $\sqrt{m}\,\|\hat{\theta} - \theta_0\| = O_p(1)$, hence $\|\hat{\theta} - \theta_0\| = O_p(1/\sqrt{m})$. Therefore, we get from Eq. (2) that

$$\sqrt{m}\big(f_{\hat{\theta}}(x) - f_{\theta_0}(x)\big) = \left(\frac{\partial}{\partial\theta} f_{\theta_0}(x)\right)^{\!\top}\big(\sqrt{m}(\hat{\theta} - \theta_0)\big) + o_p(1), \qquad (3)$$

where the remainder term $o_p(1)$ does not depend on $x$. By regularity condition 3a and compactness of $\mathcal{X}$, $(\partial/\partial\theta) f_{\theta_0}(\cdot)$ is a uniformly bounded vector-valued function from $\mathcal{X}$ to the Euclidean space in which $\Theta$ resides. As a result, the mapping $\hat{\theta} \mapsto ((\partial/\partial\theta) f_{\theta_0}(\cdot))^{\!\top}\hat{\theta}$ is a mapping from $\Theta$, with the metric induced by the Euclidean space in which it resides, to the space of all uniformly bounded $\mathbb{R}^k$-valued functions on $\mathcal{X}$. We can turn the latter space into a metric space by equipping it with the obvious extension of the supremum norm (namely, for any two functions $f(\cdot), g(\cdot)$, $\|f - g\| := \sup_{x\in\mathcal{X}} \|f(x) - g(x)\|_\infty$, where $\|\cdot\|_\infty$ is the infinity norm in Euclidean space). With this norm, the mapping above is a continuous mapping between two metric spaces. We also know that $\sqrt{m}(\hat{\theta} - \theta_0)$ converges in distribution to a multivariate Gaussian random variable $Z$. By the continuous mapping theorem [8] and Eq. (3), this implies that $\sqrt{m}(f_{\hat{\theta}}(\cdot) - f_{\theta_0}(\cdot))$ converges in distribution to a Gaussian process $G(\cdot)$, where

$$G(\cdot) := \left(\frac{\partial}{\partial\theta} f_{\theta_0}(\cdot)\right)^{\!\top} Z. \qquad (4)$$
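The content of Proposition 1 is a delta-method argument: the fluctuation of the association function is, to first order, the gradient applied to the Gaussian fluctuation of $\hat{\theta}$. The toy check below is a one-dimensional, one-parameter example of our own choosing (with $f_\theta(x) = -(x-\theta)^2$ and $\hat{\theta}$ a sample mean), not a construction from the paper; it compares the empirical variance of $\sqrt{m}(f_{\hat{\theta}}(x) - f_{\theta_0}(x))$ with the variance of $G(x)$ predicted by Eq. (4).

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, m, x, trials = 0.0, 2000, 1.5, 5000

def f(theta, x):
    """A k-means-style association value of a single cluster centre theta."""
    return -(x - theta) ** 2

# In each trial, theta_hat is the mean of m i.i.d. N(theta0, 1) draws,
# so sqrt(m) * (theta_hat - theta0) is approximately Z ~ N(0, 1).
theta_hat = rng.normal(theta0, 1.0, size=(trials, m)).mean(axis=1)
lhs = np.sqrt(m) * (f(theta_hat, x) - f(theta0, x))

grad = 2.0 * (x - theta0)          # (d/dtheta) f_theta(x) evaluated at theta0
print("empirical variance of sqrt(m)*(f_hat - f_theta0):", lhs.var())
print("variance of G(x) = grad * Z predicted by Eq. (4):", grad ** 2)
```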
D Proof of Thm. 1

D.1 A High Level Description of the Proof

The full proof of Thm. 1 is rather long and technical, mostly due to the many technical subtleties that need to be taken care of. Since these might obscure the main ideas, we present here separately a general overview of the proof, without the finer details.

The purpose of the stability estimator $\widehat{\mathrm{instab}}_{k,q}$, scaled by $\sqrt{m}$, boils down to trying to assess the "expected" value of the random variable $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2))$: we estimate $q$ instantiations of $\sqrt{m}\, d_{\mathcal{D}}(A_k(S_1), A_k(S_2))$, and take their average. Our goal is to show that this average, taking $m\to\infty$, is likely to be close to the value $\mathrm{instab}(A_k, \mathcal{D})$ as defined in the theorem. The most straightforward way to go about it is to prove that $\mathrm{instab}(A_k, \mathcal{D})$ actually equals $\lim_{m\to\infty} \mathbb{E}\, d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2))$, and then use some large deviation bound to prove that $\sqrt{m}\,\widehat{\mathrm{instab}}_{k,q}$ is indeed close to it with high probability, if $q$ is large enough. Unfortunately, computing $\lim_{m\to\infty} \mathbb{E}\, d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2))$ is problematic. The reason is that the convergence tools at our disposal deal with convergence in distribution of random variables, but convergence in distribution does not necessarily imply convergence of expectations. In other words, we can try to analyze the asymptotic distribution of $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2))$, but the expected value of this asymptotic distribution is not necessarily the same as $\lim_{m\to\infty} \mathbb{E}\, d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2))$. As a result, we will have to take a more indirect route.

Here is the basic idea: instead of analyzing the asymptotic expectation of $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2))$, we analyze the asymptotic expectation of a different random variable, $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B)$, which was formally defined in Eq. (1). Informally, recall that $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2))$ is the (scaled) mass of the underlying distribution $\mathcal{D}$ which switches between clusters, when we draw and cluster two independent samples of size $m$. Then $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B)$ measures the subset of this mass which lies inside some $B \subseteq \mathbb{R}^n$. In particular, following the notation of Sec. A, we will pick $B$ to be $B_{r/\sqrt{m}}(\cup_{i<j} F_{\theta_0,i,j})$ for some $r > 0$. In words, this constitutes strips of width $r/\sqrt{m}$ around the limit cluster boundaries. Writing the above expression for $B$ as $B_{r/\sqrt{m}}$, we have that if $r$ is large enough, then $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$ is equal to $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2))$ with very high probability over drawing and clustering a pair of samples, for any large enough sample size $m$. Basically, this is because the fluctuations of the cluster boundaries, based on drawing and clustering a random sample of size $m$, cannot be too large, and therefore the mass which switches clusters is concentrated around the limit cluster boundaries, if $m$ is large enough.

The advantage of the 'surrogate' random variable $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$ is that it is bounded for any finite $r$, unlike $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2))$. With bounded random variables, convergence in distribution does imply convergence of expectations, and as a result we are able to calculate $\lim_{m\to\infty} \mathbb{E}\, d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$ explicitly. This will turn out to be very close to $\mathrm{instab}(A_k, \mathcal{D})$ as it appears in the theorem (in fact, we can make it arbitrarily close to $\mathrm{instab}(A_k, \mathcal{D})$ by making $r$ large enough). Using the fact that $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$ and $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2))$ are equal with very high probability, we show that conditioned on a highly probable event, $\sqrt{m}\,\widehat{\mathrm{instab}}_{k,q}$ is an unbiased estimator of the expected value of $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$, based on $q$ instantiations, for any sample size $m$. As a result, using large deviation bounds, we get that $\sqrt{m}\,\widehat{\mathrm{instab}}_{k,q}$ is close to $\mathbb{E}\, d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$, with a high probability which does not depend on $m$. Therefore, as $m\to\infty$, $\sqrt{m}\,\widehat{\mathrm{instab}}_{k,q}$ will be close to $\lim_{m\to\infty} \mathbb{E}\, d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$ with high probability. By picking $r$ to scale appropriately with $q$, our theorem follows.

For convenience, the proof is divided into two parts: in Subsec. D.2, we calculate $\lim_{m\to\infty} \mathbb{E}\, d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$ explicitly, while Subsec. D.3 executes the general plan outlined above to prove our theorem.

A few more words are in order about the calculation of $\lim_{m\to\infty} \mathbb{E}\, d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$ in Subsec. D.2, since it is rather long and involved in itself. Our goal is to perform this calculation without going through an intermediate step of explicitly characterizing the distribution of $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$. This is because the distribution might be highly dependent on the specific clustering framework, and thus it is unsuitable for the level of generality which we aim at (in other words, we do not wish to assume a specific clustering framework). The idea is as follows: recall that $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$ is the (scaled) mass of the underlying distribution $\mathcal{D}$, inside strips of width $r/\sqrt{m}$ around the limit cluster boundaries, which switches clusters when we draw and cluster two independent samples of size $m$. For any $x \in \mathcal{X}$, let $A_x$ be the event that $x$ switched clusters. Then we can write the expectation of $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$, by Fubini's theorem, as:

$$\mathbb{E}\, d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}}) = \sqrt{m}\,\mathbb{E}\int_{B_{r/\sqrt{m}}} \mathbf{1}(A_x)\, p(x)\, dx = \int_{B_{r/\sqrt{m}}} \sqrt{m}\,\Pr(A_x)\, p(x)\, dx. \qquad (5)$$

The heart of the proof is Lemma D.5, which considers what happens to the integral above inside a single strip near one of the limit cluster boundaries $F_{\theta_0,i,j}$. The main body of the proof then shows how the results of Lemma D.5 can be combined to give the asymptotic value of Eq. (5) when we take the integral over all of $B_{r/\sqrt{m}}$. The bottom line is that we can simply sum the contributions from each strip, because the intersection of these different strips is asymptotically negligible. All the other lemmas in Subsec. D.2 develop technical results needed for our proof.

Finally, let us describe the proof of Lemma D.5 in a bit more detail. It starts with an expression equivalent to the one in Eq. (5), and transforms it into an expression composed of a constant value, and a remainder term which converges to $0$ as $m\to\infty$. The development can be divided into a number of steps. The first step is rewriting everything using the asymptotic Gaussian distribution of the cluster association function $f_{\hat{\theta}}(x)$ for each $x$, plus remainder terms (Eq. (13)). Since we are integrating over $x$, special care is given to show that the convergence to the asymptotic distribution is uniform for all $x$ in the domain of integration. The second step is to rewrite the integral (which is over a strip around the cluster boundary) as a double integral along the cluster boundary itself, and along a normal segment at any point on the cluster boundary (Eq. (14)). Since the strips become arbitrarily small as $m\to\infty$, the third step consists of rewriting everything in terms of a Taylor expansion around each point on the cluster boundary (Eq. (16), Eq. (17) and Eq. (18)). The fourth and final step is a change of variables, and after a few more manipulations we get the required result.
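Before turning to the details, here is a compact sketch of the estimation protocol whose limit the theorem characterizes: cluster $q$ pairs of independent subsamples of size $m$, measure on a held-out sample the fraction of points that switch clusters within each pair, average, and scale by $\sqrt{m}$. The sketch is schematic; the exact bookkeeping of samples and the helper `cluster_fn` are our own simplifications of the protocol described in the main paper.

```python
import numpy as np

def stability_estimate(data, cluster_fn, m, q, rng):
    """Schematic sqrt(m) * instab-hat_{k,q}: average, over q pairs of disjoint
    subsamples of size m, of the held-out disagreement between the two
    clusterings fitted on each pair, scaled by sqrt(m).

    cluster_fn(sample) is assumed to return a callable mapping points to
    cluster indices (the argmax of the fitted association function).
    """
    eval_sample = data[rng.choice(len(data), size=m, replace=False)]
    disagreements = []
    for _ in range(q):
        idx = rng.choice(len(data), size=2 * m, replace=False)
        c1 = cluster_fn(data[idx[:m]])
        c2 = cluster_fn(data[idx[m:]])
        disagreements.append(np.mean(c1(eval_sample) != c2(eval_sample)))
    return np.sqrt(m) * np.mean(disagreements)
```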
D.2 Part 1: Auxiliary Result

As described in the previous subsection, we will need an auxiliary result (Proposition D.1 below), characterizing the asymptotic expected value of $d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}}(\cup_{i<j} F_{\theta_0,i,j}))$.

Proposition D.1. Let $r > 0$. Assuming the set of conditions from Sec. A holds, $\lim_{m\to\infty} \mathbb{E}\, d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}}(\cup_{i<j} F_{\theta_0,i,j}))$ is equal to

$$\frac{2}{\sqrt{\pi}}\big(1 - h(r)\big) \sum_{i<j} \int_{F_{\theta_0,i,j}} \frac{\sqrt{\mathrm{Var}(G_i(x) - G_j(x))}}{\|\nabla(f_{\theta_0,i}(x) - f_{\theta_0,j}(x))\|}\, p(x)\, dx,$$

where $h(r) = O(\exp(-r^2))$ and $G(\cdot)$ is the Gaussian process from Proposition 1.

We begin with the following geometric lemma.

Lemma D.1. Let $S \subseteq \mathbb{R}^n$ be a compact, orientable, non-self-intersecting $(n-1)$-dimensional hypersurface with an edge, satisfying the assumptions of regularity conditions 3b and 3c, and let $g(\cdot)$ be a function continuous in a neighborhood of $S$. Then for any $\epsilon > 0$,

$$\frac{1}{2\epsilon}\int_{B_\epsilon(S)} g(x)\, dx = \frac{1}{2\epsilon}\int_S \int_{-\epsilon}^{\epsilon} g(x + y\, n_x)\, dy\, dx + o(1), \qquad (6)$$

where $n_x$ is a unit normal vector to $S$ at $x$, and $o(1)\to 0$ as $\epsilon \to 0$.

Proof. Let $\tilde{B}_\epsilon(S)$ be a strip around $S$, composed of all points which are on some normal to $S$ and $\epsilon$-close to $S$:

$$\tilde{B}_\epsilon(S) := \{ y \in \mathbb{R}^n : \exists x \in S,\ \exists y' \in [-\epsilon,\epsilon],\ y = x + y' n_x \}.$$

Since $S$ is orientable, for small enough $\epsilon > 0$, $\tilde{B}_\epsilon(S)$ is diffeomorphic to $S\times[-\epsilon,\epsilon]$. In particular, the map $\phi: S\times[-\epsilon,\epsilon] \to \tilde{B}_\epsilon(S)$, defined by $\phi(x, y) = x + y\, n_x$, will be a diffeomorphism. Let $D(x,y)$ be the Jacobian of $\phi$ at the point $(x,y) \in S\times[-\epsilon,\epsilon]$. Note that $D(x,0) = 1$ for every $x \in S$. We now wish to claim that as $\epsilon\to 0$,

$$\frac{1}{2\epsilon}\int_{B_\epsilon(S)} g(x)\, dx = \frac{1}{2\epsilon}\int_{\tilde{B}_\epsilon(S)} g(x)\, dx + o(1). \qquad (7)$$

To see this, we begin by noting that $\tilde{B}_\epsilon(S) \subseteq B_\epsilon(S)$. Moreover, any point in $B_\epsilon(S)\setminus\tilde{B}_\epsilon(S)$ has the property that its projection to the closest point in $S$ is not along a normal to $S$, and thus it must be $\epsilon$-close to the edge of $S$. As a result of regularity condition 3c for $S$, and the fact that $g(\cdot)$ is continuous and hence uniformly bounded in the volume of integration, we get that the integration of $g(\cdot)$ over $B_\epsilon(S)\setminus\tilde{B}_\epsilon(S)$ is asymptotically negligible (relative to $\epsilon$, as $\epsilon\to 0$), and hence Eq. (7) is justified. By the change of variables theorem from multivariate calculus, followed by Fubini's theorem, and using the fact that $D$ is continuous and equals $1$ on $S\times\{0\}$,

$$\frac{1}{2\epsilon}\int_{\tilde{B}_\epsilon(S)} g(x)\, dx = \frac{1}{2\epsilon}\int_{S\times[-\epsilon,\epsilon]} g(x + y\, n_x)\, D(x,y)\, dx\, dy = \frac{1}{2\epsilon}\int_{-\epsilon}^{\epsilon}\int_S g(x + y\, n_x)\, D(x,y)\, dx\, dy = \frac{1}{2\epsilon}\int_{-\epsilon}^{\epsilon}\int_S g(x + y\, n_x)\, dx\, dy + o(1),$$

where $o(1)\to 0$ as $\epsilon\to 0$. Combining this with Eq. (7) yields the required result.

Lemma D.2. Let $(g_m : \mathcal{X}\to\mathbb{R})_{m=1}^{\infty}$ be a sequence of integrable functions, such that $g_m(x)\to 0$ uniformly for all $x$ as $m\to\infty$. Then for any $i, j \in\{1,\dots,k\}$, $i\neq j$,

$$\int_{B_{r/\sqrt{m}}(F_{\theta_0,i,j})} \sqrt{m}\, g_m(x)\, p(x)\, dx \to 0 \quad\text{as } m\to\infty.$$

Proof. By the assumptions on $(g_m(\cdot))_{m=1}^{\infty}$, there exists a sequence of positive constants $(b_m)_{m=1}^{\infty}$, converging to $0$, such that $|g_m(x)| \leq b_m$ for all $x$, and hence

$$\int_{B_{r/\sqrt{m}}(F_{\theta_0,i,j})} \sqrt{m}\, g_m(x)\, p(x)\, dx \leq b_m \int_{B_{r/\sqrt{m}}(F_{\theta_0,i,j})} \sqrt{m}\, p(x)\, dx.$$

For large enough $m$, $p(x)$ is bounded and continuous in the volume of integration. Applying Lemma D.1 with $\epsilon = r/\sqrt{m}$, we have that as $m\to\infty$,

$$b_m\sqrt{m}\int_{B_{r/\sqrt{m}}(F_{\theta_0,i,j})} p(x)\, dx = b_m\sqrt{m}\int_{F_{\theta_0,i,j}}\int_{-r/\sqrt{m}}^{r/\sqrt{m}} p(x + y\, n_x)\, dy\, dx + o(1) \leq b_m\sqrt{m}\,\frac{C}{\sqrt{m}} + o(1) = b_m C + o(1)$$

for some constant $C$ dependent on $r$ and the upper bound on $p(\cdot)$. Since $b_m$ converges to $0$, the expression in the lemma converges to $0$ as well.
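The geometric content of Lemma D.1 is that, up to lower-order terms, an $\epsilon$-neighborhood of a hypersurface can be integrated as "surface times normal segment". The toy computation below is our own illustration, with $S$ the unit circle in $\mathbb{R}^2$ (a closed curve, so the edge term of the lemma is absent) and an arbitrary smooth test function; it evaluates both sides of Eq. (6) on a grid and shows the gap vanishing as $\epsilon\to 0$.

```python
import numpy as np

def g(x, y):
    return 1.0 + x + y ** 2                      # an arbitrary continuous test function

def both_sides(eps, n=2000):
    theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    offs = np.linspace(-eps, eps, n)
    T, Y = np.meshgrid(theta, offs, indexing="ij")
    px, py = (1.0 + Y) * np.cos(T), (1.0 + Y) * np.sin(T)
    # lhs: (1/2eps) * integral of g over the annulus 1-eps <= r <= 1+eps,
    # written in polar coordinates (Jacobian r = 1 + Y)
    lhs = np.mean(g(px, py) * (1.0 + Y)) * 2.0 * np.pi
    # rhs: (1/2eps) * double integral along the circle and along each normal
    rhs = np.mean(g(px, py)) * 2.0 * np.pi
    return lhs, rhs

for eps in (0.2, 0.05, 0.01):
    lhs, rhs = both_sides(eps)
    print(f"eps={eps:.2f}   lhs={lhs:.4f}   rhs={rhs:.4f}   gap={abs(lhs - rhs):.5f}")
```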
Lemma D.3. Let $(X_m)$ and $(Y_m)$ be sequences of real random variables, such that $X_m, Y_m$ are defined on the same probability space, and $X_m - Y_m$ converges to $0$ in probability. Assume that $Y_m$ converges in distribution to a continuous random variable $Y$. Then $|\Pr(X_m \leq c) - \Pr(Y_m \leq c)|$ converges to $0$ uniformly for all $c\in\mathbb{R}$.

Proof. We will use the following standard fact (see for example section 7.2 of [4]): for any two real random variables $A, B$, any $c\in\mathbb{R}$ and any $\epsilon > 0$, it holds that $\Pr(A\leq c) \leq \Pr(B\leq c + \epsilon) + \Pr(|A - B| > \epsilon)$. From this inequality, it follows that for any $c\in\mathbb{R}$ and any $\epsilon > 0$,

$$|\Pr(X_m\leq c) - \Pr(Y_m\leq c)| \leq \big(\Pr(Y_m\leq c+\epsilon) - \Pr(Y_m\leq c)\big) + \big(\Pr(Y_m\leq c) - \Pr(Y_m\leq c-\epsilon)\big) + \Pr(|X_m - Y_m| \geq \epsilon). \qquad (8)$$

We claim that the r.h.s. of Eq. (8) converges to $0$ uniformly for all $c$, from which the lemma follows. To see this, we begin by noticing that $\Pr(|X_m - Y_m|\geq\epsilon)$ converges to $0$ for any $\epsilon$ by definition of convergence in probability. Next, $\Pr(Y_m\leq c')$ converges to $\Pr(Y\leq c')$ uniformly for all $c'\in\mathbb{R}$, since $Y$ is continuous (see section 1 of [6]). Moreover, since $Y$ is a continuous random variable, its distribution function is uniformly continuous, hence $\Pr(Y\leq c+\epsilon) - \Pr(Y\leq c)$ and $\Pr(Y\leq c) - \Pr(Y\leq c-\epsilon)$ converge to $0$ as $\epsilon\to 0$, uniformly for all $c$. Therefore, by letting $m\to\infty$, and $\epsilon\to 0$ at an appropriate rate compared to $m$, we have that the r.h.s. of Eq. (8) — and hence the l.h.s. — converges to $0$ uniformly for all $c$.

Lemma D.4. $\Pr\big(\langle a, \sqrt{m}(f_{\hat{\theta}}(x) - f_{\theta_0}(x))\rangle < b\big)$ converges to $\Pr(\langle a, G(x)\rangle < b)$ uniformly for any $x\in\mathcal{X}$, any $a\neq 0$ in some bounded subset of $\mathbb{R}^k$, and any $b\in\mathbb{R}$.

Proof. By Eq. (3),

$$\sqrt{m}\big(f_{\hat{\theta}}(x) - f_{\theta_0}(x)\big) = \left(\frac{\partial}{\partial\theta} f_{\theta_0}(x)\right)^{\!\top}\big(\sqrt{m}(\hat{\theta}-\theta_0)\big) + o_p(1),$$

where the remainder term does not depend on $x$. Thus, for any $a$ in a bounded subset of $\mathbb{R}^k$,

$$\Big\langle a,\ \sqrt{m}\big(f_{\hat{\theta}}(x) - f_{\theta_0}(x)\big)\Big\rangle = \Big\langle \left(\tfrac{\partial}{\partial\theta} f_{\theta_0}(x)\right) a,\ \sqrt{m}(\hat{\theta}-\theta_0)\Big\rangle + o_p(1), \qquad (9)$$

where the convergence in probability is uniform for all bounded $a$ and $x\in\mathcal{X}$. We now need a result which tells us when convergence in distribution is uniform. By Thm. 4.2 in [6], if a sequence of random vectors $(X_m)_{m=1}^{\infty}$ in Euclidean space converges to a random variable $X$ in distribution, then $\Pr(\langle y, X_m\rangle < b)$ converges to $\Pr(\langle y, X\rangle < b)$ uniformly for any vector $y$ and $b\in\mathbb{R}$. We note that a stronger result (Thm. 6 in [2]) apparently allows us to extend this to cases where $X_m$ and $X$ reside in some infinite dimensional, separable Hilbert space (for example, if $\Theta$ is a subset of an infinite dimensional reproducing kernel Hilbert space in kernel clustering). Therefore, recalling that $\sqrt{m}(\hat{\theta}-\theta_0)$ converges in distribution to a random normal vector $Z$, we have that uniformly for all $x, a, b$,

$$\Pr\Big(\Big\langle \left(\tfrac{\partial}{\partial\theta} f_{\theta_0}(x)\right) a,\ \sqrt{m}(\hat{\theta}-\theta_0)\Big\rangle < b\Big) = \Pr\Big(\Big\langle \left(\tfrac{\partial}{\partial\theta} f_{\theta_0}(x)\right) a,\ Z\Big\rangle < b\Big) + o(1) = \Pr\big(\langle a, G(x)\rangle < b\big) + o(1). \qquad (10)$$

Here we think of $(\partial/\partial\theta\, f_{\theta_0}(x))\, a$ as the vector $y$ to which we apply the theorem. By regularity condition 3a, and assuming $a\neq 0$, we have that $\langle(\partial/\partial\theta\, f_{\theta_0}(x))\, a,\ Z\rangle$ is a continuous real random variable for any $x$, unless $Z = 0$, in which case the lemma is trivial. Therefore, the conditions of Lemma D.3 apply: the two sides of Eq. (9) give us two sequences of random variables which converge in probability to each other, and by Eq. (10) we have convergence in distribution of one of the sequences to a fixed continuous random variable. Therefore, using Lemma D.3, we have that

$$\Pr\Big(\big\langle a, \sqrt{m}(f_{\hat{\theta}}(x) - f_{\theta_0}(x))\big\rangle < b\Big) = \Pr\Big(\Big\langle \left(\tfrac{\partial}{\partial\theta} f_{\theta_0}(x)\right) a,\ \sqrt{m}(\hat{\theta}-\theta_0)\Big\rangle < b\Big) + o(1), \qquad (11)$$

where the convergence is uniform for any bounded $a\neq 0$, $b$ and $x\in\mathcal{X}$. Combining Eq. (10) and Eq. (11) gives us the required result.

Lemma D.5. Fix some two clusters $i, j$. Assuming the expression below is integrable, we have that

$$\int_{B_{r/\sqrt{m}}(F_{\theta_0,i,j})} 2\sqrt{m}\,\Pr\big(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) < 0\big)\Pr\big(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) > 0\big)\, p(x)\, dx = \frac{2}{\sqrt{\pi}}\big(1 - h(r)\big)\int_{F_{\theta_0,i,j}} \frac{\sqrt{\mathrm{Var}(G_i(x) - G_j(x))}}{\|\nabla(f_{\theta_0,i}(x) - f_{\theta_0,j}(x))\|}\, p(x)\, dx + o(1),$$

where the probabilities are over drawing and clustering a sample of size $m$, $o(1)\to 0$ as $m\to\infty$, and $h(r) = O(\exp(-r^2))$.
Proof. Define $a\in\mathbb{R}^k$ as $a_i = 1$, $a_j = -1$, and $0$ for any other entry. Applying Lemma D.4 with $a$ as above, we have that uniformly for all $x$ in some small enough neighborhood around $F_{\theta_0,i,j}$:

$$\Pr\big(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) < 0\big) = \Pr\Big(\sqrt{m}\big(f_{\hat{\theta},i}(x) - f_{\theta_0,i}(x)\big) - \sqrt{m}\big(f_{\hat{\theta},j}(x) - f_{\theta_0,j}(x)\big) < \sqrt{m}\big(f_{\theta_0,j}(x) - f_{\theta_0,i}(x)\big)\Big) = \Pr\Big(G_i(x) - G_j(x) < \sqrt{m}\big(f_{\theta_0,j}(x) - f_{\theta_0,i}(x)\big)\Big) + o(1),$$

where $o(1)$ converges uniformly to $0$ as $m\to\infty$. Since $G_i(x) - G_j(x)$ has a zero mean normal distribution, we can rewrite the above (if $\mathrm{Var}(G_i(x)-G_j(x)) > 0$) as

$$\Pr\left(\frac{G_i(x) - G_j(x)}{\sqrt{\mathrm{Var}(G_i(x)-G_j(x))}} < \frac{\sqrt{m}\big(f_{\theta_0,j}(x) - f_{\theta_0,i}(x)\big)}{\sqrt{\mathrm{Var}(G_i(x)-G_j(x))}}\right) + o(1) = \Phi\left(\frac{\sqrt{m}\big(f_{\theta_0,j}(x) - f_{\theta_0,i}(x)\big)}{\sqrt{\mathrm{Var}(G_i(x)-G_j(x))}}\right) + o(1), \qquad (12)$$

where $\Phi(\cdot)$ is the cumulative standard normal distribution function. Notice that by some abuse of notation, the expression is also valid in the case where $\mathrm{Var}(G_i(x)-G_j(x)) = 0$. In that case, $G_i(x)-G_j(x)$ is equal to $0$ with probability $1$, and thus $\Pr(G_i(x)-G_j(x) < \sqrt{m}(f_{\theta_0,j}(x)-f_{\theta_0,i}(x)))$ is $1$ if $f_{\theta_0,j}(x)-f_{\theta_0,i}(x) > 0$ and $0$ if $f_{\theta_0,j}(x)-f_{\theta_0,i}(x) < 0$. This is equal to Eq. (12) if we are willing to assume that $\Phi(\infty) = 1$, $\Phi(0/0) = 1$, $\Phi(-\infty) = 0$. Similarly, $\Pr(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) > 0) = 1 - \Phi(\cdot) + o(1)$ with the same argument to $\Phi$, so the l.h.s. of the equation in the lemma statement can be rewritten, using the fact that $\Phi(u)(1-\Phi(u))$ is unchanged when the sign of $u$ is flipped, as

$$\int_{B_{r/\sqrt{m}}(F_{\theta_0,i,j})} 2\sqrt{m}\,\Phi\left(\frac{\sqrt{m}\big(f_{\theta_0,i}(x) - f_{\theta_0,j}(x)\big)}{\sqrt{\mathrm{Var}(G_i(x)-G_j(x))}}\right)\left(1 - \Phi\left(\frac{\sqrt{m}\big(f_{\theta_0,i}(x) - f_{\theta_0,j}(x)\big)}{\sqrt{\mathrm{Var}(G_i(x)-G_j(x))}}\right)\right) p(x)\, dx + \int_{B_{r/\sqrt{m}}(F_{\theta_0,i,j})} \sqrt{m}\, o(1)\, p(x)\, dx.$$

The integration of the remainder term can be rewritten as $o(1)$ by Lemma D.2, and we get that the expression can be rewritten as

$$2\sqrt{m}\int_{B_{r/\sqrt{m}}(F_{\theta_0,i,j})} \Phi\left(\frac{\sqrt{m}\big(f_{\theta_0,i}(x) - f_{\theta_0,j}(x)\big)}{\sqrt{\mathrm{Var}(G_i(x)-G_j(x))}}\right)\left(1 - \Phi\left(\frac{\sqrt{m}\big(f_{\theta_0,i}(x) - f_{\theta_0,j}(x)\big)}{\sqrt{\mathrm{Var}(G_i(x)-G_j(x))}}\right)\right) p(x)\, dx + o(1). \qquad (13)$$

One can verify that the expression inside the integral is a continuous function of $x$, by the regularity conditions and the expression for $G(\cdot)$ as proven in Sec. C (namely Eq. (4)). We can therefore apply Lemma D.1 (with $\epsilon = r/\sqrt{m}$), and again take all the remainder terms outside of the integral by Lemma D.2, to get that Eq. (13) can be rewritten as

$$2\sqrt{m}\int_{F_{\theta_0,i,j}}\int_{-r/\sqrt{m}}^{r/\sqrt{m}} \Phi\left(\frac{\sqrt{m}\big(f_{\theta_0,i}(x + y n_x) - f_{\theta_0,j}(x + y n_x)\big)}{\sqrt{\mathrm{Var}(G_i(x + y n_x) - G_j(x + y n_x))}}\right)\left(1 - \Phi\left(\frac{\sqrt{m}\big(f_{\theta_0,i}(x + y n_x) - f_{\theta_0,j}(x + y n_x)\big)}{\sqrt{\mathrm{Var}(G_i(x + y n_x) - G_j(x + y n_x))}}\right)\right) p(x)\, dy\, dx + o(1), \qquad (14)$$

where $n_x$ is a unit normal to $F_{\theta_0,i,j}$ at $x$.

Inspecting Eq. (14), we see that $y$ ranges over an arbitrarily small domain as $m\to\infty$. This suggests that we can rewrite the above using Taylor expansions, which is what we shall do next. Let us assume for a moment that $\mathrm{Var}(G_i(x)-G_j(x)) > 0$ for some point $x \in F_{\theta_0,i,j}$. One can verify that, by the regularity conditions and the expression for $G(\cdot)$ in Eq. (4), the expression

$$\frac{f_{\theta_0,i}(\cdot) - f_{\theta_0,j}(\cdot)}{\sqrt{\mathrm{Var}(G_i(\cdot) - G_j(\cdot))}} \qquad (15)$$

is twice differentiable, with a uniformly bounded second derivative. Therefore, we can rewrite the expression in Eq. (15) as its first-order Taylor expansion around each $x\in F_{\theta_0,i,j}$, plus a remainder term which is uniform for all $x$:

$$\frac{f_{\theta_0,i}(x + y n_x) - f_{\theta_0,j}(x + y n_x)}{\sqrt{\mathrm{Var}(G_i(x + y n_x) - G_j(x + y n_x))}} = \frac{f_{\theta_0,i}(x) - f_{\theta_0,j}(x)}{\sqrt{\mathrm{Var}(G_i(x) - G_j(x))}} + y\,\nabla\!\left(\frac{f_{\theta_0,i} - f_{\theta_0,j}}{\sqrt{\mathrm{Var}(G_i - G_j)}}\right)\!(x)^{\!\top} n_x + O(y^2).$$

Since $f_{\theta_0,i}(x) - f_{\theta_0,j}(x) = 0$ for any $x\in F_{\theta_0,i,j}$, the expression reduces after a simple calculation to

$$\frac{\nabla\big(f_{\theta_0,i}(x) - f_{\theta_0,j}(x)\big)^{\!\top} n_x}{\sqrt{\mathrm{Var}(G_i(x) - G_j(x))}}\, y + O(y^2).$$

Notice that $\nabla(f_{\theta_0,i}(x) - f_{\theta_0,j}(x))$ (the gradient of $f_{\theta_0,i}(x) - f_{\theta_0,j}(x)$) has the same direction as $n_x$ (the normal to the cluster boundary). Therefore, the expression above can be rewritten, up to a sign, as

$$\frac{\big\|\nabla\big(f_{\theta_0,i}(x) - f_{\theta_0,j}(x)\big)\big\|}{\sqrt{\mathrm{Var}(G_i(x) - G_j(x))}}\, y + O(y^2).$$
As a result, denoting

$$s(x) := \frac{\big\|\nabla\big(f_{\theta_0,i}(x) - f_{\theta_0,j}(x)\big)\big\|}{\sqrt{\mathrm{Var}(G_i(x) - G_j(x))}},$$

we have that

$$\Phi\left(\frac{\sqrt{m}\big(f_{\theta_0,i}(x + y n_x) - f_{\theta_0,j}(x + y n_x)\big)}{\sqrt{\mathrm{Var}(G_i(x + y n_x) - G_j(x + y n_x))}}\right)\left(1 - \Phi\left(\frac{\sqrt{m}\big(f_{\theta_0,i}(x + y n_x) - f_{\theta_0,j}(x + y n_x)\big)}{\sqrt{\mathrm{Var}(G_i(x + y n_x) - G_j(x + y n_x))}}\right)\right) \qquad (16)$$

$$= \Phi\big(\sqrt{m}\, s(x)\, y + O(\sqrt{m}\, y^2)\big)\Big(1 - \Phi\big(\sqrt{m}\, s(x)\, y + O(\sqrt{m}\, y^2)\big)\Big) = \Phi\big(\sqrt{m}\, s(x)\, y\big)\Big(1 - \Phi\big(\sqrt{m}\, s(x)\, y\big)\Big) + O(\sqrt{m}\, y^2). \qquad (17)$$

In the preceding development, we have assumed that $\mathrm{Var}(G_i(x)-G_j(x)) > 0$. However, notice that the expressions in Eq. (16) and Eq. (17), without the remainder term, are both equal (to zero) even if $\mathrm{Var}(G_i(x)-G_j(x)) = 0$ (with our previous abuse of notation that $\Phi(-\infty) = 0$, $\Phi(\infty) = 1$). Moreover, since $y$ takes values in $[-r/\sqrt{m}, r/\sqrt{m}]$, the remainder term $O(\sqrt{m}\, y^2)$ is at most $O(\sqrt{m}\, r^2/m) = O(r^2/\sqrt{m})$, so it can be rewritten as $o(1)$ which converges to $0$ as $m\to\infty$. In conclusion, and again using Lemma D.2 to take the remainder terms outside of the integral, we can rewrite Eq. (14) as

$$2\sqrt{m}\int_{F_{\theta_0,i,j}}\int_{-r/\sqrt{m}}^{r/\sqrt{m}} \Phi\big(\sqrt{m}\, s(x)\, y\big)\Big(1 - \Phi\big(\sqrt{m}\, s(x)\, y\big)\Big)\, p(x)\, dy\, dx + o(1). \qquad (18)$$

We now perform a change of variables, letting $z_x = \sqrt{m}\, s(x)\, y$ in the inner integral, and get

$$2\int_{F_{\theta_0,i,j}} \frac{p(x)}{s(x)} \int_{-r s(x)}^{r s(x)} \Phi(z_x)\big(1 - \Phi(z_x)\big)\, dz_x\, dx + o(1),$$

which is equal, by the mean value theorem, to

$$2\int_{F_{\theta_0,i,j}} \frac{p(x)}{s(x)}\, dx \int_{-r s(x_0)}^{r s(x_0)} \Phi(z_{x_0})\big(1 - \Phi(z_{x_0})\big)\, dz_{x_0} + o(1) \qquad (19)$$

for some $x_0 \in F_{\theta_0,i,j}$. By regularity condition 3b, it can be verified that $s(x)$ is positive or infinite for any $x\in F_{\theta_0,i,j}$. As a result, as $r\to\infty$, we have that

$$\int_{-r s(x_0)}^{r s(x_0)} \Phi(z)\big(1 - \Phi(z)\big)\, dz \;\longrightarrow\; \int_{-\infty}^{\infty} \Phi(z)\big(1 - \Phi(z)\big)\, dz = \frac{1}{\sqrt{\pi}},$$

and the convergence to $1/\sqrt{\pi}$ is at a rate of $O(\exp(-r^2))$. Combining this with Eq. (19), and recalling that $p(x)/s(x) = p(x)\sqrt{\mathrm{Var}(G_i(x)-G_j(x))}/\|\nabla(f_{\theta_0,i}(x)-f_{\theta_0,j}(x))\|$, gives us the required result.
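The constant $1/\sqrt{\pi}$ at the end of the proof, and the fact that truncating the integral at $\pm r$ loses only an exponentially small amount (the source of $h(r)$), can be checked directly. A short numerical verification, assuming SciPy is available:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def integrand(z):
    return norm.cdf(z) * (1.0 - norm.cdf(z))

full, _ = quad(integrand, -np.inf, np.inf)
print("integral of Phi(z)(1 - Phi(z)) :", full)
print("1/sqrt(pi)                     :", 1.0 / np.sqrt(np.pi))

# The mass missed by truncating the integral at +-r vanishes exponentially fast in r.
for r in (1.0, 2.0, 3.0, 4.0):
    trunc, _ = quad(integrand, -r, r)
    print(f"r={r:.0f}   mass outside [-r, r] = {full - trunc:.3e}")
```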
Proof of Proposition D.1. We can now turn to prove Proposition D.1 itself. For any $x\in\mathcal{X}$, let $A_x$ be the event (over drawing and clustering a sample pair) that $x$ switched clusters. For any $F_{\theta_0,i,j}$ and sample size $m$, define $F^m_{\theta_0,i,j}$ to be the subset of $F_{\theta_0,i,j}$ which is at a distance of at least $m^{-1/4}$ from any other cluster boundary (with respect to $\theta_0$). Formally,

$$F^m_{\theta_0,i,j} := \Big\{ x \in F_{\theta_0,i,j} : \forall\{i',j'\}\neq\{i,j\},\ \inf_{y\in F_{\theta_0,i',j'}} \|x - y\| \geq m^{-1/4} \Big\}.$$

Letting $S_1, S_2$ be two independent samples of size $m$, we have by Fubini's theorem that

$$\mathbb{E}\, d^m_{\mathcal{D}}\big(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}}(\cup_{i<j} F_{\theta_0,i,j})\big) = \sqrt{m}\,\mathbb{E}_{S_1,S_2}\int_{B_{r/\sqrt{m}}(\cup_{i<j} F_{\theta_0,i,j})} \mathbf{1}(A_x)\, p(x)\, dx = \int_{B_{r/\sqrt{m}}(\cup_{i<j} F_{\theta_0,i,j})} \sqrt{m}\,\Pr(A_x)\, p(x)\, dx$$
$$= \int_{B_{r/\sqrt{m}}(\cup_{i<j} F^m_{\theta_0,i,j})} \sqrt{m}\,\Pr(A_x)\, p(x)\, dx + \int_{B_{r/\sqrt{m}}(\cup_{i<j} (F_{\theta_0,i,j}\setminus F^m_{\theta_0,i,j}))} \sqrt{m}\,\Pr(A_x)\, p(x)\, dx.$$

As to the first integral, notice that each point in $F^m_{\theta_0,i,j}$ is separated from any point in any other $F^m_{\theta_0,i',j'}$ by a distance of at least $2m^{-1/4}$. Therefore, for large enough $m$, the sets $B_{r/\sqrt{m}}(F^m_{\theta_0,i,j})$ are disjoint for the different pairs $i, j$, and we can rewrite the first integral as

$$\sum_{i<j}\int_{B_{r/\sqrt{m}}(F^m_{\theta_0,i,j})} \sqrt{m}\,\Pr(A_x)\, p(x)\, dx. \qquad (20)$$

If only clusters $i$ and $j$ were involved, we would have $\Pr(A_x) = 2\Pr(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) < 0)\Pr(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) > 0)$. This is simply by definition of $A_x$: the probability that under one clustering, based on a random sample, $x$ is more associated with cluster $i$, and that under a second clustering, based on another independent random sample, $x$ is more associated with cluster $j$ (times two, for the two possible orderings). In general, we will have more than two clusters. However, notice that any point $x$ in $B_{r/\sqrt{m}}(F^m_{\theta_0,i,j})$ (for some $i, j$) is much closer to $F_{\theta_0,i,j}$ than to any other cluster boundary. This is because its distance to $F_{\theta_0,i,j}$ is on the order of $1/\sqrt{m}$, while its distance to any other boundary is on the order of $m^{-1/4}$. Therefore, if $x$ does switch clusters, then it is highly likely to switch between cluster $i$ and cluster $j$. Formally, by regularity condition 3d (which ensures that the cluster boundaries experience at most $O(1/\sqrt{m})$ fluctuations), we have that uniformly for any such $x$,

$$\Pr(A_x) = 2\Pr\big(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) < 0\big)\Pr\big(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) > 0\big) + o(1),$$

where $o(1)$ converges to $0$ as $m\to\infty$. Substituting this back into Eq. (20), using Lemma D.2 to take the remainder term outside the integral, and using regularity condition 3c in the reverse direction to transform integrals over $F^m_{\theta_0,i,j}$ back into integrals over $F_{\theta_0,i,j}$ with asymptotically negligible remainder terms (the second integral above is absorbed into these remainder terms), we get that the quantity we are interested in can be written as

$$\sum_{i<j}\int_{B_{r/\sqrt{m}}(F_{\theta_0,i,j})} 2\sqrt{m}\,\Pr\big(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) < 0\big)\Pr\big(f_{\hat{\theta},i}(x) - f_{\hat{\theta},j}(x) > 0\big)\, p(x)\, dx + o(1).$$

Applying Lemma D.5 to each summand gives the required result.

D.3 Part 2: Proof of Thm. 1

In the remainder of the proof, we will use the shorthand $d^m_{\mathcal{D}}(r) := d^m_{\mathcal{D}}(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}}(\cup_{i<j} F_{\theta_0,i,j}))$. We need the following variant of Hoeffding's bound, adapted to conditional probabilities.

Lemma D.6. Fix some $r > 0$. Let $X_1,\dots,X_q$ be real, nonnegative, independent and identically distributed random variables, such that $\Pr(X_1\in[0,r]) > 0$. For any $X_i$, let $Y_i$ be a random variable on the same probability space, such that $\Pr(Y_i = X_i \mid X_i\in[0,r]) = 1$. Then for any $\epsilon > 0$,

$$\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q X_i - \mathbb{E}[Y_1 \mid X_1\in[0,r]]\right| \geq \epsilon \;\Big|\; \forall i,\ X_i\in[0,r]\right) \leq 2\exp\left(-\frac{2q\epsilon^2}{r^2}\right).$$

Proof. Define an auxiliary set of independent random variables $Z_1,\dots,Z_q$, such that $\Pr(Z_i\leq a) = \Pr(X_i\leq a \mid X_i\in[0,r])$ for any $i, a$. In words, $X_i$ and $Z_i$ have the same distribution conditioned on the event $X_i\in[0,r]$. Also, we have that $Y_i$ has the same distribution as $X_i$ conditioned on $X_i\in[0,r]$. Therefore, $\mathbb{E}[Y_1\mid X_1\in[0,r]] = \mathbb{E}[X_1\mid X_1\in[0,r]]$, and as a result $\mathbb{E}[Y_1\mid X_1\in[0,r]] = \mathbb{E}[Z_1]$. Therefore, the probability in the lemma above can be written as

$$\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q Z_i - \mathbb{E}[Z_i]\right| \geq \epsilon\right),$$

where the $Z_i$ are bounded in $[0,r]$ with probability $1$. Applying the regular Hoeffding bound gives us the required result.
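Lemma D.6 reduces the conditional deviation probability to an ordinary Hoeffding bound for the auxiliary variables $Z_i$. The following check is our own toy example (with $X_i$ exponential and $r = 2$, sampling the conditional law directly by inverting the truncated CDF); it compares the empirical conditional deviation probability with the bound of the lemma.

```python
import numpy as np

rng = np.random.default_rng(3)
q, r, eps, trials = 50, 2.0, 0.25, 20000

# Z_i: i.i.d. draws from Exp(1) conditioned on [0, r], generated by inverting
# the truncated CDF -- exactly the auxiliary variables used in the proof above.
U = rng.uniform(size=(trials, q))
Z = -np.log(1.0 - U * (1.0 - np.exp(-r)))
mu_cond = 1.0 - r * np.exp(-r) / (1.0 - np.exp(-r))    # E[X | X in [0, r]] for Exp(1)

empirical = np.mean(np.abs(Z.mean(axis=1) - mu_cond) >= eps)
print("empirical conditional deviation probability:", empirical)
print("Lemma D.6 bound 2*exp(-2*q*eps^2 / r^2)    :", 2.0 * np.exp(-2.0 * q * eps ** 2 / r ** 2))
```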
We now turn to the proof of the theorem. Let $A^m_r$ be the event that for all subsample pairs $\{S^1_i, S^2_i\}$,

$$d^m_{\mathcal{D}}\big(A_k(S^1_i), A_k(S^2_i), B_{r/\sqrt{m}}(\cup_{i<j} F_{\theta_0,i,j})\big) = d^m_{\mathcal{D}}\big(A_k(S^1_i), A_k(S^2_i)\big).$$

Namely, this is the event that for all subsample pairs, the mass which switches clusters when we compare the two resulting clusterings is always in an $r/\sqrt{m}$-neighborhood of the limit cluster boundaries.

Since $p(\cdot)$ is bounded, we have that $d^m_{\mathcal{D}}(r)$ is deterministically bounded by $O(r)$, with implicit constants depending only on $\mathcal{D}$ and $\theta_0$. Using the law of total expectation, this implies that

$$\big|\mathbb{E}[d^m_{\mathcal{D}}(r)] - \mathbb{E}[d^m_{\mathcal{D}}(r)\mid A^m_r]\big| = \big|\Pr(A^m_r)\,\mathbb{E}[d^m_{\mathcal{D}}(r)\mid A^m_r] + (1-\Pr(A^m_r))\,\mathbb{E}[d^m_{\mathcal{D}}(r)\mid\neg A^m_r] - \mathbb{E}[d^m_{\mathcal{D}}(r)\mid A^m_r]\big|$$
$$\leq (1 - \Pr(A^m_r))\,\big|\mathbb{E}[d^m_{\mathcal{D}}(r)\mid\neg A^m_r] - \mathbb{E}[d^m_{\mathcal{D}}(r)\mid A^m_r]\big| \leq (1 - \Pr(A^m_r))\, O(r). \qquad (21)$$

For any two events $A, B$, we have by the law of total probability that $\Pr(A) = \Pr(B)\Pr(A\mid B) + \Pr(B^c)\Pr(A\mid B^c)$. From this it follows that $\Pr(A) \leq \Pr(B) + \Pr(A\mid B^c)$. As a result, for any $\delta > 0$,

$$\Pr\Big(\big|\sqrt{m}\,\widehat{\mathrm{instab}}_{k,q} - \mathrm{instab}(A_k,\mathcal{D})\big| > \delta\Big) \leq \Pr\left(\left|\frac{1}{q}\sum_{i=1}^q d^m_{\mathcal{D}}\big(A_k(S^1_i), A_k(S^2_i)\big) - \mathrm{instab}(A_k,\mathcal{D})\right| > \frac{\delta}{2}\right)$$
$$+ \Pr\left(\big|\sqrt{m}\,\widehat{\mathrm{instab}}_{k,q} - \mathrm{instab}(A_k,\mathcal{D})\big| > \delta \;\Big|\; \left|\frac{1}{q}\sum_{i=1}^q d^m_{\mathcal{D}}\big(A_k(S^1_i), A_k(S^2_i)\big) - \mathrm{instab}(A_k,\mathcal{D})\right| \leq \frac{\delta}{2}\right). \qquad (22)$$

We will assume w.l.o.g. that $\delta/2 < \mathrm{instab}(A_k,\mathcal{D})$. Otherwise, we can upper bound $\Pr(|\sqrt{m}\,\widehat{\mathrm{instab}}_{k,q} - \mathrm{instab}(A_k,\mathcal{D})| > \delta)$ in the equation above by replacing $\delta$ with some smaller quantity $\delta'$ for which $\delta'/2 < \mathrm{instab}(A_k,\mathcal{D})$.

We start by analyzing the conditional probability forming the second summand in Eq. (22). Recall that $\widehat{\mathrm{instab}}_{k,q}$, after clustering the $q$ subsample pairs $\{S^1_i, S^2_i\}_{i=1}^q$, uses an additional i.i.d. sample $S^3$ of size $m$ to empirically estimate $\sum_{i=1}^q d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i))/(\sqrt{m}\, q) \in [0,1]$. This is achieved by calculating the average percentage of instances in $S^3$ which switch between clusterings. Thus, conditioned on the event appearing in the second summand of Eq. (22), $\widehat{\mathrm{instab}}_{k,q}$ is simply an empirical average of $m$ i.i.d. random variables in $[0,1]$, whose expected value, denoted as $v$, is a strictly positive number in the range $(\mathrm{instab}(A_k,\mathcal{D}) \pm \delta/2)/\sqrt{m}$. Thus, the second summand of Eq. (22) refers to an event where this empirical average is at a distance of at least $\delta/(2\sqrt{m})$ from its expected value. We can therefore apply a large deviation result to bound this probability. Since the expectation itself is a (generally decreasing) function of the sample size $m$, we will need something a bit stronger than the regular Hoeffding bound. Using a relative entropy version of Hoeffding's bound [5], we have that the second summand in Eq. (22) is upper bounded by

$$\max\left\{ \exp\left(-m\, D_{kl}\!\left[\, v + \frac{\delta}{2\sqrt{m}} \,\Big\|\, v \right]\right),\ \exp\left(-m\, D_{kl}\!\left[\, v - \frac{\delta}{2\sqrt{m}} \,\Big\|\, v \right]\right) \right\}, \qquad (23)$$

where $D_{kl}[p\|q] := p\log(p/q) + (1-p)\log((1-p)/(1-q))$ for any $q\in(0,1)$ and any $p\in[0,1]$. Using the fact that $D_{kl}[p\|q] \geq (p-q)^2/(2\max\{p,q\})$, we get that Eq. (23) can be upper bounded by a quantity which converges to $0$ as $m\to\infty$. As a result, the second summand in Eq. (22) converges to $0$ as $m\to\infty$.

As to the first summand in Eq. (22), using the triangle inequality and switching sides allows us to upper bound it by

$$\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q d^m_{\mathcal{D}}\big(A_k(S^1_i), A_k(S^2_i)\big) - \mathbb{E}[d^m_{\mathcal{D}}(r)\mid A^m_r]\right| \geq \frac{\delta}{2} - \big|\mathbb{E}[d^m_{\mathcal{D}}(r)\mid A^m_r] - \mathbb{E}[d^m_{\mathcal{D}}(r)]\big| - \big|\mathbb{E}[d^m_{\mathcal{D}}(r)] - \mathrm{instab}(A_k,\mathcal{D})\big|\right). \qquad (24)$$

By the definition of $\mathrm{instab}(A_k,\mathcal{D})$ as appearing in Thm. 1, and Proposition D.1,

$$\Big|\lim_{m\to\infty}\mathbb{E}\, d^m_{\mathcal{D}}(r) - \mathrm{instab}(A_k,\mathcal{D})\Big| = O(h(r)) = O(\exp(-r^2)). \qquad (25)$$

Using Eq. (25) and Eq. (21), we can upper bound Eq. (24) by

$$\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q d^m_{\mathcal{D}}\big(A_k(S^1_i), A_k(S^2_i)\big) - \mathbb{E}[d^m_{\mathcal{D}}(r)\mid A^m_r]\right| \geq \frac{\delta}{2} - (1-\Pr(A^m_r))O(r) - O(\exp(-r^2)) - o(1)\right), \qquad (26)$$

where $o(1)\to 0$ as $m\to\infty$. Moreover, by using the law of total probability and Lemma D.6, we have that for any $t > 0$,

$$\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q d^m_{\mathcal{D}}\big(A_k(S^1_i), A_k(S^2_i)\big) - \mathbb{E}[d^m_{\mathcal{D}}(r)\mid A^m_r]\right| \geq t\right)$$
$$\leq (1 - \Pr(A^m_r)) + \Pr(A^m_r)\,\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q d^m_{\mathcal{D}}\big(A_k(S^1_i), A_k(S^2_i)\big) - \mathbb{E}[d^m_{\mathcal{D}}(r)\mid A^m_r]\right| \geq t \;\Big|\; A^m_r\right)$$
$$\leq (1 - \Pr(A^m_r)) + 2\Pr(A^m_r)\exp\left(-\frac{2qt^2}{r^2}\right). \qquad (27)$$

Lemma D.6 can be applied because $d^m_{\mathcal{D}}(A_k(S^1_i), A_k(S^2_i)) = d^m_{\mathcal{D}}(r)$ for any $i$ if $A^m_r$ occurs, and the latter quantity is deterministically bounded by $O(r)$. If $m, r$ are such that

$$\frac{\delta}{2} - (1-\Pr(A^m_r))O(r) - O(\exp(-r^2)) - o(1) > 0, \qquad (28)$$

we can substitute this expression for $t$ in Eq. (27), and get that Eq. (26) is upper bounded by

$$(1 - \Pr(A^m_r)) + 2\Pr(A^m_r)\exp\left(-\frac{2q\big(\frac{\delta}{2} - (1-\Pr(A^m_r))O(r) - O(\exp(-r^2)) - o(1)\big)^2}{r^2}\right). \qquad (29)$$

Let

$$g_m(r) := \Pr_{S_1,S_2\sim\mathcal{D}^m}\Big(d^m_{\mathcal{D}}(r) \neq d^m_{\mathcal{D}}\big(A_k(S_1), A_k(S_2)\big)\Big), \qquad g(r) := \lim_{m\to\infty} g_m(r).$$

By regularity condition 3d, $g(r) = O(r^{-3-\epsilon})$ for some $\epsilon > 0$. Also, we have that $\Pr(A^m_r) = (1 - g_m(r))^q$, and therefore $\lim_{m\to\infty}\Pr(A^m_r) = (1 - g(r))^q$ for any fixed $q$. In consequence, as $m\to\infty$, Eq. (29) converges to

$$\big(1 - (1-g(r))^q\big) + 2(1-g(r))^q\exp\left(-\frac{2q\big(\frac{\delta}{2} - (1-(1-g(r))^q)O(r) - O(\exp(-r^2))\big)^2}{r^2}\right). \qquad (30)$$

Now we use the fact that $r$ can be chosen arbitrarily. In particular, let $r = q^{1/(2+\epsilon/2)}$, where $\epsilon > 0$ is the same quantity appearing in condition 3d. It follows that

$$1 - (1-g(r))^q \leq q\, g(r) = O(q/r^{3+\epsilon}) = O\big(q^{1-\frac{3+\epsilon}{2+\epsilon/2}}\big),$$
$$\big(1 - (1-g(r))^q\big)\, O(r) \leq q\, g(r)\, O(r) = O(q/r^{2+\epsilon}) = O\big(q^{1-\frac{2+\epsilon}{2+\epsilon/2}}\big) = O\big(q^{-\frac{\epsilon}{4+\epsilon}}\big),$$
$$q/r^2 = q^{1-\frac{2}{2+\epsilon/2}} = q^{\frac{\epsilon}{4+\epsilon}}, \qquad \exp(-r^2) = \exp\big(-q^{\frac{2}{2+\epsilon/2}}\big) = \exp\big(-q^{\frac{1}{1+\epsilon/4}}\big).$$

It can be verified that the equations above imply the validity of Eq. (28) for large enough $m$ and $q$ (and hence $r$). Substituting these equations into Eq. (30), we get an upper bound of the form

$$O\big(q^{1-\frac{3+\epsilon}{2+\epsilon/2}}\big) + 2\exp\left(-2\, q^{\frac{\epsilon}{4+\epsilon}}\Big(\frac{\delta}{2} - O\big(q^{-\frac{\epsilon}{4+\epsilon}}\big) - O\big(\exp(-q^{\frac{1}{1+\epsilon/4}})\big)\Big)^2\right).$$

Since $\epsilon > 0$, it can be verified that the first summand above asymptotically dominates the second summand (as $q\to\infty$), and is itself $o(q^{-1/2})$. Summarizing, we have that the first summand in Eq. (22) converges, as $m\to\infty$, to a quantity which is $o(q^{-1/2})$, and the second summand in Eq. (22) converges to $0$ as $m\to\infty$, for any fixed $\delta > 0$. Thus $\Pr(|\sqrt{m}\,\widehat{\mathrm{instab}}_{k,q} - \mathrm{instab}(A_k,\mathcal{D})| > \delta)$ converges, as $m\to\infty$, to a quantity which is $o(q^{-1/2})$, as required.
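The relative-entropy step around Eq. (23) is where the $1/\sqrt{m}$ scale of the conditional mean $v$ matters: an additive deviation of order $\delta/\sqrt{m}$ around a mean of order $1/\sqrt{m}$ still forces $m\,D_{kl}$ to grow without bound. The following sketch (with purely illustrative stand-in constants) evaluates the exponent numerically.

```python
import numpy as np

def dkl(p, q):
    """Binary relative entropy D_kl[p || q], natural logarithm."""
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

c, delta = 0.5, 0.5            # stand-ins for instab(A_k, D) and the accuracy delta
for m in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    v = c / np.sqrt(m)                              # conditional mean of the empirical average
    expo = m * dkl(v + delta / (2.0 * np.sqrt(m)), v)
    print(f"m={m:>7d}   m*D_kl = {expo:8.2f}   bound exp(-m*D_kl) = {np.exp(-expo):.2e}")
```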
E Proof of Thm. 2 and Thm. 3

The tool we shall use for proving Thm. 2 and Thm. 3 is the following general central limit theorem for Z-estimators (Thm. 3.3.1 in [8]). We will first quote the theorem and then explain the terminology used.

Theorem E.1 (Van der Vaart). Let $\Psi_m$ and $\Psi$ be random maps and a fixed map, respectively, from a subset $\Theta$ of some Banach space into another Banach space, such that as $m\to\infty$,

$$\frac{\big\|\sqrt{m}(\Psi_m - \Psi)(\hat{\theta}) - \sqrt{m}(\Psi_m - \Psi)(\theta_0)\big\|}{1 + \sqrt{m}\,\|\hat{\theta} - \theta_0\|} \;\longrightarrow\; 0 \qquad (31)$$

in probability, and such that the sequence $\sqrt{m}(\Psi_m - \Psi)(\theta_0)$ converges in distribution to a tight random element $Z$. Let $\theta\mapsto\Psi(\theta)$ be Fréchet-differentiable at $\theta_0$ with an invertible derivative $\dot{\Psi}_{\theta_0}$, which is assumed to be a continuous linear operator (a linear operator is automatically continuous in finite dimensional spaces, but not necessarily in infinite dimensional spaces). If $\Psi(\theta_0) = 0$ and $\sqrt{m}\,\Psi_m(\hat{\theta}) \to 0$ in probability, and $\hat{\theta}$ converges in probability to $\theta_0$, then $\sqrt{m}(\hat{\theta} - \theta_0)$ converges in distribution to $-\dot{\Psi}_{\theta_0}^{-1} Z$.

A Banach space is any complete normed vector space (possibly infinite dimensional). A tight random element essentially means that an arbitrarily large portion of its distribution lies in compact sets. This condition is trivial when $\Theta$ is a subset of Euclidean space. Fréchet-differentiability of a function $f: U\to V$ at $x\in U$, where $U, V$ are Banach spaces, means that there exists a bounded linear operator $A: U\to V$ such that

$$\lim_{h\to 0}\frac{\|f(x+h) - f(x) - A(h)\|_V}{\|h\|_U} = 0.$$

This is equivalent to regular differentiability in finite dimensional settings. It is important to note that the theorem is stronger than what we actually need, since we only consider finite dimensional Euclidean spaces, while the theorem deals with possibly infinite dimensional Banach spaces. In principle, it is possible to use this theorem to prove central limit theorems in infinite dimensional settings, for example in kernel clustering where the associated reproducing kernel Hilbert space is infinite dimensional. However, the required conditions become much less trivial, and actually fail to hold in some cases (see below for further details).

We now turn to the proofs themselves. Since the proofs of Thm. 2 and Thm. 3 are almost identical, we will prove them together, marking differences between them as needed. In order to allow uniform notation in both cases, we shall assume that $\phi(\cdot)$ is the identity mapping in Bregman divergence clustering, and the feature map from $\mathcal{X}$ to $\mathcal{H}_k$ in kernel clustering. With the assumptions that we made in the theorems, the only thing really left to show before applying Thm. E.1 is that Eq. (31) holds. Notice that it is enough to show that

$$\frac{\big\|\sqrt{m}(\Psi^i_m - \Psi^i)(\hat{\theta}) - \sqrt{m}(\Psi^i_m - \Psi^i)(\theta_0)\big\|}{1 + \sqrt{m}\,\|\hat{\theta} - \theta_0\|} \;\longrightarrow\; 0$$

in probability, for any $i\in\{1,\dots,k\}$. We will prove this in a slightly more complicated way than necessary, which also treats the case of kernel clustering where $\mathcal{H}_k$ is infinite-dimensional. By Lemma 3.3.5 in [8], since $\mathcal{X}$ is bounded, it is sufficient to show that for any $i$, there is some $\delta > 0$ such that

$$\big\{\psi^i_{\theta,h}(\cdot) - \psi^i_{\theta_0,h}(\cdot)\big\}_{\|\theta-\theta_0\|\leq\delta,\ h\in\mathcal{X}}$$

is a Donsker class, where

$$\psi^i_{\theta,h}(x) = \begin{cases} \langle\theta_i - \phi(x),\ \phi(h)\rangle & x\in C_{\theta,i}, \\ 0 & \text{otherwise.}\end{cases}$$

Intuitively, a set of real functions $\{f(\cdot)\}$ from $\mathcal{X}$ (with any probability distribution $\mathcal{D}$) to $\mathbb{R}$ is called Donsker if it satisfies a uniform central limit theorem. Without getting too much into the details, this means that if we sample $m$ elements i.i.d. from $\mathcal{D}$, then $(f(x_1) + \dots + f(x_m) - m\,\mathbb{E}f(x))/\sqrt{m}$ converges in distribution (as $m\to\infty$) to a Gaussian random variable, and the convergence is uniform over all $f(\cdot)$ in the set, in an appropriately defined sense.
We use the fact that if $\mathcal{F}$ and $\mathcal{G}$ are Donsker classes, then so are $\mathcal{F} + \mathcal{G}$ and $\mathcal{F}\cdot\mathcal{G}$ (see examples 2.10.7 and 2.10.8 in [8]). This allows us to reduce the problem to showing that the following three function classes, from $\mathcal{X}$ to $\mathbb{R}$, are Donsker:

$$\big\{\langle\theta_i, \phi(h)\rangle\big\}_{\|\theta-\theta_0\|\leq\delta,\ h\in\mathcal{X}}, \qquad \big\{\langle\phi(\cdot), \phi(h)\rangle\big\}_{h\in\mathcal{X}}, \qquad \big\{\mathbf{1}_{C_{\theta,i}}(\cdot)\big\}_{\|\theta-\theta_0\|\leq\delta}. \qquad (32)$$

Notice that the first class is a set of bounded constant functions, while the third class is a set of indicator functions of all possible clusters. One can now use several tools to show that each class in Eq. (32) is Donsker. For example, consider a class of real functions on a bounded subset of some Euclidean space. By Thm. 8.2.1 in [3] (and its preceding discussion), the class is Donsker if every function in the class is differentiable to a sufficiently high order. This ensures that the first class in Eq. (32) is Donsker, because it is composed of constant functions. As to the second class in Eq. (32), the same holds in the case of Bregman divergence clustering (where $\phi(\cdot)$ is the identity function), because it is then just a set of linear functions. For finite dimensional kernel clustering, it is enough to show that $\{\langle\cdot, \phi(h)\rangle\}_{h\in\mathcal{X}}$ is Donsker (namely, the same class of functions after performing the transformation from $\mathcal{X}$ to $\phi(\mathcal{X})$). This is again a set of linear functions on a subset of the finite dimensional Euclidean space $\mathcal{H}_k$, and so it is Donsker. In infinite dimensional kernel clustering, our class of functions can be written as $\{k(\cdot, h)\}_{h\in\mathcal{X}}$, where $k(\cdot,\cdot)$ is the kernel function, so it is Donsker if the kernel function is differentiable to a sufficiently high order.

The third class in Eq. (32) is more problematic. By Theorem 8.2.15 in [3] (and its preceding discussion), it suffices that the boundary of each possible cluster is composed of a finite number of smooth surfaces (differentiable to a high enough order) in some Euclidean space. In Bregman divergence clustering, the clusters are separated by hyperplanes (see appendix A in [1]), and thus the class is Donsker. The same holds for finite dimensional kernel clustering. This will still be true for infinite dimensional kernel clustering, if we can guarantee that any cluster in any solution $\theta$ close enough to $\theta_0$ in $\Theta$ will have smooth boundaries. Unfortunately, this does not hold in some important cases. For example, universal kernels (such as the Gaussian kernel) are capable of inducing cluster boundaries arbitrarily close in form to any continuous function, and thus our line of attack will not work in such cases. In a sense, this is not too surprising, since these kernels correspond to very 'rich' hypothesis classes, and it is not clear if a precise characterization of their stability properties, via central limit theorems, is at all possible.

Summarizing the above discussion, we have shown that for the settings assumed in our theorems, all three classes in Eq. (32) are Donsker and hence Eq. (31) holds.

We now return to deal with the other ingredients required to apply Thm. E.1. As to the asymptotic distribution of $\sqrt{m}(\Psi_m - \Psi)(\theta_0)$, since $\Psi(\theta_0) = 0$ by assumption, we have that for any $i\in\{1,\dots,k\}$,

$$\sqrt{m}\big(\Psi^i_m - \Psi^i\big)(\theta_0) = \frac{1}{\sqrt{m}}\sum_{j=1}^m \psi^i(\theta_0, x_j), \qquad (33)$$

where $x_1,\dots,x_m$ is the sample by which $\Psi_m$ is defined. The r.h.s. of Eq. (33) is a sum of identically distributed, independent random variables with zero mean, normalized by $\sqrt{m}$. As a result, by the standard central limit theorem, $\sqrt{m}(\Psi^i_m - \Psi^i)(\theta_0)$ converges in distribution to a zero mean Gaussian random vector $Y_i$, with covariance matrix

$$V_i = \int_{C_{\theta_0,i}} p(x)\big(\phi(x) - \theta_{0,i}\big)\big(\phi(x) - \theta_{0,i}\big)^{\!\top} dx.$$

Moreover, it is easily verified that $\mathrm{Cov}(\psi^i(\theta_0, x), \psi^{i'}(\theta_0, x)) = 0$ for any $i\neq i'$. Therefore, $\sqrt{m}(\Psi_m - \Psi)(\theta_0)$ converges in distribution to a zero mean Gaussian random vector, whose covariance matrix $V$ is composed of the $k$ diagonal blocks $(V_1,\dots,V_k)$, all other elements of $V$ being zero. Thus, we can use Thm. E.1 to get that $\sqrt{m}(\hat{\theta} - \theta_0)$ converges in distribution to a zero mean Gaussian random vector of the form $-\dot{\Psi}_{\theta_0}^{-1} Y$, which is a Gaussian random vector with a covariance matrix of the form $\dot{\Psi}_{\theta_0}^{-1} V\, \dot{\Psi}_{\theta_0}^{-1}$.
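As a toy numerical illustration of the Z-estimator central limit behaviour established above, the following simulation is our own one-dimensional 2-means example with plain Lloyd iterations (not a construction from the paper); it checks that $\sqrt{m}(\hat{\theta} - \theta_0)$ has an approximately $m$-independent Gaussian spread.

```python
import numpy as np

rng = np.random.default_rng(2)

def two_means_1d(x, iters=50):
    """Plain 1-D 2-means (Lloyd iterations), initialised at the data quartiles."""
    centers = np.quantile(x, [0.25, 0.75])
    for _ in range(iters):
        assign = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        centers = np.array([x[assign == j].mean() for j in (0, 1)])
    return centers

def draw(m):
    """m points from a balanced mixture of N(-2, 1) and N(2, 1)."""
    locs = np.where(rng.integers(0, 2, size=m) == 0, -2.0, 2.0)
    return rng.normal(locs, 1.0)

theta0 = two_means_1d(draw(2_000_000))        # proxy for the limit solution theta_0
for m in (500, 2000, 8000):
    devs = np.array([two_means_1d(draw(m)) - theta0 for _ in range(200)])
    print(f"m={m:5d}   std of sqrt(m)*(theta_hat - theta_0) = {np.sqrt(m) * devs.std(axis=0)}")
```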
F Proof of Thm. 4

Since our algorithm returns a locally optimal solution with respect to the differentiable log-likelihood function, we can frame it as a Z-estimator of the derivative of the log-likelihood function with respect to the parameters, namely the score function

$$\Psi_m(\hat{\theta}) = \frac{1}{m}\sum_{i=1}^m \frac{\partial}{\partial\hat{\theta}}\log q(x_i\mid\hat{\theta}).$$

This is a random mapping based on the sample $x_1,\dots,x_m$. Similarly, we can define $\Psi(\cdot)$ as the 'asymptotic' score function with respect to the underlying distribution $\mathcal{D}$:

$$\Psi(\hat{\theta}) = \int_{\mathcal{X}} \frac{\partial}{\partial\hat{\theta}}\log q(x\mid\hat{\theta})\, p(x)\, dx.$$

Under the assumptions we have made, the model $\hat{\theta}$ returned by the algorithm satisfies $\Psi_m(\hat{\theta}) = 0$, and $\hat{\theta}$ converges in probability to some $\theta_0$ for which $\Psi(\theta_0) = 0$. The asymptotic normality of $\sqrt{m}(\hat{\theta} - \theta_0)$ is now an immediate consequence of central limit theorems for 'maximum likelihood' Z-estimators, such as Thm. 5.21 in [7].

References

[1] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
[2] P. Billingsley and F. Topsøe. Uniformity in weak convergence. Probability Theory and Related Fields, 7:1–16, 1967.
[3] R. Dudley. Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 1999.
[4] G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Oxford University Press, 2001.
[5] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, March 1963.
[6] R. R. Rao. Relations between weak and uniform convergence of measures with applications. The Annals of Mathematical Statistics, 33(2):659–680, June 1962.
[7] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[8] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.