An Analysis of the Convergence of Graph Laplacians

Daniel Ting (dting@stat.berkeley.edu), Department of Statistics, University of California, Berkeley, CA 94720 USA
Ling Huang (ling.huang@intel.com), Intel Labs Berkeley, 2150 Shattuck Avenue, Berkeley, CA 94704 USA
Michael I. Jordan (jordan@cs.berkeley.edu), Department of EECS and Department of Statistics, University of California, Berkeley, CA 94720 USA

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

Existing approaches to analyzing the asymptotics of graph Laplacians typically assume a well-behaved kernel function with smoothness assumptions. We remove the smoothness assumption and generalize the analysis of graph Laplacians to include previously unstudied graphs, including kNN graphs. We also introduce a kernel-free framework for analyzing graph constructions with shrinking neighborhoods in general, and apply it to analyze locally linear embedding (LLE). We also describe how, for a given limit operator, desirable properties such as a convergent spectrum and sparseness can be achieved by choosing the appropriate graph construction.

1. Introduction

Graph Laplacians have become a core technology throughout machine learning. In particular, they have appeared in clustering (Kannan et al., 2004; von Luxburg et al., 2008), dimensionality reduction (Belkin & Niyogi, 2003; Nadler et al., 2006), and semi-supervised learning (Zhu, 2008).

While graph Laplacians are but one member of a broad class of methods that use local neighborhood graphs to model high-dimensional data lying on a low-dimensional manifold, they are distinguished by their appealing mathematical properties, notably: (1) the graph Laplacian is the infinitesimal generator for a random walk on the graph, and (2) it is a discrete approximation to a weighted Laplace-Beltrami operator on a manifold, an operator which has numerous geometric properties and induces a smoothness functional. These mathematical properties have served as a foundation for the development of a growing theoretical literature that has analyzed learning procedures based on the graph Laplacian.

To review briefly, Bousquet et al. (2003) proved an early result for the convergence of the unnormalized graph Laplacian to a smoothness functional that depends on the squared density p². Belkin & Niyogi (2005) demonstrated the pointwise convergence of the empirical unnormalized Laplacian to the Laplace-Beltrami operator on a compact manifold with uniform density. Lafon (2004) and Nadler et al. (2006) established a connection between graph Laplacians and the infinitesimal generator of a diffusion process. They further showed that one may use the degree operator to control the effect of the density. Hein et al. (2005) combined and generalized these results for weak and pointwise (strong) convergence under weaker assumptions, as well as providing rates for the unnormalized, normalized, and random walk Laplacians. They also made explicit the connections to the weighted Laplace-Beltrami operator. Singer (2006) obtained improved convergence rates for a uniform density. Giné & Koltchinskii (2005) established a uniform convergence result and a functional central limit theorem to extend the pointwise convergence results. von Luxburg et al. (2008) and Belkin & Niyogi (2006) presented convergence results for the eigenvectors of graph Laplacians in the fixed and shrinking bandwidth cases, respectively.

Although this burgeoning literature has provided many useful insights, several gaps remain between theory and practice.
Most notably, in constructing the neighborhood graphs underlying the graph Laplacian, several choices must be made, including the choice of algorithm for constructing the graph, with k-nearest-neighbor (kNN) and kernel functions providing the main alternatives, as well as the choice of parameters (k, kernel bandwidth, normalization weights). These choices can lead to the graph Laplacian generating fundamentally different random walks and approximating different weighted Laplace-Beltrami operators. The existing theory has focused on one specific choice in which graphs are generated with smooth kernels with shrinking bandwidths. But a variety of other choices are often made in practice, including kNN graphs, r-neighborhood graphs, and the "self-tuning" graphs of Zelnik-Manor & Perona (2004). Surprisingly, few of the existing convergence results apply to these choices (see Maier et al. (2008) for an exception).

This paper provides a general theoretical framework for analyzing graph Laplacians and operators that behave like Laplacians. Our point of view differs from that found in the existing literature; specifically, our point of departure is a stochastic process framework that utilizes the characterization of diffusion processes via drift and diffusion terms. This yields a general kernel-free framework for analyzing graph Laplacians with shrinking neighborhoods. We use it to extend the pointwise results of Hein et al. (2007) to cover non-smooth kernels and introduce location-dependent bandwidths. Applying these tools we are able to identify the asymptotic limit for a variety of graph constructions, including kNN, r-neighborhood, and "self-tuning" graphs. We are also able to provide an analysis for Locally Linear Embedding (Roweis & Saul, 2000).
A practical motivation for our interest in graph Laplacians based on kNN graphs is that these can be significantly sparser than those constructed using kernels, even if they have the same limit. Our framework allows us to establish this limiting equivalence. On the other hand, we can also exhibit cases in which kNN graphs converge to a different limit than graphs constructed from kernels, thereby explaining some cases where kNN graphs perform poorly. Moreover, our framework allows us to generate new algorithms: in particular, by using location-dependent bandwidths we obtain a class of operators that have nice spectral convergence properties that parallel those of the normalized Laplacian in von Luxburg et al. (2008), but which converge to a different class of limits.

2. The Framework

Our work exploits the connections among diffusion processes, elliptic operators (in particular the weighted Laplace-Beltrami operator), and stochastic differential equations (SDEs). This builds upon the diffusion process viewpoint in Nadler et al. (2006). Critically, we make the connection to the drift and diffusion terms of a diffusion process. This allows us to present a kernel-free framework for the analysis of graph Laplacians, as well as giving a better intuitive understanding of the limit diffusion process. We first give a brief overview of these connections and present our general framework for the asymptotic analysis of graph Laplacians, as well as providing some relevant background material. We then introduce our assumptions and derive our main results on the limit operator for a wide range of graph construction methods. We use these to calculate asymptotic limits for some specific graph constructions.

2.1. Relevant Differential Geometry

Assume M is a smooth m-dimensional manifold embedded in R^b, the extrinsic coordinate system. To identify the asymptotic infinitesimal generator of a diffusion process on this manifold, we will derive the drift and diffusion terms in normal coordinates at each point. We refer the reader to Boothby (1986) for an exact definition of normal coordinates. For our purposes it suffices to note that the normal coordinates are coordinates in R^m that behave roughly as if a neighborhood of x was projected onto the tangent plane at x.

To link the extrinsic coordinates of a point y in a neighborhood of x with the normal coordinates s, we have the relation

    y − x = H_x s + L_x(ss^T) + O(‖s‖³),    (1)

where H_x is a linear isomorphism between the normal coordinates in R^m and the m-dimensional tangent plane T_x at x, and L_x is a linear operator describing the curvature of the manifold, which takes m × m positive semidefinite matrices into the space orthogonal to the tangent plane, T_x^⊥. More advanced readers will note that this statement is Gauss' lemma, and that H_x and L_x are related to the first and second fundamental forms.

We are most interested in limits involving the operator defined below:

Definition 1 (Weighted Laplace-Beltrami operator). Given a smooth manifold M, the weighted Laplace-Beltrami operator with respect to the density q is the second-order differential operator Δ_q := Δ_M − (∇q/q)^T ∇, where Δ_M := div ∘ ∇ is the unweighted Laplace-Beltrami operator. For functions f ∈ C∞(M) with support contained in the interior of the manifold, it induces a smoothness functional by the relationship

    ⟨f, Δ_q f⟩_{L²(q)} = ‖∇f‖²_{L²(q)}.    (2)

2.2. Equivalence of Limiting Characterizations

We now establish the connections among elliptic operators, diffusions, SDEs, and graph Laplacians. We first show that elliptic operators define diffusion processes and SDEs, and vice versa. An elliptic operator G is a second-order differential operator of the form

    Gf(x) = Σ_{ij} a_{ij}(x) ∂²f(x)/(∂x_i ∂x_j) + Σ_i b_i(x) ∂f(x)/∂x_i + c(x)f(x),

where the m × m coefficient matrix (a_{ij}(x)) is positive semidefinite for all x. If we use normal coordinates for a manifold, we see that the weighted Laplace-Beltrami operator Δ_q is a special case of an elliptic operator with (a_{ij}(x)) = I, the identity matrix, b(x) = −∇q(x)/q(x), and c(x) = 0.

Diffusion processes are related via a result by Dynkin which states that, given a diffusion process, the generator of the process is an elliptic operator. The (infinitesimal) generator G of a diffusion process X_t is defined as

    Gf(x) := lim_{t→0} (E_x f(X_t) − f(x)) / t

when the limit exists and convergence is uniform over x. Here E_x f(X_t) = E(f(X_t) | X_0 = x). A converse relation holds as well: the Hille-Yosida theorem characterizes when a linear operator, such as an elliptic operator, is the generator of a stochastic process. We refer the reader to Kallenberg (2002) for details.

A time-homogeneous stochastic differential equation (SDE) defines a diffusion process as a solution (when one exists) to the equation

    dX_t = μ(X_t)dt + σ(X_t)dW_t,

where X_t is a diffusion process taking values in R^d. The terms μ(x) and σ(x)σ(x)^T are the drift and diffusion terms of the process. By Dynkin's result, the generator G of this process is an elliptic operator, and a simple calculation shows the operator is

    Gf(x) = (1/2) Σ_{ij} (σ(x)σ(x)^T)_{ij} ∂²f(x)/(∂x_i ∂x_j) + Σ_i μ_i(x) ∂f(x)/∂x_i.

All that remains, then, is to connect diffusion processes in continuous space to graph Laplacians on a finite set of points. Diffusion approximation theorems provide this connection. We state one version of such a theorem, which may be derived from Theorems 1.2.4, 1.6.3, and 7.4.2 in Ethier & Kurtz (1986).

Theorem 2 (Diffusion Approximation). Let μ(x) and σ(x)σ(x)^T be drift and diffusion terms for a diffusion process defined on a compact set S ⊂ R^b, and let G be the corresponding infinitesimal generator. Let {Y_t^{(n)}}_t be Markov chains with transition matrices P_n on state spaces {x_i}_{i=1}^n for all n, and let c_n define a sequence of scalings. Put

    μ̂_n(x_i) = c_n E(Y_1^{(n)} − x_i | Y_0^{(n)} = x_i),
    σ̂_n(x_i)σ̂_n(x_i)^T = c_n Var(Y_1^{(n)} | Y_0^{(n)} = x_i).

Let f ∈ C³(S). If for all ε > 0

    sup_{i≤n} ‖μ̂_n(x_i) − μ(x_i)‖ → 0,
    sup_{i≤n} ‖σ̂_n(x_i)σ̂_n(x_i)^T − σ(x_i)σ(x_i)^T‖ → 0,
    c_n sup_{i≤n} P(‖Y_1^{(n)} − x_i‖ > ε | Y_0^{(n)} = x_i) → 0,

then for the generators A_n = c_n(P_n − I), we have A_n f → Gf.

We remark that though the results we have discussed thus far are stated in the context of the extrinsic coordinates R^b, we describe appropriate extensions in terms of normal coordinates in Ting et al. (2010).

2.3. Our Assumptions

We describe here the assumptions and notation for the rest of the paper. We will refer to the following assumptions as the standard assumptions. Assume M is a smooth m-dimensional manifold isometrically embedded in R^b. We further assume for simplicity that the manifold is compact without boundary, but describe weaker conditions in Ting et al. (2010). Unless stated explicitly otherwise, let f be an arbitrary function in C³(M). Assume points {x_i}_{i=1}^∞ are sampled i.i.d. from a density p ∈ C³(M) with respect to the natural volume element of the manifold, and assume that p is bounded away from zero.

For brevity, we will always use x, y ∈ R^b to be points on M expressed in extrinsic coordinates, and let s ∈ R^m denote the normal coordinates for y in a neighborhood centered at x. Since they represent the same point, we will also use y and s interchangeably as function arguments; i.e., f(y) = f(s). Whenever we take a gradient, it is with respect to the normal coordinates.

We define what we mean by convergence. When we write g_n → g where domain(g_n) = X_n ⊂ M, we mean ‖g_n − π_n g‖_∞ → 0, where π_n g = g|_{X_n} is the restriction of g to X_n. For operators T_n on functions with domain X_n, we take T_n g = T_n π_n g. Convergence of operators T_n → T means T_n f → T f for all f ∈ C³(M).

We now introduce our assumptions on the graph construction methods. Let K_0 : R_+ → R_+ be a base kernel with bounded variation and compact support.
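The quantities in Theorem 2 can be computed directly for a concrete graph. The following sketch (our own illustration, not from the paper; the sample size, bandwidth, and Gaussian kernel are arbitrary choices) builds the random-walk transition matrix for uniform samples on [0, 1] and checks that, with scaling c_n = 1/h², the scaled drift is near zero away from the boundary (since ∇p/p = 0 for a uniform density) while the scaled second moment is roughly constant:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 3000, 0.1
x = np.sort(rng.uniform(0.0, 1.0, n))   # i.i.d. samples from a uniform density p

# Gaussian-kernel weight matrix and random-walk transition matrix P_n
S = x[None, :] - x[:, None]             # S[i, j] = x_j - x_i, one-step displacement
W = np.exp(-S**2 / (2.0 * h**2))
P = W / W.sum(axis=1, keepdims=True)

# Scaled one-step moments from Theorem 2, with scaling c_n = 1 / h^2
c = 1.0 / h**2
drift = c * (P * S).sum(axis=1)         # estimate of mu-hat_n(x_i)
second = c * (P * S**2).sum(axis=1)     # scaled second moment (~ variance here)

# Away from the boundary, the drift is small and the diffusion term is nearly
# constant (~1 for a standard Gaussian kernel with bandwidth h)
interior = (x > 0.3) & (x < 0.7)
```

The same computation with a non-uniform sampling density would show a nonzero drift proportional to ∇p/p, which is the mechanism behind the density-dependent limits discussed below.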
Let h_n be a sequence of bandwidth scalings, and let r_x^{(n)}(·) > 0 and w_x^{(n)}(·) ≥ 0 be (possibly random) location-dependent bandwidth and weight functions that converge to r_x(·) and w_x(·), respectively, and have Taylor-like expansions for all x, y ∈ M with ‖x − y‖ < h_n:

    r_x^{(n)}(y) = r_x(x) + (∇r_x(x) + α_x sign(u_x^T s) u_x)^T s + ε_r^{(n)}(x, s),
    w_x^{(n)}(y) = w_x(x) + ∇w_x(x)^T s + ε_w^{(n)}(x, s),

where the approximation error is uniformly bounded:

    sup_{x∈M, s∈R^m} |ε_i^{(n)}(x, s)| = o(h_n²) for i = r, w.

We sketch the proof here and present the details in Ting et al. (2010). First, assuming we are given the true density p and the limit weight and bandwidth functions, we calculate the limits assuming K_0 is an indicator kernel. To generalize to kernels of bounded variation and compact support, note that

    K_0(x) = ∫ I(|x| < z) dλ⁺(z) − ∫ I(|x| < z) dλ⁻(z)

for some finite positive measures λ⁻, λ⁺ with compact support. The result for general kernels then follows from Fubini's theorem. The key calculation is establishing that integrating against an indicator kernel is like integrating over a sphere re-centered on h_n²∇r_x(x). Given this calculation, and by Taylor expanding the non-kernel terms, one obtains the infinitesimal first and second moments and the degree operator.

In the ideal case the weights exactly reconstruct each point from its neighbors and the reconstruction error is zero. We will assume this and analyze the matrix M = W − I. Suppose LLE produces a sequence of matrices M_n = I − W_n. The row sums of each matrix are zero. Thus, we may decompose M_n = A_n⁺ − A_n⁻, where A_n⁺ and A_n⁻ are infinitesimal generators for finite-state Markov processes obtained from the positive and negative weights, respectively. Assume that there is some scaling c_n such that c_n A_n⁺ and c_n A_n⁻ converge to generators of diffusion processes with drifts μ⁺, μ⁻ and diffusion terms σ⁺(σ⁺)^T, σ⁻(σ⁻)^T. Set μ = μ⁺ − μ⁻ and σσ^T = σ⁺(σ⁺)^T − σ⁻(σ⁻)^T.
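The decomposition of a bounded-variation kernel into indicator kernels, used in the proof sketch above, can be checked numerically. A minimal sketch for a triangular kernel (the kernel and grid are illustrative choices): for a nonincreasing K_0, the measure λ⁺ has density −K_0′ and λ⁻ vanishes.

```python
import numpy as np

# Triangular base kernel K0: bounded variation, compact support [0, 1]
def K0(t):
    return np.maximum(1.0 - np.abs(t), 0.0)

# For this nonincreasing kernel, -K0'(z) = 1 on (0, 1), so
# K0(t) = integral over z in (0, 1) of I(|t| < z) dz
z = np.linspace(0.0, 1.0, 100001)[1:]   # grid for the mixing variable z
dz = z[1] - z[0]
t = np.array([-1.2, -0.75, -0.3, 0.0, 0.4, 0.9, 1.5])
mixture = np.array([np.sum(np.abs(ti) < z) * dz for ti in t])
# mixture approximates K0(t) up to the grid discretization error
```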
Since the reconstruction error in extrinsic coordinates is zero, Eq. (1) forces the drift to vanish and the diffusion term to lie in the null space of the curvature operator L_x. From this we see that: (1) LLE can only behave like an unweighted Laplace-Beltrami operator, since the drift term is zero. (2) LLE is affected by the curvature of the manifold, since σ_s(x)σ_s(x)^T must lie in the null space of L_x. Furthermore, when L_x is full rank then σ_s = 0, and LLE may behave unlike any elliptic operator (including the Laplace-Beltrami operator). (3) LLE has, in general, no well-defined asymptotic limit without additional conditions on the weights, since the diffusion term is free to vary in the null space of L_x.

We note that while the LLE framework of minimizing reconstruction error can yield ill-behaved solutions, practical implementations add a regularization term when constructing the weights. This causes the reconstruction error to be non-zero in general and gives unique solutions for the weights that favor equal weights (and hence asymptotic behavior akin to that of kNN graphs).

4. Experiments

To illustrate the theory, we show how to correct the bad behavior of the kNN Laplacian for a synthetic data set. We also show how our analysis can predict the surprising behavior of LLE.

kNN Laplacian. We first consider an example which almost all non-linear embedding techniques handle well, but where the kNN graph Laplacian performs poorly. Figure 1 shows a 2D manifold embedded in three dimensions together with embeddings obtained using different graph constructions. The theoretical limit of the normalized Laplacian L_knn for a kNN graph is L_knn = (1/p)Δ_1, while the limit for a graph with Gaussian weights is L_gauss = Δ_p. The first two coordinates of each point are drawn from a truncated normal distribution, so the density at the boundary is small and the effect of the 1/p term is substantial. This yields the bad behavior shown in Figure 1(C). We may use Eq. (5) as a pilot estimate p̂ of the density. Choosing w_x(y) = p̂_n(x)p̂_n(y) gives a weighted kNN graph with the same limit as the graph with Gaussian weights.
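Eq. (5) itself is not reproduced in this excerpt; a standard kNN pilot density estimate p̂(x) ∝ k/(n·r_k(x)^m), in the spirit of Loftsgaarden & Quesenberry (1965), serves the same purpose. The sketch below (our own illustration; sample sizes and k are arbitrary) builds the reweighted kNN graph with edge weights p̂(x)p̂(y) and its random-walk Laplacian:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 1000, 10, 2
X = rng.standard_normal((n, m))
X = X[np.max(np.abs(X), axis=1) < 2.0]   # truncated normal sample
n = X.shape[0]

# Pairwise distances; r_k(x) is the distance to the k-th nearest neighbor
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
r_k = np.sort(D, axis=1)[:, k]           # column 0 is the point itself

# Pilot kNN density estimate (up to a constant factor)
p_hat = k / (n * r_k**m)

# Unweighted kNN adjacency, symmetrized, then reweighted by p_hat(x) p_hat(y)
A = np.zeros((n, n))
idx = np.argsort(D, axis=1)[:, 1:k + 1]
A[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
A = np.maximum(A, A.T)
W = A * np.outer(p_hat, p_hat)

# Random-walk Laplacian L = I - D^{-1} W; its rows sum to zero
d = W.sum(axis=1)
L = np.eye(n) - W / d[:, None]
```

The reweighting leaves the graph exactly as sparse as the original kNN graph; only the edge weights change, which is what makes this correction cheap in practice.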
Figure 1(D) shows that this change yields roughly the desired behavior, but with fewer "holes" in low-density regions and more "holes" in high-density regions.

LLE. We consider another synthetic data set, the toroidal helix, in which the manifold structure is easy to recover. Figure 2(A) shows the manifold, which is clearly isomorphic to a circle, a fact picked up by the kNN Laplacian in Figure 2(B). Our theory predicts that the heuristic argument that LLE behaves like the Laplace-Beltrami operator will not hold.

Figure 1. (A) shows a 2D manifold where the x and y coordinates are drawn from a truncated standard normal distribution. (B-D) show embeddings using different graph constructions. (B) uses a normalized Gaussian kernel K(x,y)/(d(x)^{1/2} d(y)^{1/2}), (C) uses a kNN graph, and (D) uses a kNN graph with edge weights p̂(x)p̂(y). The bandwidth for (B) was chosen to be the median standard deviation from taking one step in the kNN graph, in order to have comparable sparsity.

Figure 2. (A) shows a 1D manifold isomorphic to a circle. (B-D) show the embeddings using (B) Laplacian eigenmaps, which correctly identifies the structure, (C) LLE with regularization 1e-3, and (D) LLE with regularization 1e-6.
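The effect of the regularization term on the LLE weights, discussed above, is easy to see directly. A sketch of the standard local weight computation of Roweis & Saul (2000) with a Tikhonov-style regularizer (the function name, neighborhood, and regularization values are illustrative, not from the paper):

```python
import numpy as np

def lle_weights(x, neighbors, reg):
    """LLE reconstruction weights for one point, with regularization `reg`."""
    Z = neighbors - x                     # center the neighborhood on x
    C = Z @ Z.T                           # local Gram matrix (k x k)
    C_reg = C + reg * np.trace(C) * np.eye(len(neighbors))
    w = np.linalg.solve(C_reg, np.ones(len(neighbors)))
    return w / w.sum()                    # weights constrained to sum to one

rng = np.random.default_rng(2)
x = rng.standard_normal(3)
nbrs = x + 0.1 * rng.standard_normal((6, 3))

w_small = lle_weights(x, nbrs, 1e-9)      # nearly unregularized
w_large = lle_weights(x, nbrs, 1e3)       # heavily regularized
# Heavier regularization pushes the weights toward uniform (1/k each),
# consistent with the kNN-like behavior discussed in the text
```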
Since the total dimension of the drift and diffusion terms is two, while the extrinsic coordinates have dimension three, there is forced cancellation of the first- and second-order differential terms, and the operator should behave like the zero operator or include higher-order differentials. In Figures 2(C) and (D), we see that LLE indeed performs poorly, and that the behavior comes closer to that of the zero operator when the regularization term is smaller.

5. Remarks

Convergence rates. We note that one missing element in our analysis is the derivation of convergence rates. For the main theorem, we note that although we have employed a diffusion approximation theorem, it is not in fact necessary to use such a theorem. Since our theorem still uses a kernel (albeit one with much weaker conditions), a virtually identical proof can be obtained by applying a function f and Taylor expanding it. Thus, we believe that convergence rates similar to those in Hein et al. (2007) can be obtained. Also, while our convergence result is stated for the strong operator topology, the same conditions as in Hein et al. should give weak operator convergence.

Eigenvalues/eigenvectors. Eigenvectors of the Laplacian matrix are of particular interest, since they may be used as a set of harmonic basis functions or in spectral clustering. With our more general graph constructions, we can use the theory of compact integral operators to obtain graph Laplacians that (1) have eigenvectors that converge for fixed (non-decreasing) bandwidth scalings, and (2) converge to a limit that is different from that of previously analyzed normalized Laplacians when the bandwidth decreases to zero. In particular, for arbitrary q, g ∈ C³(M) with g bounded away from zero, we may choose w, r such that, if h_n → 0 at an appropriate rate, we obtain limits of the form

    −c_n L_norm^{(n)} f → (q/p) g^{−1/2} Δ_q (g^{−1/2} f),

with corresponding smoothness functional

    ⟨f, (q/p) g^{−1/2} Δ_q (g^{−1/2} f)⟩_{L²(p)} = ‖∇(g^{−1/2} f)‖²_{L²(q)},

and if h_n = h_1 is constant, then the eigenvectors of L_norm^{(n)} converge in the sense given by von Luxburg et al. (2008).

6. Conclusions

We have introduced a general framework that enables us to analyze a wide class of graph Laplacian constructions. Our framework reduces the problem of graph Laplacian analysis to the calculation of a mean and variance (or drift and diffusion) for any graph construction method with positive weights and shrinking neighborhoods. Our main theorem extends existing strong operator convergence results to non-smooth kernels, and introduces a general location-dependent bandwidth function. The analysis of a location-dependent bandwidth function, in particular, significantly extends the family of graph constructions for which an asymptotic limit is known. This family includes the previously unstudied (but commonly used) kNN graph constructions, unweighted r-neighborhood graphs, and "self-tuning" graphs.

Our results also have practical significance for graph construction, as they suggest graph constructions that (1) produce sparser graphs than those constructed with the usual kernel methods, despite having the same asymptotic limit, and (2) in the fixed bandwidth regime, produce unnormalized or normalized Laplacians that have well-behaved spectra but converge to a different class of limit operators than previously studied normalized and unnormalized Laplacians. In particular, this class of limits includes those that induce the smoothness functional ‖∇f‖²_{L²(q)} for almost any density q. The graph constructions may also (3) have better connectivity properties in low-density regions.

Acknowledgements. We would like to thank Martin Wainwright and Bin Yu for their helpful comments, and our anonymous reviewers for the detailed and helpful review.

References

Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.

Belkin, M. and Niyogi, P. Towards a theoretical foundation for Laplacian-based manifold methods. In COLT. Springer, 2005.

Belkin, M. and Niyogi, P. Convergence of Laplacian eigenmaps. In NIPS 19, 2006.

Boothby, W. M. An Introduction to Differentiable Manifolds and Riemannian Geometry. Academic Press, 1986.

Bousquet, O., Chapelle, O., and Hein, M. Measure based regularization. In NIPS 16, 2003.

Devroye, L. P. and Wagner, T. J. The strong uniform consistency of nearest neighbor density estimates. The Annals of Statistics, pp. 536-540, 1977.

Ethier, S. and Kurtz, T. Markov Processes: Characterization and Convergence. Wiley, 1986.

Giné, E. and Koltchinskii, V. Empirical graph Laplacian approximation of Laplace-Beltrami operators: large sample results. In 4th International Conference on High Dimensional Probability, 2005.

Hein, M., Audibert, J.-Y., and von Luxburg, U. From graphs to manifolds - weak and strong pointwise consistency of graph Laplacians. In COLT, 2005.

Hein, M., Audibert, J.-Y., and von Luxburg, U. Graph Laplacians and their convergence on random neighborhood graphs. JMLR, 8:1325-1370, 2007.

Kallenberg, O. Foundations of Modern Probability. Springer Verlag, 2002.

Kannan, R., Vempala, S., and Vetta, A. On clusterings: good, bad and spectral. Journal of the ACM, 51(3):497-515, 2004.

Lafon, S. Diffusion Maps and Geometric Harmonics. PhD thesis, Yale University, CT, 2004.

Loftsgaarden, D. O. and Quesenberry, C. P. A nonparametric estimate of a multivariate density function. Annals of Mathematical Statistics, 36(3):1049-1051, 1965.

Maier, M., von Luxburg, U., and Hein, M. Influence of graph construction on graph-based clustering measures. In NIPS 21, 2008.

Nadler, B., Lafon, S., Coifman, R., and Kevrekidis, I. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Applied and Computational Harmonic Analysis, 2006.

Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323, 2000.

Singer, A. From graph to manifold Laplacian: the convergence rate. Applied and Computational Harmonic Analysis, 21(1):128-134, 2006.

Ting, D., Huang, L., and Jordan, M. An analysis of the convergence of graph Laplacians. Technical report, University of California, Berkeley, 2010.

von Luxburg, U., Belkin, M., and Bousquet, O. Consistency of spectral clustering. Annals of Statistics, 36(2):555-586, 2008.

Zelnik-Manor, L. and Perona, P. Self-tuning spectral clustering. In NIPS 17, 2004.

Zhu, X. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2008.