Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm

Andrew K. Smith, Xiaoming Huo
School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332
andrewsmith81@gmail.com, huo@gatech.edu

Hongyuan Zha
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332
zha@cc.gatech.edu

Abstract

We study the convergence and the rate of convergence of a local manifold learning algorithm, LTSA [13]. The main technical tool is perturbation analysis of the linear invariant subspace that corresponds to the solution of LTSA. We derive a worst-case upper bound on the error of LTSA, which naturally leads to a convergence result. We then derive the rate of convergence of LTSA in a special case.

1 Introduction

Manifold learning (ML) methods have attracted substantial attention due to their demonstrated potential. Many algorithms have been proposed, and some work has appeared that analyzes the performance of these methods. The main contribution of this paper is to establish some asymptotic properties of a local manifold learning algorithm, LTSA [13], as well as to demonstrate some of its limitations. The key idea in the analysis is to treat the solutions computed by LTSA as invariant subspaces of certain matrices and then carry out a matrix perturbation analysis.

Many efficient ML algorithms have been developed, including locally linear embedding (LLE) [6], ISOMAP [9], charting [2], local tangent space alignment (LTSA) [13], Laplacian eigenmaps [1], and Hessian eigenmaps [3]. A common feature of many of these manifold learning algorithms is that their solutions correspond to invariant subspaces, typically the eigenspace associated with the smallest eigenvalues of a kernel or alignment matrix. The exact form of this matrix, of course, depends on the details of the particular algorithm.

We start with LTSA for several reasons. First, in numerical simulations (e.g., using the tools offered by [10]), we find empirically that LTSA performs among the best of the available algorithms. Second, the solution to each step of the LTSA algorithm is an invariant subspace, which makes analysis of its performance more tractable. Third, the similarity between LTSA and several other ML algorithms (e.g., LLE, Laplacian eigenmaps, and Hessian eigenmaps) suggests that our results may generalize. Our hope is that this performance analysis will provide a theoretical foundation for the application of ML algorithms.

The rest of the paper is organized as follows. The problem formulation and background information are presented in Section 2. Perturbation analysis is carried out, and the main theorem (Theorem 3.7) is proved, in Section 3. The rate of convergence in a special case is derived in Section 4. Some discussion of existing work in this area is included in Section 5. Finally, we present concluding remarks in Section 6.

2 Manifold Learning and LTSA

We formulate the manifold learning problem as follows. For a positive integer n, let y_i ∈ ℝ^D, i = 1, 2, ..., n, denote n observations. We assume that there is a mapping f : ℝ^d → ℝ^D which satisfies a set of regularity conditions (detailed in the next subsection). In addition, we require another set of (possibly multivariate) values x_i ∈ ℝ^d, d < D, i = 1, 2, ..., n, such that

    y_i = f(x_i) + ε_i,    i = 1, 2, ..., n,    (1)

where ε_i ∈ ℝ^D denotes a random error. For example, we may assume ε_i ∼ N(0, σ² I_D), i.e., a multivariate normal distribution with mean zero and variance-covariance matrix proportional to the identity matrix.
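To make model (1) concrete, the following minimal sketch (an illustration only: the unit-speed helix f, the sample size, and the noise level are arbitrary choices echoing the example of Section 2.1, not quantities used in the analysis) draws latent coordinates x_i, maps them through a locally isometric f, and adds Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, sigma = 1024, 3, 0.05                      # sample size, ambient dimension, noise level

def f(t):
    """Unit-speed helix embedding an interval into R^3 (hence locally isometric)."""
    return np.column_stack((np.cos(t), np.sin(t), t)) / np.sqrt(2.0)

x = np.sort(rng.uniform(0.0, 4.0 * np.pi, n))    # latent 1-D coordinates x_i (so d = 1)
Y = f(x) + sigma * rng.normal(size=(n, D))       # y_i = f(x_i) + eps_i, eps_i ~ N(0, sigma^2 I_D)
```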
The central questions of manifold learning are: 1) Can we find a set of low-dimensional vectors such that equation (1) holds? 2) What kind of regularity conditions should be imposed on f? 3) Is the model well defined? These questions are the main focus of this paper.

2.1 A Pedagogical Example

[Figure 1 about here: three panels, (a) Embedded Spiral, (b) Noisy Observations, (c) Learned vs. Truth.]

Figure 1: An illustrative example of LTSA in nonparametric dimension reduction. The straight-line pattern in (c) indicates that the underlying parametrization has been approximately recovered.

An illustrative example of dimension reduction that makes our formulation more concrete is given in Figure 1. Subfigure (a) shows the true underlying structure of a toy example, a 1-D spiral. The noiseless observations are equally spaced points on this spiral. In subfigure (b), 1024 noisy observations are generated with multivariate noise satisfying ε_i ∼ N(0, (1/100)·I_3). We then apply LTSA to the noisy observations, using k = 10 nearest neighbors. In subfigure (c), the result from LTSA is compared with the true parametrization. When the underlying parameter is faithfully recovered, one should see a straight line, which is observed to hold approximately in subfigure (c).

2.2 Regularity and Uniqueness of the Mapping f

If the conditions on the mapping f are too general, the model in equation (1) is not well defined. For example, if the mapping f(·) and the point set {x_i} satisfy (1), then so do f(A^{-1}(· − b)) and {Ax_i + b}, where A is an invertible d × d matrix and b is a d-dimensional vector. As is common in the manifold-learning literature, we adopt the following condition on f.

Condition 2.1 (Local Isometry) The mapping f is locally isometric: for any δ > 0 and x_0 in the domain of f, let N_δ(x_0) = {z : ‖z − x_0‖_2 < δ} denote a δ-neighborhood of x_0 in the Euclidean distance. For x ∈ N_δ(x_0), we have ‖f(x) − f(x_0)‖_2 = ‖x − x_0‖_2 + o(‖x − x_0‖_2).

The above condition indicates that, in a local sense, f preserves the Euclidean distance. Let J(f; x_0) denote the Jacobian of f at x_0. We have J(f; x_0) ∈ ℝ^{D×d}, where each column (resp., row) of J(f; x_0) corresponds to a coordinate in the feature (resp., data) space. The above in fact implies the following lemma [13].

Lemma 2.2 The matrix J(f; x_0) is orthonormal for any x_0, i.e., J^T(f; x_0) J(f; x_0) = I_d.

Given the previous condition, model (1) is still not uniquely defined. For example, for any d × d orthogonal matrix O and any d-dimensional vector b, if f(·) and {x_i} satisfy (1) and Condition 2.1, then so do f(O^T(· − b)) and {Ox_i + b}. We can force b to be 0 by imposing the condition that Σ_{i=1}^n x_i = 0. In dimension reduction, we can consider the sets {x_i} and {Ox_i} "invariant," because one is just a rotation of the other. In fact, this invariance coincides with the concept of an "invariant subspace," to be discussed below.

Condition 2.3 (Local Linear Independence Condition) Let Y_i ∈ ℝ^{D×k}, 1 ≤ i ≤ n, denote the matrix whose columns are the ith observation y_i and its k − 1 nearest neighbors. We choose k − 1 neighbors so that the matrix Y_i has k columns; it is generally assumed that d < k. For any 1 ≤ i ≤ n, the rank of Y_i P_k is at least d; in other words, the dth largest singular value of the matrix Y_i P_k is greater than 0.

In the above, we use the projection matrix P_k = I_k − (1/k)·1_k 1_k^T, where I_k is the k × k identity matrix and 1_k is a k-dimensional column vector of ones.
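As a quick numerical illustration of Condition 2.3 (a sketch only; the tolerance standing in for "greater than 0" is a hypothetical choice), one can form P_k explicitly and inspect the dth largest singular value of a centered neighborhood matrix Y_i P_k:

```python
import numpy as np

def centering_projector(k):
    """P_k = I_k - (1/k) 1_k 1_k^T; note that P_k is idempotent: P_k @ P_k = P_k."""
    return np.eye(k) - np.ones((k, k)) / k

def satisfies_condition_2_3(Yi, d, tol=1e-12):
    """Check that the d-th largest singular value of Y_i P_k is positive.

    Yi  : (D, k) matrix whose columns are y_i and its k - 1 nearest neighbors.
    tol : numerical tolerance standing in for "greater than 0".
    """
    k = Yi.shape[1]
    s = np.linalg.svd(Yi @ centering_projector(k), compute_uv=False)  # singular values, descending
    return s[d - 1] > tol
```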
The regularity of the manifold can be characterized by the Hessians of the mapping. Rewrite f(x), for x ∈ ℝ^d, as f(x) = (f_1(x), f_2(x), ..., f_D(x))^T, and let x = (x_1, ..., x_d)^T. For each i, the Hessian of f_i is the d × d matrix with entries

    [H_i(f; x)]_{jk} = ∂²f_i(x) / (∂x_j ∂x_k),    1 ≤ i ≤ D,  1 ≤ j, k ≤ d.

The following condition ensures that f is locally smooth; we impose a bound on all the components of the Hessians.

Condition 2.4 (Regularity of the Manifold) |[H_i(f; x)]_{jk}| ≤ C1 for all i, j, and k, where C1 > 0 is a prescribed constant.

2.3 Solutions as Invariant Subspaces and a Related Metric

We now give a more detailed discussion of invariant subspaces. Let R(X) denote the subspace spanned by the columns of X. Recall that x_i, i = 1, 2, ..., n, are the true low-dimensional representations of the observations; we treat the x_i's as column vectors. Let X = (x_1, x_2, ..., x_n)^T; i.e., the ith row of X corresponds to x_i, 1 ≤ i ≤ n. If the set {Ox_i}, where O is a d × d orthogonal matrix, forms another solution to the dimension reduction problem, we have (Ox_1, Ox_2, ..., Ox_n)^T = X O^T. It is evident that R(X O^T) = R(X). This justifies the invariance that was mentioned earlier. The goal of our performance analysis is to answer the following question: letting ‖tan(·, ·)‖_2 denote the Euclidean norm of the vector of canonical angles between two invariant subspaces ([8, Section I.5]), and letting X and X̂ denote the true and estimated parameters, respectively, how do we evaluate ‖tan(R(X), R(X̂))‖_2?

2.4 LTSA: Local Tangent Space Alignment

We now review LTSA. There are two main steps in the LTSA algorithm [13].

1. The first step is to compute the local representation on the manifold. Recall the projection matrix P_k. It is easy to verify that P_k = P_k · P_k, which is a characteristic of projection matrices. We solve the minimization problem

    min_{Θ, V} ‖Y_i P_k − Θ V‖_F,    where Θ ∈ ℝ^{D×d}, V ∈ ℝ^{d×k}, and V V^T = I_d.

Let V_i denote the optimal V. Then the row vectors of V_i are the d leading right singular vectors of Y_i P_k.

2. The solution of LTSA corresponds to the invariant subspace spanned by the eigenvectors associated with the 2nd through the (d + 1)st smallest eigenvalues of the matrix

    (S_1, ..., S_n) diag(P_k − V_1^T V_1, ..., P_k − V_n^T V_n) (S_1, ..., S_n)^T,    (2)

where S_i ∈ ℝ^{n×k} is a selection matrix such that Y^T S_i = Y_i, with Y = (y_1, y_2, ..., y_n)^T.

As mentioned earlier, the subspace spanned by the eigenvectors associated with the 2nd through the (d + 1)st smallest eigenvalues of the matrix in (2) is an invariant subspace, which will be analyzed using matrix perturbation techniques. We have slightly reformulated the original algorithm of [13] for later analysis.
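The two steps above can be sketched in a few lines of code. This is a minimal illustration of the reformulated algorithm, not the reference implementation of [13]; the k-d tree neighbor search and the dense n × n alignment matrix are simplifying choices, and the function and variable names are ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def ltsa(Y, d, k):
    """Minimal sketch of the reformulated LTSA of Section 2.4.

    Y : (n, D) array of observations (rows are y_i).
    Returns an (n, d) array whose columns span the estimated invariant subspace.
    """
    n, D = Y.shape
    Pk = np.eye(k) - np.ones((k, k)) / k             # centering projector P_k
    _, idx = cKDTree(Y).query(Y, k=k)                # y_i together with its k-1 nearest neighbors
    Phi = np.zeros((n, n))                           # alignment matrix of (2)
    for i in range(n):
        Yi = Y[idx[i]].T                             # D x k local data matrix Y_i
        # local step: rows of V_i are the d leading right singular vectors of Y_i P_k
        _, _, Vt = np.linalg.svd(Yi @ Pk, full_matrices=False)
        Vi = Vt[:d]
        Wi = Pk - Vi.T @ Vi                          # P_k - V_i^T V_i
        Phi[np.ix_(idx[i], idx[i])] += Wi            # accumulate S_i W_i S_i^T
    # eigenvectors for the 2nd through (d+1)st smallest eigenvalues span the solution
    _, U = np.linalg.eigh(Phi)
    return U[:, 1:d + 1]
```

Applied to a noisy one-dimensional curve such as the helix sketched after (1), ltsa(Y, d=1, k=10) should return a single column that, up to sign and scale, traces the true parametrization, reproducing the straight-line pattern of Figure 1(c).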
3 Perturbation Analysis

We now carry out a perturbation analysis of the reformulated version of LTSA. There are two steps: in the local step (Section 3.1), we characterize the deviation of the null spaces of the matrices P_k − V_i^T V_i, i = 1, 2, ..., n; in the global step (Section 3.2), we derive the variation of the null space under global alignment.

3.1 Local Coordinates

Let X be the matrix of true parameters. We define X_i = X^T S_i = (x_1, x_2, ..., x_n) S_i; i.e., the columns of X_i are x_i and those x_j's that correspond to the k − 1 nearest neighbors of y_i. We require a bound on the size of the local neighborhoods defined by the X_i's.

Condition 3.1 (Universal Bound on the Sizes of Neighborhoods) For all i, 1 ≤ i ≤ n, we have τ_i < τ, where τ is a prescribed constant and τ_i is the largest distance between two columns of X_i: τ_i = max_{x_j, x_k} ‖x_j − x_k‖, the maximum being taken over all pairs of columns of X_i.

In this paper, we are interested in the case where τ → 0. We will also need conditions on the local tangent spaces. Let d_min,i (respectively, d_max,i) denote the minimum (respectively, maximum) singular value of X_i P_k, and let

    d_min = min_{1≤i≤n} d_min,i,    d_max = max_{1≤i≤n} d_max,i.

We can bound d_max as d_min ≤ d_max ≤ √k · τ [5].

Condition 3.2 (Local Tangent Space) There exists a constant C2 > 0 such that

    τ ≤ C2 · d_min.    (3)

The above can roughly be thought of as requiring that the local dimension of the manifold remain constant (i.e., the manifold has no singularities). The following condition imposes a global bound on the errors ε_i.

Condition 3.3 (Universal Error Bound) There exists ε̃ > 0 such that, for all i, 1 ≤ i ≤ n, we have ‖y_i − f(x_i)‖ < ε̃. Moreover, we assume ε̃ = o(τ); i.e., ε̃/τ → 0 as τ → 0.

It is reasonable to require that the error bound (ε̃) be smaller than the size of the neighborhood (τ), which is what the above condition reflects.

Within each neighborhood, we give a perturbation bound between the invariant subspace spanned by the true parametrization and the invariant subspace spanned by the singular vectors of the matrix of noisy observations. Let X_i P_k = A_i D_i B_i be the singular value decomposition of the matrix X_i P_k; here A_i ∈ ℝ^{d×d} is orthogonal (A_i A_i^T = I_d), D_i ∈ ℝ^{d×d} is diagonal, and the rows of B_i ∈ ℝ^{d×k} are the right singular vectors corresponding to the largest singular values (B_i B_i^T = I_d). It is not hard to verify that

    B_i = B_i P_k.    (4)

Let Y_i P_k = Â_i D̂_i B̂_i be the singular value decomposition of Y_i P_k, and assume that this is the "thin" decomposition of rank d. We may think of Y_i P_k as a perturbed version of J(f; x_i) X_i P_k. The rows of B̂_i are the eigenvectors of (Y_i P_k)^T (Y_i P_k) corresponding to the d largest eigenvalues. Let R(B_i^T) (respectively, R(B̂_i^T)) denote the invariant subspace spanned by the columns of the matrix B_i^T (respectively, B̂_i^T).

Theorem 3.4 Given the invariant subspaces R(B_i^T) and R(B̂_i^T) defined above, we have

    lim_{τ→0} ‖sin(R(B̂_i^T), R(B_i^T))‖_2 ≤ C3 (ε̃ + C1 τ),

where C3 is a constant that depends on k, D, and C2.

The proof is presented in [5]. The above gives an upper bound on the deviation of the local invariant subspace in step 1 of the modified LTSA. It will be used later to prove a global upper bound.

3.2 Global Alignment

Condition 3.5 (No Overuse of One Observation) There exists a constant C4 such that ‖Σ_{i=1}^n S_i‖_∞ ≤ C4.

Note that we must have C4 ≥ k. The next condition (Condition 3.6) will implicitly give an upper bound on C4. Recall that the quantity ‖Σ_{i=1}^n S_i‖_∞ is the maximum row sum of the absolute values of the entries of Σ_{i=1}^n S_i; its value is equal to the maximum number of nearest-neighbor subsets to which a single observation belongs.

We will derive an upper bound on the angle between the invariant subspace spanned by the result of LTSA and the space spanned by the true parameters.
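Both Theorem 3.4 above and the global bound below are stated in terms of angles between subspaces. As a numerical companion to the metric of Section 2.3 (a sketch; scipy.linalg.subspace_angles is assumed available, and reading ‖tan(·, ·)‖_2 as the Euclidean norm of the tangents of the canonical angles is our interpretation of the notation):

```python
import numpy as np
from scipy.linalg import subspace_angles

def tan_distance(X, X_hat):
    """Euclidean norm of the tangents of the canonical angles between R(X) and R(X_hat).

    X, X_hat : (n, d) matrices of full column rank. The value depends only on the
    column spaces, so it is invariant to the rotation ambiguity of Section 2.2.
    """
    angles = subspace_angles(X, X_hat)           # canonical angles, in radians
    return float(np.linalg.norm(np.tan(angles)))
```

For instance, tan_distance(X, X @ O) is numerically zero for any orthogonal d × d matrix O.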
Given (4), it can be shown that X_i P_k (P_k − B_i^T B_i) (X_i P_k)^T = 0. Recall that X = (x_1, x_2, ..., x_n)^T ∈ ℝ^{n×d}. It is not hard to verify that the columns of (1_n, X) span the (d + 1)-dimensional null space of the matrix

    (S_1, ..., S_n) diag(P_k (I_k − B_1^T B_1) P_k, ..., P_k (I_k − B_n^T B_n) P_k) (S_1, ..., S_n)^T.    (5)

Assume that (1_n/√n, X, X^c) is orthogonal, where X^c ∈ ℝ^{n×(n−1−d)}. Although in our original problem formulation we made no assumption about the x_i's, we can still assume that the columns of X are orthonormal, because we can transform any set of x_i's into an orthonormal set by rescaling the columns and multiplying by an orthogonal matrix. Based on the previous paragraph, we have

    (1_n/√n, X, X^c)^T M_n (1_n/√n, X, X^c) = [ 0_{(d+1)×(d+1)}     0_{(d+1)×(n−d−1)} ]
                                              [ 0_{(n−d−1)×(d+1)}   L_2               ],    (6)

where M_n denotes the matrix in (5) and L_2 = (X^c)^T M_n X^c. Let λ_min denote the minimum singular value (equivalently, eigenvalue) of L_2. We will need the following condition on λ_min.

Condition 3.6 (Appropriateness of Global Dimension) λ_min > 0, and λ_min goes to 0 at a slower rate than ε̃ + (1/2) C1 τ²; i.e., as τ → 0,

    (ε̃ + (1/2) C1 τ²) · ‖Σ_{i=1}^n S_i‖_∞ / λ_min → 0.

As discussed in [12, 11], this condition is actually related to the amount of overlap between the nearest-neighbor sets.

Theorem 3.7 (Main Theorem)

    lim_{τ→0} ‖tan(R(X), R(X̂))‖_2 ≤ C3 (ε̃ + C1 τ) · ‖Σ_{i=1}^n S_i‖_∞ / λ_min.    (7)

As mentioned in the Introduction, the above theorem gives a worst-case bound on the performance of LTSA. For the proof, as well as a discussion of the requirement that τ → 0, see [7]. A discussion of when Condition 3.6 is satisfied would be lengthy and is beyond the scope of this paper; we leave it to future investigation. We refer to [5] for some simulation results related to the above analysis.
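A small simulation in the spirit of those reported in [5] and [7] can be sketched as follows. scikit-learn's LocallyLinearEmbedding with method='ltsa' is used here as a convenient stand-in for the algorithm analyzed above, and the helix, the neighborhood size k = 10, and the shrinking noise levels are arbitrary illustrative choices.

```python
import numpy as np
from scipy.linalg import subspace_angles
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
n, k = 1024, 10
t = np.sort(rng.uniform(0.0, 4.0 * np.pi, n))                         # true 1-D parameters
curve = np.column_stack((np.cos(t), np.sin(t), t)) / np.sqrt(2.0)     # unit-speed helix
X = (t - t.mean()).reshape(-1, 1)                                     # centered true coordinates

for sigma in (0.02, 0.005, 0.001):                                    # shrinking noise levels
    Y = curve + sigma * rng.normal(size=curve.shape)                  # observations per model (1)
    emb = LocallyLinearEmbedding(n_neighbors=k, n_components=1,
                                 method='ltsa', eigen_solver='dense')
    X_hat = emb.fit_transform(Y)                                      # estimated coordinates
    print(sigma, np.linalg.norm(np.tan(subspace_angles(X, X_hat))))
```

The reported angle should decrease as the noise level shrinks, in line with the qualitative message of Theorem 3.7; the theorem itself, of course, only provides a worst-case upper bound.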
4 A Preliminary Result on the Rate of Convergence

We now discuss the rate of convergence of LTSA (to the true underlying manifold structure) in the aforementioned framework. We modify the LTSA (mainly in how the size of the nearest neighborhood is chosen) for a reason that will become evident later. We assume that the following relationship between k, λ_min, and τ holds (it can be proved when the x_i are sampled on a uniform grid, using the properties of biharmonic eigenvalues for partial differential equations):

    λ_min ≥ C(k) · λ_min^+(Δ²) · τ⁴,    (8)

where λ_min^+(Δ²) is a constant and C(k) ≍ k⁵. We will address such a result in a more general context in the future.

So far, we have assumed that k is constant. However, allowing k to be a function of the sample size n, say k = n^α with α ∈ [0, 1), allows us to control the asymptotic behavior of λ_min along with the convergence of the estimated alignment matrix to the true alignment matrix. Consider our original bound on the angle between the true coordinates and the estimated coordinates:

    lim_{τ→0} ‖tan(R(X), R(X̂))‖_2 ≤ C3 (ε̃ + C1 τ) · ‖Σ_{i=1}^n S_i‖_∞ / λ_min.

Now set k = n^α, where α ∈ [0, 1) is an exponent whose value will be decided later. We must be careful in disregarding constants, since they may involve k. We have C3 = √(kD)/C2, while C1 and C2 are fundamental constants not involving k. Further, it is easy to see that ‖Σ_{i=1}^n S_i‖_∞ is O(k): since each point has k neighbors, the maximum number of neighborhoods to which a point belongs is of the same order as k.

Now we can use a simple heuristic to estimate the size of τ, the neighborhood size. Suppose we fix τ and consider τ-neighborhoods, and, for simplicity, assume that the parameter space is the unit hypercube [0, 1]^d, where d is the intrinsic dimension. The law of large numbers tells us that k ≈ τ^d · n. Thus we can approximate τ as O(n^{(α−1)/d}). Plugging all this into the original bound and dropping the constants, we get

    lim_{τ→0} ‖tan(R(X), R(X̂))‖_2 ≤ (n^{(α−1)/d} · n^{3α/2} / λ_min) · Constant.

If we conjecture that the relationship in (8) holds in general (i.e., the generating coordinates can follow a more general distribution rather than only lying on a uniform grid), then we have

    lim_{τ→0} ‖tan(R(X), R(X̂))‖_2 ≤ (n^{(α−1)/d} · n^{3α/2}) / (n^{5α} · n^{4(α−1)/d}) · Constant.

Now the exponent is a function only of α and the constant d. We can try to solve for the α for which the convergence is as fast as possible. Simplifying the exponents, we get

    lim_{τ→0} ‖tan(R(X), R(X̂))‖_2 ≤ n^{−7α/2 − 3(α−1)/d} · Constant.

As a function of α restricted to the interval [0, 1), there is no minimum: the exponent decreases with α, and we should choose α close to 1.

However, in the proof of the convergence of LTSA, it is assumed that the errors in the local step converge to 0. This error is bounded by

    ‖sin(R(B̂_i^T), R(B_i^T))‖_2 ≤ √(kD)·[ε̃ + (1/2) C1 τ²] / (C2·τ − √(kD)·[ε̃ + (1/2) C1 τ²]).

Thus, our choice of α is restricted by the fact that the right-hand side of this inequality must still converge to 0. Disregarding constants and writing it as a function of n, we get

    n^{α/2} · n^{(2α−2)/d} / (n^{(α−1)/d} − n^{α/2} · n^{(2α−2)/d}).

This quantity converges to 0 as n → ∞ if and only if

    α/2 + (2α − 2)/d < (α − 1)/d,    i.e.,    α < 2/(d + 2).

Note that this bound is strictly less than 1 for all positive integers d, so our possible choices of α are restricted further.

By the reasoning above, we want the exponent α to be as large as possible. Further, it is easy to see that, for all d, choosing an exponent roughly equal to 2/(d + 2) yields a bound converging to 0. The following table gives the optimal exponents for selected values of d, along with the resulting convergence rate of lim_{τ→0} ‖tan(R(X), R(X̂))‖_2. In general, using the optimal value of α, the convergence rate is roughly n^{−4/(d+2)}.

Table 1: Convergence rates for a few values of the underlying dimension d.

    d                              1       2      3      4       5
    Optimal α                      0.66    0.5    0.4    0.33    0.29
    Convergence rate (power of n)  -1.33   -1     -0.8   -0.66   -0.57

The thesis [7] presents some numerical experiments illustrating the above results. Associated with each fixed value of k, there seems to be a threshold value of n above which the performance degrades. This threshold increases with k, though perhaps at the cost of worse performance for small n. We expect from the above analysis, however, that regardless of the value of k chosen, the performance will eventually become unacceptable for any fixed k.
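As a sanity check on the bookkeeping above (a throwaway computation, assuming the exponent −7α/2 − 3(α−1)/d derived earlier), the choice α = 2/(d+2) and the resulting rate −4/(d+2) reproduce the entries of Table 1 up to rounding (e.g., 2/3 is listed there as 0.66):

```python
# Optimal exponent alpha = 2/(d+2) and the resulting rate -4/(d+2), as in Table 1.
for d in range(1, 6):
    alpha = 2.0 / (d + 2)
    rate = -7.0 * alpha / 2.0 - 3.0 * (alpha - 1.0) / d   # equals -4/(d+2) at this alpha
    print(d, round(alpha, 2), round(rate, 2))
```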
5 Discussion

To the best of our knowledge, a performance analysis based on invariant subspaces is new, and consequently the worst-case upper bound is the first of its kind. There are still open questions to be addressed (Section 5.1). In addition to discussing the relation of LTSA to existing dimension reduction methodologies, we also address the relation to known results (Section 5.2).

5.1 Open Questions

The rate of convergence of λ_min is determined by the topological structure of f. It is important to estimate this rate, but the issue has not been addressed here. We did not address the correctness of (8) at all; it turns out that the proof of (8) is quite nontrivial and tedious. We assume that τ → 0. One can imagine that this holds when the error bound (ε̃) goes to 0 and the x_i's are sampled with sufficient density in the support of f. An open problem is how to derive the rate at which τ → 0 as a function of the topology of f and the sampling scheme. After doing so, we may be able to decide where our theorem is applicable.

5.2 Relation to Existing Work

The error analysis in the original paper on LTSA is the closest to our result. However, Zhang and Zha [13] do not interpret their solutions as invariant subspaces, and hence their analysis does not yield a worst-case bound of the kind derived here.

Reviewing the original papers on LLE [6], Laplacian eigenmaps [1], and Hessian eigenmaps [3] reveals that their solutions are also subspaces spanned by a specific set of eigenvectors. This naturally suggests that results analogous to ours may be derivable for these algorithms as well. A recent book chapter [4] stresses this point. After deriving the corresponding upper bounds, one could establish proofs of consistency different from those presented in those papers.

ISOMAP, another popular manifold learning algorithm, is an exception: its solution cannot immediately be rendered as an invariant subspace. However, ISOMAP calls for MDS, which can be associated with an invariant subspace; one may derive an analytical result through this route.

6 Conclusion

We derive an upper bound on the distance between two invariant subspaces: the one associated with the numerical output of LTSA and the one associated with an assumed intrinsic parametrization. Such a bound describes the performance of LTSA in the presence of observation errors, and thus provides a theoretical foundation for its use in real-world applications, in which such errors are naturally expected. Our results can also be used to establish other desirable properties, including consistency and the rate of convergence. Similar bounds may be derivable for other manifold-based learning algorithms.

References

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

[2] M. Brand. Charting a manifold. In Neural Information Processing Systems, volume 15. MIT Press, March 2003.

[3] D. L. Donoho and C. E. Grimes. Hessian eigenmaps: New locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100:5591–5596, 2003.

[4] X. Huo, X. S. Ni, and A. K. Smith. A survey of manifold-based learning methods. In Mining of Enterprise Data. Springer, New York, 2005. Invited book chapter, accepted.

[5] X. Huo and A. K. Smith. Performance analysis of a manifold learning algorithm in dimension reduction. Technical report, Georgia Institute of Technology, March 2006. Downloadable at www2.isye.gatech.edu/statistics/papers/06-06.pdf; to appear in Linear Algebra and Its Applications.

[6] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.

[7] A. K. Smith. New results in dimension reduction and model selection. Ph.D. thesis, 2008. Available at http://etd.gatech.edu.

[8] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, Boston, MA, 1990.

[9] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.

[10] T. Wittman. MANIfold learning Matlab demo. URL: http://www.math.umn.edu/wittman/mani/index.html, April 2005.

[11] H. Zha and Z. Zhang. Spectral properties of the alignment matrices in manifold learning. SIAM Review, 2008.

[12] H. Zha and Z. Zhang. Spectral analysis of alignment in manifold learning. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.

[13] Z. Zhang and H. Zha.
Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal on Scientific Computing, 26(1):313–338, 2004.