Manifold Alignment using Procrustes Analysis

Chang Wang chwang@cs.umass.edu
Sridhar Mahadevan mahadeva@cs.umass.edu
Computer Science Department, University of Massachusetts, Amherst, MA 01003 USA

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

In this paper we introduce a novel approach to manifold alignment, based on Procrustes analysis. Our approach differs from "semisupervised alignment" in that it results in a mapping that is defined everywhere (when used with a suitable dimensionality reduction method) rather than just on the training data points. We describe and evaluate our approach both theoretically and experimentally, providing results showing useful knowledge transfer from one domain to another. Novel applications of our method, including cross-lingual information retrieval and transfer learning in Markov decision processes, are presented.

1. Introduction

Manifold alignment is very useful in a variety of applications since it provides knowledge transfer between two seemingly disparate data sets. Sample applications include automatic machine translation, representation and control transfer between different Markov decision processes (MDPs), image comparison, and bioinformatics. More precisely, suppose we have two data sets $S_1 = \{x_1, \cdots, x_m\}$ and $S_2 = \{y_1, \cdots, y_n\}$ for which we want to find a correspondence. Working with the data in its original form can be very difficult, as the data might be in high dimensional spaces and the two sets might be represented by different features. For example, $S_1$ could be a collection of English documents, whereas $S_2$ is a collection of Arabic documents. Thus, it may be difficult to directly compare documents from the two collections.

Even though the processing of high-dimensional data sets is challenging, in many cases the data source may only have a limited number of degrees of freedom, implying the data set has a low intrinsic dimensionality. Similar to current work in the field, we assume kernels for computing the similarity between data points in the original space are already given. In the first step, we map the data sets to low dimensional spaces reflecting their intrinsic geometries using a standard (nonlinear or linear) dimensionality reduction approach. For example, using a graph-based nonlinear dimensionality reduction method provides a discretized approximation to the manifolds, so the new representations characterize the relationships between points but not the original features. By doing this, we can compare the embeddings of the two sets instead of their original representations. Generally speaking, if two data sets $S_1$ and $S_2$ have similar intrinsic geometry structures, they have similar embeddings. In our second step, we apply Procrustes analysis to align the two low dimensional embeddings of the data sets based on a number of landmark points. Procrustes analysis, which has been used for statistical shape analysis and image registration of 2D/3D data (Luo et al., 1999), removes the translational, rotational and scaling components from one set so that the optimal alignment between the two sets can be achieved.

There is a growing body of work on manifold alignment. Ham et al. (2005) align the manifolds leveraging a set of correspondences. In their approach, they map the points of the two data sets to the same space by solving a constrained embedding problem, where the embeddings of the corresponding points from different sets are constrained to be identical. The work of Lafon et al. (2006) is based on a framework similar to ours. They use Diffusion Maps to embed the nodes of the graphs corresponding to the aligned sets, and then apply affine matching to align the resulting clouds of points.
Our approach differs from semi-supervised alignment (Ham et al., 2005) in that it results in a mapping that is defined everywhere rather than just on the known data points (provided a suitable dimensionality reduction method like LPP (He et al., 2003) or PCA is used). Recall that semi-supervised alignment is defined only on the known data points, and it is hard to handle new test points (Bengio et al., 2004). Our method is also faster, since it requires computing eigendecompositions of much smaller matrices. Compared to affine matching, which changes the shape of one given manifold to achieve alignment, our approach keeps the manifold shape untouched. This property preserves the relationship between any two data points in each individual manifold in the process of alignment. The computation times for affine matching and Procrustes analysis are similar: both run in $O(N^3)$, where $N$ is the number of instances.

Given the fact that dimensionality reduction approaches play a key role in our approach, we provide a theoretical bound for the difference between subspaces spanned by low dimensional embeddings of the two data sets. This bound analytically characterizes when the two data sets can be aligned well. In addition to the theoretical analysis of our algorithm, we also report on several novel applications of our alignment approach.

The rest of this paper is organized as follows. In Section 2 we describe the main algorithm. In Section 3 we explain the rationale underlying our approach, and prove a bound on the difference between the subspaces spanned by low dimensional embeddings of the two data sets being aligned. We describe some novel applications and summarize our experimental results in Section 4. Section 5 provides some concluding remarks.

2. Manifold Alignment

2.1. The Problem

Given two data sets along with additional pairwise correspondences between a subset of the training instances, we want to determine a correspondence between the remaining instances in the two data sets. Formally speaking, we have two sets: $S_1 = S_1^l \cup S_1^u = \{x_1, \cdots, x_m\}$, $S_2 = S_2^l \cup S_2^u = \{y_1, \cdots, y_n\}$, and the subsets $S_1^l$ and $S_2^l$ are in pairwise alignment. We want to find a mapping $f$, which is more precisely defined in Section 3.1, to optimally match the points between $S_1^u$ and $S_2^u$.

2.2. The Algorithm

Assume the kernel $K_i$ for computing the similarity between data points in each of the two data sets is already given. The algorithmic procedure is stated below. For the sake of concreteness, in the procedure, Laplacian eigenmap (Belkin et al., 2003) is used for dimensionality reduction.

1. Constructing the relationship matrices:
· Construct the weight matrices $W_1$ for $S_1$ and $W_2$ for $S_2$ using $K_i$, where $W_1(i, j) = K_1(x_i, x_j)$ and $W_2(i, j) = K_2(y_i, y_j)$.
· Compute Laplacian matrices $L_1 = I - D_1^{-0.5} W_1 D_1^{-0.5}$ and $L_2 = I - D_2^{-0.5} W_2 D_2^{-0.5}$, where $D_k$ is a diagonal matrix ($D_k(i, i) = \sum_j W_k(i, j)$) and $I$ is the identity matrix.

2. Learning low dimensional embeddings of the data sets:
· Compute selected eigenvectors of $L_1$ and $L_2$ as the low dimensional embeddings of the data sets $S_1$ and $S_2$. Let $X$, $X_U$ be the $d$ dimensional embeddings of $S_1^l$ and $S_1^u$, and $Y$, $Y_U$ be the $d$ dimensional embeddings of $S_2^l$ and $S_2^u$, where $S_1^l$, $S_2^l$ are in pairwise alignment and $|S_1^l| = |S_2^l|$.

3. Finding the optimal alignment of $X$ and $Y$:
· Translate the configurations in $X$, $X_U$, $Y$ and $Y_U$, so that $X$, $Y$ have their centroids ($\sum_{i=1}^{|S_1^l|} X_i / |S_1^l|$, $\sum_{i=1}^{|S_2^l|} Y_i / |S_2^l|$) at the origin.
· Compute the singular value decomposition (SVD) of $Y^T X$, that is $U \Sigma V^T = \mathrm{SVD}(Y^T X)$.
· $Y^* = k Y Q$ is the optimal mapping result that minimizes $\|X - Y^*\|_F$, where $\|\cdot\|_F$ is the Frobenius norm, $Q = U V^T$ and $k = \mathrm{trace}(\Sigma) / \mathrm{trace}(Y^T Y)$.

4. Apply $Q$ and $k$ to find correspondences between $S_1^u$ and $S_2^u$:
· $Y_U^* = k Y_U Q$.
· For each element $x$ in $X_U$, its correspondence in $Y_U^*$ is $\arg\min_{y \in Y_U^*} \|y - x\|$.

Depending on the approach that we want to use, there are several variations of Step 1. For example, if we are using PCA, then we use the covariance matrices instead of Laplacian matrices; similarly, if we are using LPP (He et al., 2003), then we construct the weight matrices $W_1^l$ for $D_1^l$ and $W_2^l$ for $D_2^l$ using $K_i$ and then learn the projections. Note that when PCA or LPP is used, the low dimensional embedding is defined everywhere rather than just on the training points.

3. Justification

In this section, we prove two theorems. Theorem 1 shows why the algorithm is valid. Given the fact that dimensionality reduction approaches play a key role in our approach, Theorem 2 provides a theoretical bound for the difference between subspaces spanned by low dimensional embeddings of the two data sets. This bound analytically characterizes when the two data sets can be aligned well.

3.1. Optimal Manifold Alignment

Procrustes analysis seeks the isotropic dilation and the rigid translation, reflection and rotation needed to best match one data configuration to another (Cox et al., 2001). Given low dimensional embeddings $X$ and $Y$ (defined in Section 2), the most convenient way to handle translation is to translate the configurations in $X$ and $Y$ so that their centroids are at the origin. The problem then simplifies to finding $Q$ and $k$ so that $\|X - k Y Q\|_F$ is minimized, where $\|\cdot\|_F$ is the Frobenius norm. The matrix $Q$ is orthonormal, giving a rotation and possibly a reflection, and $k$ is a re-scale factor that either stretches or shrinks $Y$. Below, we show that the optimal solution is given by the SVD of $Y^T X$. A detailed review of Procrustes analysis can be found in (Cox et al., 2001).
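Steps 3 and 4 of the algorithm can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `procrustes_align` and `match_unlabeled` are hypothetical, rows of each matrix are assumed to be points, and row $i$ of `X` is assumed to correspond to row $i$ of `Y`.

```python
import numpy as np


def procrustes_align(X, Y):
    """Learn the re-scale factor k and orthonormal matrix Q from matched rows.

    X, Y: (n, d) embeddings of the points with known correspondences.
    Returns k, Q and the two centroids used for centering.
    """
    mu_X = X.mean(axis=0)
    mu_Y = Y.mean(axis=0)
    Xc = X - mu_X  # translate so the centroid of X is at the origin
    Yc = Y - mu_Y  # translate so the centroid of Y is at the origin

    # np.linalg.svd returns U, the singular values, and V^T.
    U, sigma, Vt = np.linalg.svd(Yc.T @ Xc)
    Q = U @ Vt                                # Q = U V^T
    k = sigma.sum() / np.trace(Yc.T @ Yc)     # k = trace(Sigma) / trace(Y^T Y)
    return k, Q, mu_X, mu_Y


def match_unlabeled(X_U, Y_U, k, Q, mu_X, mu_Y):
    """For each row of X_U, return the index of the nearest row of k * Y_U * Q."""
    X_Uc = X_U - mu_X
    Y_U_star = k * (Y_U - mu_Y) @ Q
    dists = np.linalg.norm(X_Uc[:, None, :] - Y_U_star[None, :, :], axis=2)
    return dists.argmin(axis=1)
```

As a sanity check under these assumptions, if `Y` is an exact copy of `X` that has been rotated, shrunk by a factor of 4 and shifted, the sketch should recover `k` close to 4 and a `Q` that undoes the rotation. The unlabeled points are centered with the centroids of the labeled points, following Step 3.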
Theorem 1: Let $X$ and $Y$ be low dimensional embeddings of the points with known correspondences in data sets $S_1$, $S_2$, where $X_i$ matches $Y_i$ for each $i$. If the singular value decomposition (SVD) of $Y^T X$ is $U \Sigma V^T$, then $Q = U V^T$ and $k = \mathrm{trace}(\Sigma) / \mathrm{trace}(Y^T Y)$ minimize $\|X - k Y Q\|_F$.

Proof: The problem is formalized as:

$\{k_{opt}, Q_{opt}\} = \arg\min_{k,Q} \|X - k Y Q\|_F$. (1.1)

It is easy to verify that

$\|X - k Y Q\|_F^2 = \mathrm{trace}(X^T X) + k^2 \cdot \mathrm{trace}(Y^T Y) - 2k \cdot \mathrm{trace}(Q^T Y^T X)$. (1.2)

Since $\mathrm{trace}(X^T X)$ is a constant, the minimization problem is equivalent to

$\{k_{opt}, Q_{opt}\} = \arg\min_{k,Q} (k^2 \cdot \mathrm{trace}(Y^T Y) - 2k \cdot \mathrm{trace}(Q^T Y^T X))$. (1.3)

Differentiating with respect to $k$, we have $2k \cdot \mathrm{trace}(Y^T Y) = 2 \cdot \mathrm{trace}(Q^T Y^T X)$, i.e.

$k = \mathrm{trace}(Q^T Y^T X) / \mathrm{trace}(Y^T Y)$. (1.4)

(1.3) and (1.4) show that the minimization problem reduces to

$Q_{opt} = \arg\max_Q (\mathrm{trace}(Q^T Y^T X))^2$. (1.5)

Case 1: If $\mathrm{trace}(Q^T Y^T X) \geq 0$, then the problem becomes

$Q_{opt} = \arg\max_Q \mathrm{trace}(Q^T Y^T X)$. (1.6)

Using singular value decomposition, we have $Y^T X = U \Sigma V^T$, where $U$ and $V$ are orthonormal, and $\Sigma$ is a diagonal matrix having the (non-negative) singular values of $Y^T X$ on its main diagonal. So

$\max_Q \mathrm{trace}(Q^T Y^T X) = \max_Q \mathrm{trace}(Q^T U \Sigma V^T)$. (1.7)

It is well known that for two matrices $A$ and $B$, $\mathrm{trace}(AB) = \mathrm{trace}(BA)$, so

$\max_Q \mathrm{trace}(Q^T U \Sigma V^T) = \max_Q \mathrm{trace}(V^T Q^T U \Sigma)$. (1.8)

For simplicity, we use $Z$ to represent $V^T Q^T U$. We know $Q$, $U$ and $V$ are all orthonormal matrices, so $Z$ is also orthonormal. It is well known that any element in an orthonormal matrix, say $B$, is in $[-1, 1]$ (otherwise $B^T B$ is not an identity matrix). So we know

$\mathrm{trace}(Z \Sigma) = Z_{1,1} \Sigma_{1,1} + \cdots + Z_{c,c} \Sigma_{c,c} \leq \Sigma_{1,1} + \cdots + \Sigma_{c,c}$, (1.9)

which implies that $Z = I$ maximizes $\mathrm{trace}(Z \Sigma)$, where $I$ is an identity matrix. (1.10)

Obviously, the solution to $Z = I$ is $Q = U V^T$. (1.11)

Case 2: If $\mathrm{trace}(Q^T Y^T X) < 0$, then the problem becomes

$Q_{opt} = \arg\min_Q \mathrm{trace}(Q^T Y^T X)$. (1.12)

Following a procedure similar to the one shown above, we have

$\mathrm{trace}(Z \Sigma) = Z_{1,1} \Sigma_{1,1} + \cdots + Z_{c,c} \Sigma_{c,c} \geq -\Sigma_{1,1} - \cdots - \Sigma_{c,c}$, (1.13)

which implies that $Z = -I$ minimizes $\mathrm{trace}(Z \Sigma)$. (1.14)

Obviously, the solution to $Z = -I$ is $Q = -U V^T$. (1.15)

Considering (1.5), it is easy to verify that $Q = U V^T$ and $Q = -U V^T$ return the same results, so $Q = U V^T$ is always the optimal solution to (1.5), no matter whether $\mathrm{trace}(Q^T Y^T X)$ is positive or not. Further, we can simplify (1.4), and have

$k = \mathrm{trace}(\Sigma) / \mathrm{trace}(Y^T Y)$. (1.16)

3.2. Theoretical Analysis

Many dimensionality reduction approaches first compute a relationship matrix, and then project the data onto a subspace spanned by the "top" eigenvectors of the matrix. The "top" eigenvectors mean some subset of eigenvectors that are of interest. They might be eigenvectors corresponding to largest, smallest, or even arbitrary eigenvalues. One example is Laplacian eigenmap, where we project the data onto the subspace spanned by the "smoothest" eigenvectors of the graph Laplacian. Another example is PCA, where we project the data onto the subspace spanned by the "largest" eigenvectors of the covariance matrix. In this section, we study the general approach, which provides a general framework for each individual algorithm such as Laplacian eigenmap. We assume the two given data sets $S_1$ and $S_2$ do not differ significantly, so the related relationship matrices $A$ and $B$ are "very similar". We study the difference between the embedding subspaces corresponding to the two relationship matrices. Notation used in the proof is in Figure 1. The difference between orthogonal projections $\|Q - P\|$ characterizes the distance between the two subspaces. The proof of the theorem below is based on the perturbation theory of spectral subspaces, where $E = B - A$ can be thought of as the perturbation to $A$. The only assumption we need to make is that for any $i$ and $j$, $|E_{i,j}| = |B_{i,j} - A_{i,j}| \leq \tau$.

· $A$ is an $N \times N$ relationship matrix computed from $S_1$.
· $B$ is an $N \times N$ relationship matrix computed from $S_2$.
· $E = B - A$.
· $\mathcal{X}$ denotes a subspace of the column space of $A$ spanned by the top $M$ eigenvectors of $A$.
· $\mathcal{Y}$ denotes a subspace of the column space of $B$ spanned by the top $M$ eigenvectors of $B$.
· $X$ is a matrix whose columns are an orthonormal basis of $\mathcal{X}$.
· $Y$ is a matrix whose columns are an orthonormal basis of $\mathcal{Y}$.
· $\sigma_A^1$ is the set of top $M$ eigenvalues of $A$; $\sigma_A^2$ includes all eigenvalues of $A$ except those in $\sigma_A^1$.
· $\sigma_B^1$ is the set of top $M$ eigenvalues of $B$; $\sigma_B^2$ includes all eigenvalues of $B$ except those in $\sigma_B^1$.
· $d_1$ is the eigengap between $\sigma_A^1$ and $\sigma_A^2$, i.e. $d_1 = \min_{\lambda_i \in \sigma_A^1, \lambda_j \in \sigma_A^2} |\lambda_i - \lambda_j|$.
· $d = \|\sigma_A^1 - \sigma_B^2\|$.
· $P$ denotes the orthogonal projection onto subspace $\mathcal{X}$.
· $Q$ denotes the orthogonal projection onto subspace $\mathcal{Y}$.
· $\|\cdot\|$ denotes the operator norm, i.e. $\|L\|_{\mu,\nu} = \max_{\nu(x)=1} \mu(Lx)$, where $\mu$, $\nu$ are simply $\|\cdot\|_2$.

Figure 1. Notation used in Theorem 2.

Theorem 2: If the absolute value of each element in $E$ is bounded by $\tau$, and $\tau \leq 2 \epsilon d_1 / (N(\pi + 2\epsilon))$, then the difference between the two embedding subspaces $\|Q - P\|$ is at most $\epsilon$.

Proof: From the definition of operator norm, we know

$\|E\| = \max_{k_1, k_2, \cdots, k_N} \sqrt{\sum_i (\sum_j k_j E_{i,j})^2}$, given $\sum_i k_i^2 = 1$. (2.1)

We can verify that the following inequality always holds:

$(\sum_j k_j E_{i,j})^2 \leq N \sum_j k_j^2 E_{i,j}^2$. (2.2)

From (2.1) and (2.2), we have

$\sum_i (\sum_j k_j E_{i,j})^2 \leq N \tau^2 \sum_i \sum_j k_j^2 = N^2 \tau^2$. (2.3)

Combining (2.1) and (2.3), we have:

$\|E\| \leq N \tau$. (2.4)

It can be shown that if $A$ and $E$ are bounded self-adjoint operators on a separable Hilbert space, then the spectrum of $A + E$ is in the closed $\|E\|$-neighborhood of the spectrum of $A$ (Kostrykin et al., 2003). From (Kostrykin et al., 2003), we also have the following inequality:

$\|Q^\perp P\| \leq \frac{\pi \|E\|}{2d}$. (2.5)

We know $A$ has an isolated part $\sigma_A^1$ of the spectrum separated from its remainder $\sigma_A^2$ by gap $d_1$. To guarantee that $A + E$ also has separated components, we need to assume $\|E\| < d_1 / 2$. Thus (2.5) becomes

$\|Q^\perp P\| \leq \frac{\pi \|E\|}{2(d_1 - \|E\|)}$. (2.6)

Interchanging the roles of $\sigma_A^1$ and $\sigma_A^2$, we have the analogous inequality:

$\|Q P^\perp\| \leq \frac{\pi \|E\|}{2(d_1 - \|E\|)}$. (2.7)

Since

$\|Q - P\| = \max\{\|Q^\perp P\|, \|Q P^\perp\|\}$, (2.8)

we have

$\|Q - P\| \leq \frac{\pi \|E\|}{2(d_1 - \|E\|)}$. (2.9)

We define $R = Q - P$, and from (2.9), we get

$\|R\| \leq \frac{\pi \|E\|}{2(d_1 - \|E\|)}$. (2.10)

(2.10) implies that if $\|E\| \leq 2 d_1 \epsilon / (2\epsilon + \pi)$, then $\|R\| \leq \epsilon$. (2.11)

So we have the following conclusion: if the absolute value of each element in $E$ is bounded by $\tau$, and $\tau \leq 2 \epsilon d_1 / (N(\pi + 2\epsilon))$, then the difference of the subspaces spanned by the top $M$ eigenvectors of $A$ and $B$ is at most $\epsilon$.

Theorem 2 tells us that if the eigengap (between $\sigma_A^1$ and $\sigma_A^2$) is large, then the subspace corresponding to the top $M$ eigenvectors of $A$ is insensitive to perturbations. In other words, the algorithm can tolerate larger differences between $A$ and $B$. So when we are selecting eigenvectors to form a subspace, the eigengap is an important factor to be considered. The reasoning behind this is that if the magnitudes of the relevant eigenvalues do not change too much, the top $M$ eigenvectors will not be overtaken by other eigenvectors, thus the related space is more stable. Our result in essence connects the difference between the two relationship matrices to the difference between the subspaces spanned by their low dimensional embeddings.

4. Applications and Results

In this section, we first use a toy example to illustrate how our algorithm works, then we apply our approach to transfer knowledge from one domain to another. We present results applying our approach to two real world problems: cross-lingual information retrieval and transfer learning in Markov decision processes (MDPs).

4.1. A Toy Example

In this example, we directly align two manifolds and use some pictures to illustrate how our algorithm works. The two manifolds come from real protein tertiary structure data.
Protein 3D structure reconstruction is an important step in Nuclear Magnetic Resonance (NMR) protein structure determination. Basically, it finds a map from distances to coordinates. A protein 3D structure is a chain of amino acids. Let $n$ be the number of amino acids in a given protein and $C_1, \cdots, C_n$ be the coordinate vectors for the amino acids, where $C_i = (C_{i,1}, C_{i,2}, C_{i,3})^T$ and $C_{i,1}$, $C_{i,2}$, and $C_{i,3}$ are the $x$, $y$, $z$ coordinates of amino acid $i$ (in biology, one usually uses atom but not amino acid as the basic element in determining protein structure. Since the number of atoms is huge, for simplicity, we use amino acid as the basic element). Then the distance $d_{i,j}$ between amino acids $i$ and $j$ can be defined as $d_{i,j} = \|C_i - C_j\|$. Define $A = \{d_{i,j}, i, j = 1, \cdots, n\}$, and $C = \{C_i, i = 1, \cdots, n\}$. It is easy to see that if $C$ is given, then we can immediately compute $A$. However, if $A$ is given, it is non-trivial to compute $C$. The latter problem is called Protein 3D structure reconstruction. In fact, the problem is even more tricky, since only the distances between neighbors are reliable, and this makes $A$ an incomplete distance matrix. The problem has been proved to be NP-complete for general sparse distance matrices (Hogben, 2006). In real life, people use other techniques, such as angle constraints and human experience, together with the partial distance matrix to determine protein structures.

With the information available to us, NMR techniques might find multiple estimations (models), since more than one configuration can be consistent with the distance matrix and the constraints. Thus, the result is an ensemble of models, rather than a single structure. Most usually, the ensemble of structures, with perhaps 10 - 50 members, all of which fit the NMR data and retain good stereochemistry, is deposited with the Protein Data Bank (PDB) (Berman et al., 2000).
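The easy direction described above, computing the distance set $A$ from the coordinates $C$, can be sketched as follows. The helper names and the `cutoff` parameter are hypothetical; the cutoff is only an assumed stand-in for "distances between neighbors are reliable", and unreliable entries are marked as NaN to produce an incomplete distance matrix.

```python
import numpy as np


def pairwise_distances(C):
    """C: (n, 3) coordinates, one row per amino acid. Returns the (n, n) matrix of d_ij."""
    diff = C[:, None, :] - C[None, :, :]
    return np.linalg.norm(diff, axis=2)


def keep_neighbor_distances(D, cutoff):
    """Mark distances above `cutoff` (an assumed reliability threshold) as missing."""
    D_partial = D.copy()
    D_partial[D > cutoff] = np.nan
    return D_partial
```

Going the other way, from the partial matrix back to coordinates, is the hard reconstruction problem the text refers to and is not attempted here.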
Models related to the same protein should be similar, and comparisons between the models in this ensemble provide some information on how well the protein conformation was determined by NMR.

In this test, we study a Glutaredoxin protein PDB1G7O (this protein has 215 amino acids in total), whose 3D structure has 21 models. Since such models are already low dimensional (3D) embeddings of the distance matrices, we skip Steps 1 and 2 in our algorithm. We pick Model 1 and Model 21 for the test. These two models are related to the same protein, so it makes sense to treat them as manifolds to test our techniques. We denote Model 1 by Manifold $A$, which is represented by matrix $S_1$. We denote Model 21 by Manifold $B$, which is represented by matrix $S_2$. Obviously, both $S_1$ and $S_2$ are $215 \times 3$ matrices. To evaluate our re-scale factor, we manually stretch manifold $A$ by letting $S_1 = 4 \cdot S_1$. Manifold $A$ and $B$ (row vectors of $S_1$ and $S_2$ represent points in the 3D space) are shown in Figure 2(A) and Figure 2(B). In biology, such chains are called protein backbones. For the purpose of comparison, we also plot both manifolds on the same graph (Figure 2(C)). It is clear that manifold $A$ is much larger than $B$, and the orientations of $A$ and $B$ are quite different.

To align the two manifolds, we uniformly selected 1/4 of the amino acids as correspondences, resulting in matrices $X$ and $Y$, where row $i$ of $X$ (from $S_1$) matches row $i$ of $Y$ (from $S_2$) and both $X$ and $Y$ are $54 \times 3$ matrices. We run our algorithm from Step 3. Our algorithm identifies the re-scale factor $k$ as 4.2971, and the rotation matrix $Q$ as

$Q = \begin{pmatrix} 0.56151 & 0.65793 & -0.50183 \\ -0.53218 & 0.75154 & 0.38983 \\ 0.63363 & 0.048172 & 0.77214 \end{pmatrix}$.

$S_2^*$, the new representation of $S_2$, is computed as $S_2^* = k S_2 Q$. We plot $S_2^*$ and $S_1$ in the same graph (Figure 2(D)). The result shows that Manifold $B$ is rotated and enlarged to a size similar to that of $A$, and now the two manifolds are aligned very well.

[Figure 2: four 3D plots of the protein backbones, with axes X, Y and Z.]

Figure 2. (A): Manifold A; (B): Manifold B; (C): Comparison of Manifold A (red) and B (blue) before alignment; (D): Comparison of Manifold A (red) and B (blue) after alignment.

4.2. Cross-lingual Information Retrieval

In information retrieval, manifold alignment can be used to find correspondences between documents. One example is finding the exact correspondences between documents in different languages. Such systems are quite useful, since they allow users to query a document in their native language and retrieve documents in a foreign language. Assume that we are given two document collections, for example, one in English and one in Arabic. We are also given some training correspondences between documents that are exact translations of each other. The task is: for each English or Arabic document in the untranslated set, find the most similar document in the other corpus.

We apply our manifold alignment approach to this problem. The topical structure of each collection can be thought of as a manifold over documents. Each document is a sample from the manifold. We are interested in the case where the underlying topical manifolds of two languages are similar. Our procedure for aligning collections consists of two steps: learning low dimensional embeddings of the two manifolds and aligning the low dimensional embeddings. To compute the similarity of two documents in the same collection, we assume that document vectors are language models (multinomial term distributions) estimated using the document text. By treating documents as probability distributions, we can use distributional affinity to detect topical relatedness between documents. More precisely, a multinomial diffusion kernel is used for this particular application. The kernel used here is the same as the one used in (Diaz et al., 2007), where a more detailed description is provided. Dimensionality reduction approaches are then used to learn the low dimensional embeddings. After shifting the centroids of the documents in each collection to the origin, we apply our approach to learn the re-scale factor $k$ and rotation $Q$ from the training correspondences and then apply them to the untranslated set.

In our experiments, we used two document collections (one in English, one in Arabic, manually translated), each of which has 2119 documents. Correspondences between 25% of them were given and used to learn the mapping between them. The remaining 75% were used for testing. We used Laplacian eigenmap and LPP (the projection was learned from the data points in the correspondence) to learn the low dimensional embeddings, where the top 100 eigenvectors were used to construct the embeddings. Our testing scheme is as follows: for each given Arabic document, we retrieve its top $j$ most similar English documents. The probability that the true match is among these top $j$ documents is used to show the goodness of the method. We also used the same data set to test the semi-supervised manifold alignment method proposed in (Ham et al., 2005), where the top 100 eigenvectors were used for low dimensional embeddings. A fourth method (called the baseline method) was also tested. The baseline method is as follows: assume that we have $m$ correspondences in the training set, then document $x$ is represented by a vector $V$ with length $m$, where $V(i)$ is the similarity of $x$ and the $i$th document in the training correspondences. The baseline method maps the documents from different collections to the same embedding space $R^m$. Experiment results are shown in Figure 3.

[Figure 3: probability of matching (vertical axis, 0 to 0.9) against $j$ (horizontal axis, 1 to 10) for four methods: Procrustes Analysis (Laplacian), Procrustes Analysis (LPP), Baseline, and Semi-supervised Manifold Alignment.]

Figure 3. Cross-lingual information retrieval test.

Compared to the semi-supervised manifold alignment method, the performance of Procrustes (with Laplacian eigenmap) is significantly better. For each given Arabic document, if we retrieve the 3 most relevant English documents, then the true match has a 60% probability of being among the 3. If we retrieve the 10 most relevant English documents, then we have about an 80% probability of getting the true match. Further, our method is much faster. The semi-supervised manifold alignment method requires solving an eigenvalue problem over a $(n_1 + n_2 - m) \times (n_1 + n_2 - m)$ matrix, where $n_i$ is the total number of documents in collection $i$, and $m$ is the number of training correspondences. Using our approach, the most time consuming step is finding the low dimensional embeddings with Laplacian eigenmap, which requires solving eigenvalue problems over an $n_1 \times n_1$ matrix and an $n_2 \times n_2$ matrix. We also compute the SVD over a $d \times d$ matrix, where $d$ is the dimension of the low dimensional embeddings and is usually much smaller than $n$. In the experiments, Procrustes (with Laplacian eigenmap) is roughly 2 times faster than semi-supervised manifold alignment.

Procrustes (with LPP) also returns reasonably good results: if we retrieve the 10 most relevant English documents, then we have a 60% probability of getting the true match. Procrustes (with LPP) results in a mapping that is defined everywhere rather than just on the training data points, and it also requires less time. Another interesting result is that the baseline algorithm also performs quite well, and better than the semi-supervised alignment method.
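The top-$j$ testing scheme used in this section can be sketched as below. This is an illustrative reading of the evaluation, not the authors' code: the function name `top_j_match_rate` is hypothetical, Euclidean distance in the aligned embedding space is assumed as the similarity measure, and row $i$ of both matrices is assumed to be a true translation pair.

```python
import numpy as np


def top_j_match_rate(X_test, Y_test_aligned, max_j=10):
    """Fraction of Arabic documents whose true English match is among the j nearest.

    X_test:          (n, d) embeddings of the English test documents.
    Y_test_aligned:  (n, d) embeddings of the Arabic test documents after
                     applying k and Q; row i is the translation of X_test row i.
    Returns an array whose entry j-1 is the match rate for the top j.
    """
    # Squared Euclidean distances from every Arabic document to every English one.
    sq = ((Y_test_aligned ** 2).sum(axis=1)[:, None]
          + (X_test ** 2).sum(axis=1)[None, :]
          - 2.0 * Y_test_aligned @ X_test.T)
    true_sq = np.diag(sq)
    # Rank of the true match = number of English documents strictly closer.
    ranks = (sq < true_sq[:, None]).sum(axis=1)
    return np.array([(ranks < j).mean() for j in range(1, max_j + 1)])
```

Plotting the returned array against $j$ gives a curve of the kind reported in Figure 3.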
One reason that the semi-supervised manifold alignment method is not working well is that mappings of the corresponding points are constrained to be identical. This might lead to "over fitting" problems for some applications.

4.3. Transfer Learning in Markov Decision Process

Transfer learning studies how to re-use knowledge learned from one domain or task in a related domain or task. In this section, we investigate transfer learning in Markov decision processes (MDPs) following the approach of "proto-value functions" (PVFs), where the Laplacian eigenmap method is used to construct basis functions (Mahadevan, 2005). In an MDP, a value function is a mapping from states to real numbers, where the value of a state represents the long-term reward achieved starting from that state and executing a particular policy. PVFs are an orthonormal basis spanning all value functions of an MDP on a state space manifold. They are computed as follows: first, create a weight matrix that reflects the topology of the state space using a series of random walks; second, compute the graph Laplacian of the weight matrix; third, select the smoothest $k$ eigenvectors of this graph Laplacian as PVFs. If the state space is the same and only the reward function is changed, then the PVFs can be directly transferred to the new domain. One interesting question related to PVFs is how to transfer the old PVFs to a new domain when the new state space is only slightly different from the old one. In this section, we answer this question with our techniques.

Let the columns of $Y$ denote PVFs of the current MDP. Given the procedure for generating PVFs, we know the rows of $Y$ are also the low dimensional representations of the data points on the current state space manifold. Let the rows of $X$ represent the low dimensional embedding of the new manifold. Assume the centroids of both $X$ and $Y$ are at the origin.
By using isotropic dilation, reflection and rotation to align the two state space manifolds, we may find the optimal $k$ and $Q$ such that the two manifolds are aligned well. Our argument is that the new PVFs are $Y Q$. The reason is as follows: suppose we have already found the optimal $k$ and $Q$ that minimize $\|X - k Y Q\|_F$, then $Y$ will be changed to $k Y Q$ in the process of alignment. $k$ can be skipped, since it is well known that $k Y Q$ and $Y Q$ span the same space. The only thing that we need to show is that the columns of $Y Q$ are orthonormal to each other (a requirement of PVFs). The proof is quite simple: $(Y Q)^T Y Q = Q^T Y^T Y Q = Q^T I Q = I$, where $I$ is an identity matrix. This means different columns of $Y Q$ are orthogonal to each other and the norm of each column is 1, so $Y Q$ is orthonormal.

The conclusion shown above works when two state space manifolds are similar. Here, we still need to answer one more question: "under what conditions are the two manifolds similar?". Theorem 2 provides an answer to this question. Theorem 2 numerically bounds the difference between two spaces given the difference between the relevant relationship matrices. For this case, the relationship matrices are the Laplacian matrices used to model the state spaces. In this test, we run experiments to verify the bound.

We investigate two reinforcement learning tasks. The inverted pendulum task requires balancing a pendulum of unknown mass and length by applying force to a cart attached to the pendulum. The state space is defined by two variables: the vertical angle of the pendulum, and the angular velocity of the pendulum. The mountain car task is to get a simulated car to the top of a hill as quickly as possible. The car does not have enough power to get there immediately, and so must oscillate on the hill to build up the necessary momentum. The state space is the position and velocity of the car.
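The quantities that Theorem 2 relates can be computed directly from two relationship matrices. The sketch below is a hedged illustration with hypothetical function names. It assumes symmetric matrices and takes the "top" $M$ eigenvectors to be those with the smallest eigenvalues, as for the smoothest eigenvectors of a normalized graph Laplacian; unlike the experiment described in this section, it does not drop the trivial eigenvector.

```python
import numpy as np


def normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2} for a symmetric weight matrix W with positive row sums."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]


def bottom_projection(A, M):
    """Orthogonal projection onto the span of the M smallest-eigenvalue eigenvectors.

    Also returns the eigengap d1 between the M-th and (M+1)-th eigenvalues.
    """
    vals, vecs = np.linalg.eigh(A)  # eigenvalues in ascending order
    V = vecs[:, :M]
    return V @ V.T, vals[M] - vals[M - 1]


def subspace_difference(A, B, M):
    """Return ||Q - P||, ||E||, tau = max |E_ij| and the eigengap d1 of A."""
    P, d1 = bottom_projection(A, M)
    Q, _ = bottom_projection(B, M)
    E = B - A
    diff = np.linalg.norm(Q - P, 2)   # operator (spectral) norm
    E_norm = np.linalg.norm(E, 2)
    tau = np.abs(E).max()
    return diff, E_norm, tau, d1
```

With these values, inequality (2.4) can be checked as `E_norm <= N * tau`, and, when `E_norm < d1 / 2`, the right-hand side of (2.9) is `np.pi * E_norm / (2 * (d1 - E_norm))`, which can be compared against `diff`.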
We first generate two different sets of sampled states for the pendulum task and compute their related normalized graph Laplacian matrices $A$ and $B$. We compute the top $i$ non-trivial eigenvectors of $A$ and $B$, and directly compute the difference between the spaces spanned by them. Theorem 2 says that if the absolute value of each element in $A - B$ is bounded by $\tau$, and $\tau \leq 2 \epsilon d_1 / (N(\pi + 2\epsilon))$, then the difference of the spaces spanned by the top $i$ eigenvectors of $A$ and $B$ is at most $\epsilon$. We set $\epsilon$ to be 0.5, and let $\tau$ be $\epsilon d_1 / (N(\pi + 2\epsilon))$. Here $d_1$ is the eigengap between the top $i$ eigenvectors and the other eigenvectors, and $N$ is 500. Based on our theorem, the difference between spaces should not be larger than $\epsilon$. In our experiments, we tried 20 different values for $i$ = 1, 6, 11, $\cdots$, 96. For each $i$, we ran 5 tests. We carried out the same experiment on the Mountain Car task. Figure 4(A) and 4(B) respectively show the results from the Pendulum task and the Mountain car task. For each figure, we plot $\epsilon$ and the maximum and minimum difference values of the 5 tests for various values of $i$. For this application, the bound is loose, but the bound given in Theorem 2 is a general theoretical bound and for other applications, it might be tight. We also empirically evaluate the PVFs transfer performance. The results (not included) show that we can learn a good policy by using PVFs from a similar domain.

[Figure 4: two plots, (A) Pendulum Task and (B) Mountain Car Task.]

Figure 4. (A): Bound for Pendulum task. (B): Bound for Mountain car task. For both tasks, $\epsilon$ is 0.5, true values (Max and Min in 5 tests) of the difference between two spaces are in dotted lines.

5. Conclusions

In this paper we introduce a novel approach to manifold alignment based on Procrustes analysis. When used with a suitable dimensionality reduction method, our approach results in a mapping defined everywhere rather than just on the training data points. We also study the conditions under which low dimensional embeddings of two data sets can be aligned well. We presented novel applications of our approach, including cross-lingual information retrieval and transfer learning in Markov decision processes.

Acknowledgments

We thank the reviewers for their helpful comments. This project was supported in part by the National Science Foundation under grant IIS-0534999.

References

Belkin, M., Niyogi, P. (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15.

Bengio, Y. et al. (2004) Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering. NIPS 16.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., Bourne, P. E. (2000) The protein data bank. Nucleic Acids Research, 28:235-242.

Cox, M. F., Cox, M. A. A. (2001) Multidimensional scaling. Chapman and Hall.

Diaz, F., Metzler, D. (2007) Pseudo-aligned multilingual corpora. The International Joint Conference on Artificial Intelligence (IJCAI) 2007. 2727-2732.

Ham, J., Lee, D., Saul, L. (2005) Semisupervised alignment of manifolds. 10th International Workshop on Artificial Intelligence and Statistics. 120-127.

He, X., Niyogi, P. (2003) Locality preserving projections. The Annual Conference on Neural Information Processing Systems (NIPS) 16.

Hogben, L. (2006) Handbook of linear algebra. Chapman/Hall CRC Press.

Kostrykin, V., Makarov, K. A., Motovilov, A. K. (2003) On a subspace perturbation problem. Proc. of the American Mathematical Society. 131:3469-3476.

Lafon, S., Keller, Y., Coifman, R. R. (2006) Data fusion and multi-cue data matching by diffusion maps. IEEE Transactions on Pattern Analysis and Machine Intelligence. 28(11):1784-1797.

Luo, B., Hancock, W.R. (1999) Feature matching with Procrustes alignment and graph editing. 7th International Conference on Image Processing and its Applications.

Mahadevan, S. (2005) Proto-value functions: developmental reinforcement learning. The 22nd International Conference on Machine Learning (ICML).