Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems Le Song Jonathan Huang School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA Alex Smola Yahoo! Research, Santa Clara, CA 95051, USA Kenji Fukumizu Institute of Statistical Mathematics, Tokyo 106-8569, Japan lesong@cs.cmu.edu jch1@cs.cmu.edu alex@smola.org fukumizu@ism.ac.jp Abstract In this paper, we extend the Hilbert space embedding approach to handle conditional distributions. We derive a kernel estimate for the conditional embedding, and show its connection to ordinary embeddings. Conditional embeddings largely extend our ability to manipulate distributions in Hilbert spaces, and as an example, we derive a nonparametric method for modeling dynamical systems where the belief state of the system is maintained as a conditional embedding. Our method is very general in terms of both the domains and the types of distributions that it can handle, and we demonstrate the effectiveness of our method in various dynamical systems. We expect that conditional embeddings will have wider applications beyond modeling dynamical systems. In this paper, we introduce the concept of embedding conditional distributions into a Hilbert space. Hilbert space embeddings of conditional distributions are potentially useful in applications where conditional distributions are the key quantities of interest, such as regressing structured response variables. In particular, we use the embeddings obtained in this paper to design a novel nonparametric approach for maintaining the belief state in a dynamical system. By not committing to fixed parameterizations of the transition and observation models, our nonparametric method handles a wider variety of problems than most existing solutions. Moreover, while typical inference algorithms for dynamical systems are formulated for a particular domain (such as Rn ), our method can be applied to any domain admitting an appropriate kernel function and can thus be applied to not only Euclidean spaces, but more exotic spaces (both continuous and discrete) such as the rotation matrices, permutations, strings, graphs, etc. In the following, we summarize some of the main contributions of this paper. 1. We introduce the concept of embedding conditional distributions in an RKHS and present a novel method for estimating such embeddings from training data. 2. We consider several useful probabilistic inference operations such as the sum rule and the chain rule, and show, using our theory, that these operations can be performed solely in the RKHS. 3. We apply our inference algorithms to learn nonparametric models and perform inference for dynamical systems. Our algorithms are general first because they handle a wide variety 1. Introduction Recent advances in kernel methods have yielded a method (Smola et al., 2007) for dealing with probability distributions by representing them as points in a suitable reproducing kernel Hilbert space (RKHS). This method of RKHS embeddings has proven itself to be a powerful and flexible tool for dealing with high-order statistics of random variables and has found wide application, ranging from twosample tests (Gretton et al., 2007) to dimensionality reduction (Song et al., 2008). Appearing in Proceedings of the 26 th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s). Hilbert Space Embeddings of Conditional Distributions Table 1. Table of Notation random variable X Y Z domain X Y Z observation x y z kernel k(x, x ) l(y, y ) u(z, z ) kernel matrix K L U feature map (x), k(x, ·) (y), l(y, ·) (z), u(z, ·) feature matrix F G H RKHS dynamic system st st+1 ot+1 ture map and its empirical estimate: (a) µX := EX [(X)] , (b) µX := ^ 1 m m i=1 (xi ), (2) of possible nonlinear/nongaussian models and second because they apply in any setting in which an appropriate kernel function can be defined. 2. Hilbert Space Embedding We begin by providing an overview of Hilbert space embeddings in which one represents probability distributions by elements in a Hilbert space. In our setting of dynamical systems, we will eventually think of representing the belief state at each timestep as a point in an Hilbert space. In the following we denote by X, a random variable with domain X and refer to instantiations of X by the lower case character, x. We endow X with some -algebra A and denote the space of all probability distributions (with respect to A ) on X by P. Finally, we will consider a single distribution P (X) on X . A reproducing kernel Hilbert space (RKHS) F on X with kernel k is a Hilbert space of functions f : X R. Its dot product ·, · F , satisfies the reproducing property: f (·), k(x, ·) k(x, ·), k(x , ·) = f (x), and consequently, = k(x, x ), (1a) (1b) where DX = {x1 , . . . , xm } is a training set assumed to have been drawn i.i.d. from P (X). If the condition EX [k(X, X)] < is satisfied, then the mapping µX is guaranteed to be an element of the RKHS. By virtue of the reproducing property of F, both mappings satisfy m 1 µX , f F = EX [f (X)] and µX , f F = m i=1 f (xi ), ^ thus showing that we can compute expectations and empirical means with respect to P (X) and DX respectively by taking inner products with the embeddings µX and µX . We will hence also refer to the two em^ beddings as mean maps. Mean maps are attractive for several reasons (Smola et al., 2007; Sriperumbudur et al., 2008). First, it turns out that for certain choices of kernel functions, we can guarantee that distinct distributions map to distinct points in an RKHS. We say that such a kernel function is characteristic. More precisely, we have: Definition 1 When the mean map µX : P F is injective, the kernel function k is called characteristic. Secondly, while we rarely have access to the true underlying distribution, a finite sample of size m from a distribution P suffices to (with high probability) com1 pute an approximation within an error of Op (m- 2 ): Theorem 2 The empirical mean µX converges to µX ^ 1 in the RKHS norm at a rate of Op (m- 2 ). Cross-covariance operator. To incorporate the transition and observation models that arise in Markovian dynamics, we will present in the next section, as one of our main contributions, a method for embedding conditional distributions. Our technique relies on a generalization of the covariance matrix known as the cross-covariance operator CXY : G F, which is defined as (Baker, 1973): CXY = EXY [(X) (Y )] - µX µY , (3) F F meaning that we can view the evaluation of a function f at any point x X as an inner product, and the linear evaluation operator is given by k(x, ·), i.e. the kernel function. Alternatively, k(x, ·) can also be viewed as a feature map (x) where k(x, x ) = (x), (x ) F . Popular kernel functions on Rn include the polynomial d kernel k(x, x ) = x, x and the Gaussian RBF kernel 2 k(x, x ) = exp(- x - x ). Good kernel functions have also been defined on graphs, time series, dynamical systems, images, and structured objects (Sch¨lkopf o et al., 2004). Analogously, the above definitions and notational convention also apply to random variables Y and Z (See Table 1 for a summary). Embedding distribution. At the heart of the approach lie two simple embeddings -- the expected fea- where is the tensor product. Alternatively, CXY can also be viewed as an element in the tensor product space G F. Given two functions, f F and g G, their cross-covariance, CovXY [f (X), g(Y )] := EXY [f (X)g(Y )] - EX [f (X)]EY [g(Y )], is computed as: f, CXY g F or equivalently f g, CXY F G . (4) Given m pairs of training examples DXY = {(x1 , y1 ), . . . , (xm , ym )} drawn i.i.d. from P (X, Y ), we denote by = ((x1 ), . . . , (xm )) and = ((y1 ), . . . , (ym )) the feature matrices. The covariance operator CXY can then be estimated as Hilbert Space Embeddings of Conditional Distributions 1 1 ^ CXY = m ( - µX 1 )( - µY 1 ) = m H ^ ^ where H is an idempotent centering matrix defined 1 by H = I - m 11 . Theorem 5 Let kx := (x). Then µY |x can ^ be estimated as: µY |x = (HK + mI)-1 Hkx = ^ m i=1 i (x)(yi ), where each i is a real-valued weight. Theorem 5 shows that the empirical estimator of the conditional embedding bears a remarkable similarity to the estimator of the ordinary embedding from Equation (2). The difference is that, instead of applying 1 uniform weights m , the former applies non-uniform weights, i , on observations which are, in turn, determined by the conditioning variable. These nonuniform weights reflect the effects of conditioning on Hilbert space embeddings. A special case of this conditional embedding was employed in Quadrianto et al. (2008) for discrete conditioning variables x (multiclass labels), where a delta kernel is applied on x and the space of observations (y) is partitioned according to the value of x. In this case, 1 the conditional embedding is given by a i (x) = n if 1 xi = x otherwise i (x) = n-m , where n m is the total number of xi with label x.1 Our new definition of conditional embedding does not require an explicit partitioning of the observation space and is thus applicable to larger, more general spaces such as images and strings. We can further show that our estimator is consistent (proof omitted). Theorem 6 Assume k(x, ·) is in the range of CXX . The empirical conditional embedding µY |x converges to ^ 1 1 µY |x in the RKHS norm at a rate of Op ((m)- 2 + 2 ). Theorem 6 suggests that conditional embeddings are harder to estimate than ordinary embeddings. If we seek an estimator with a bias level of , we can fix to the order of O( 2 ) and obtain an overall convergence 1 rate of O(m- 2 ). 3. Conditional Embeddings In this section, we discuss our first main contribution: embedding conditional distributions of the form P (Y |X) into an RKHS. Unlike the embeddings discussed in the previous section, the conditional embeddings which we will define shortly will not be single elements in the RKHS, but will instead sweep out a family of points in the RKHS, each one indexed by a fixed value of the conditioning variable X. It is only by fixing X to a particular value x, that we will be able to obtain a single RKHS element, µY |x G. In other words, UY |X will be viewed as an operator mapping from F to G. In order to define the conditional embedding, UY |X , we will want it to satisfy two properties: 1. µY |x := EY |x [(Y )|x] = UY |X k(x, ·), and 2. EY |x [g(Y )|x] = g, µY |x G . Assume now that for all g G, the conditional expectation EY |X [g(Y )|X] is an element of F. We now show that the conditional embedding can be defined in terms of cross-covariance operators using a relation provided by Fukumizu et al. (2004), CXX EY |X [g(Y )|X] = CXY g. Definition 3 The operator UY |X -1 UY |X := CY X CXX . (5) is defined as: We can now show that that under our definition, UY |X satisfies the properties that we wanted. Theorem 4 Assuming that EY |X [g(Y )|X] F, the embedding of conditional distributions in Definition 3 satisfies properties 1 and 2. Proof By virtue of the reproducing property, given an x, we have EY |x [g(Y )|x] = EY |X [g(Y )|X] , k(x, ·) F . Using relation (5) and taking the conjugate -1 transpose of CXX CXY , we have EY |x [g(Y )|x] = -1 g, CY X CXX k(x, ·) G which proves the theorem. We remark that while the assumption EY |X [g(Y )|X] F always holds for finite domains with characteristic kernels, it is not necessarily true for continuous domains. In the cases where the assumption does not hold, We use the expression -1 CY X CXX k(x, ·) as an approximation of the conditional mean µY |x and propose a kernel estimate based on the expression for practical use. Given a dataset DXY of size m, the conditional embedding µY |x ^ can be estimated using the following theorem (proof omitted). 4. Operations on RKHS Embeddings Conditional embeddings and cross-covariance operators largely extend our ability to manipulate distributions embedded in an RKHS. In this section, we discuss several examples of this new vocabulary of operations. Sum rule: Given a joint distribution over X and Y , we would like to compute the marginal distribution over a single variable X by summing out Y . P (X) = Y P (X, Y ) = Y P (X|Y )P (Y ), We would like to now derive the Hilbert space counterpart of this operation. Using the law of total expectation, we have µX = EXY [(X)] = EY EX|Y [(X)|Y ]. Plugging in the conditional embedding, we have µX = EY [UX|Y (Y )] = UX|Y EY [(Y )] = UX|Y µY . (6) 1 Negative weights 1 n-m are due to centering matrix H. Hilbert Space Embeddings of Conditional Distributions Chain rule: The chain rule is given by P (X, Y ) = P (X|Y )P (Y ) = P (Y |X)P (X), where we factorize the joint distribution into a product of two distributions. This rule builds the connection between the two ways of factorization which form the basis for Bayes rule. In this case, if we use a tensor product joint feature map (X) (Y ), there are two equivalent ways of factoring µXY = EXY [(X) (Y )] according to the law of total expectation: EY [EX|Y [(X)|Y ] (Y )] = UX|Y EY [(Y ) (Y )] EX [EY |X [(Y )|X] (X)] = UY |X EX [(X) (X)]. Let µ := EX [(X) (X)] and µ := EY [(Y ) X Y (Y )], we have: µXY = UX|Y µ = UY |X µ Y X (7) Setting. We model uncertainty in a dynamic system using a partially observable Markov model, which is a joint distribution P (s1 , . . . , sT , o1 , . . . , oT ) where st is the hidden state at timestep t and ot is the corresponding observation. We assume Markovian dynamics, so that the joint distribution factorizes t t t t-1 as P o1 |s1 . The conditional t P (o |s ) P s |s t t-1 distribution P (s |s ) is called the transition model, and describes the evolution of the system from one timestep to the next; the distribution P (ot |st ) is called the observation model, and captures the uncertainty of a noisy measurement process. We focus on filtering, in which one queries the model for the posterior at some timestep conditioned on all past observations. Denote the history of the dynamic system as ht := (o1 , . . . , ot ). In filtering, one recursively maintains a belief state, P (st+1 |ht+1 ), in two steps: a prediction step and a conditioning step. The first updates the distribution by multiplying by the transition model and marginalizing out the previous timestep: P (st+1 |ht ) = Est |ht [P (st+1 |st )|ht ]. The second conditions the distribution on a new observation ot+1 using Bayes rule: P (st+1 |ht ot+1 ) P (ot+1 |st+1 )P (st+1 |ht ). Prediction in Hilbert space. The prediction and conditioning steps which we have discussed can be exactly reformulated with respect to the Hilbert space embeddings. We begin by discussing how the prediction step of inference can be implemented with respect to the estimated Hilbert space embedding of the distribution P (st+1 |ht ) = Est |ht [P (st+1 |st )|ht ]. Due to the fact that we maintain the belief state recursively, we can assume that µst |ht is known. Thus our goal is to express the Hilbert space embedding, µst+1 |ht , in terms of µst |ht and the conditional embedding Ust+1 |st , for the transition model P (st+1 |st ). We have the following (using the notation from section 2): Theorem 7 Hilbert space prediction step is given by: µst+1 |h = Ust+1 |st µst |h . Proof Using Markov property st+1 ht |st , we have Est+1 |h [(st+1 )|h] = Est |h [Est+1 |st [(st+1 )|st ]|h]. Furthermore, Est |h [Est+1 |st [(st+1 )|st ]|h] = Ust+1 |st µst |h using the sum rule in Section 4. Conditional cross-covariance. Sometimes, we would like to measure the strength of dependence between two random variables Y and Z conditioned on a given value X = x. To handle these situations, it will be convenient to first introduce a new operator, conditional cross-covariance operator 2 . In particular, just as the cross-covariance operator CXY enabled us to compute CovXY [f (X), g(Y )] using Equation (4), we would like to have an operator CY Z|x which would enable us to compute CovY Z|x [g(Y ), r(Z)|x] := EY Z|x [g(Y )r(Z)|x] - EY |x [g(Y )|x]EZ|x [g(Z)|x] via: g, CY Z|x r G or equivalently g r, CY Z|x GH . (8) We show defining the conditional cross-covariance by CY Z|x := µY Z|x - µY |x µZ|x (9) works: CovY Z|x [g(Y ), r(Z)|x] can be written as an inner product, g r, EY Z|x [(Y ) (Z)|x] - EY |x [(Y )|x] EZ|x [(Z)|x] GH ; Plugging in conditional embeddings leads to the expression. The conditional dependence between Y and Z given x can be quantified by, e.g. the Hilbert-Schmidt norm of CY Z|x . This new quantity measures the dependence between Y and Z locally at x; it is different from the dependence measure designed by Fukumizu et al. (2008) where the whole space of X is integrated out. 5. Application: Dynamical Systems Having now developed the necessary machinery for manipulating conditional embeddings, we turn our attention to learning and inference in a dynamical system. In this section, we formulate a simple nonparametric algorithm for learning models and performing probabilistic inference in dynamical systems. 2 A different operator with the same name also appeared in Fukumizu et al. (2004). There X is integrated out. Conditioning in Hilbert space. We now discuss how the conditioning step of inference can be implemented with respect to the Hilbert space embeddings, in which we obtain the embedding of the posterior distribution, µst+1 |ht ot+1 , given the prior and likelihood. Note that the prior term, µst+1 |ht , is supplied by the prediction step in the recursion. Here the goal is to obtain µst+1 |ht ot+1 by conditioning further on variable Hilbert Space Embeddings of Conditional Distributions ot+1 , and we accomplish this iterated conditioning by using the conditional cross-covariance operator from Section 4. We have the following: Theorem 8 Hilbert space conditioning step is given -1 by: µst+1 |ht ot+1 = Cst+1 ot+1 |ht Cot+1 ot+1 |ht (ot+1 ). Proof This is a straightforward application of the conditional cross-covariance operator in section 4 and conditional embedding from Theorem 4. As an intermediate step, the embeddings for µot+1 ot+1 |ht and µot+1 |ht are needed. They can be derived using the sum rule and the Markov property. While having the exact updates for prediction (Theorem 7) and conditioning (Theorem 8) is theoretically satisfying, it does not lead to practical estimation algorithms due to the fact that the conditional crosscovariance operators need to be estimated in each conditioning step, which is both statistically difficult and computationally costly. In the following, we will make simplifying assumptions to the dynamical systems and derive an efficient approximate algorithm. Approximate inference. Our goal still remains to recursively maintain the Hilbert space embedding µst |ht . However, in our approximation, we directly construct the operator Ust+1 |st ot+1 = -1 Cst+1 (st ot+1 ) C(st ot+1 )(st ot+1 ) , which takes both the previous belief state and an observation as inputs, and outputs the predictive embedding for P (st+1 |st ot+1 ). That is µst+1 |st ot+1 = Ust+1 |st ot+1 ((st , ot+1 )), where (·) is the joint feature map for st and ot+1 . In the same spirit as the decomposition of sufficient statistics in exponential families (Altun et al., 2004; Zhang et al., 2009), we use the concatenation of the feature map (st ) and (ot+1 ) as ((st , ot+1 )); that is ((st , ot+1 )) = ((st ) , (ot+1 ) ) whose kernel decomposes into the sum of two kernels: ((s, o)), ((s , o )) = k(s, s ) + u(o, o ). With this feature map, the operator Ust+1 |st ot+1 can be equivalently viewed as the concatenation of two operators, Ust+1 |st ot+1 = (T1 , T2 ). More specifically, µst+1 |st ot+1 = T1 (st ) + T2 (ot+1 ) (10) where we have approximated the distribution P (st |ht , ot+1 ) by a less confident one P (st |ht ). Updating the belief state with respect to RKHS embeddings thus conveniently decomposes into two simple operations: first, propagate from time t to time t + 1 via T1 µst |h ; and second, account for the most current current observation via T2 (ot+1 ). The main point which distinguishes our approximation from the exact inference operations is that the approximate approach carries out the prediction and conditioning operations in parallel while the exact method performs the operations sequentially. Computationally, our approximate algorithm is also quite efficient. The operators T1 and T2 only needs to be estimated once (during the training stage), and at each filtering iteration, we only need to perform a matrix-vector multiplication. More specifically: Theorem 9 Let G := , and uot+1 := (ot+1 ). The update rule for the belief state is given by: µst+1 |hot+1 (HK + HU + mI)-1 H(G + ^ uot+1 ). Learning can be accomplished in O(m3 ) time and each filtering iteration requires O(m2 ) time. Proof Applying Theorem 5, we have µst+1 |hot+1 as ^ (HK+HU +mI)-1 H( µst |h + (ot+1 )), where ^ we have used the fact that ((s, o)), ((s , o )) = k(s, s ) + u(o, o ). Since the belief state µst |h is main^ tained by and a set of weights , replacing µst+1 |h ^ by , we have the update rule in the theorem. Let ~ ~ ~ T2 := (HK + HU + mI)-1 H and T1 := T2 G (note t t+1 that since s and s are in the same space, G is a valid inner product matrix). Therefore, updating µst+1 |hot+1 is also equivalent to updating via ^ ~ ~ ~ ~ T1 + T2 uot+1 . Computing T1 and T2 in the learning stage is dominated by a matrix inversion which is O(m3 ) in general. In the filtering stage, applying the update rule for involves O(m) kernel evaluations and two O(m2 ) matrix-vector multiplications. Each filtering iteration is O(m2 ) in term of the number of kernel evaluations. MAP inference. We have now shown how to maintain a posterior distribution over the hidden state conditioned on the current and all past observations. Sometimes, however, we would like to determine the MAP (maximum a posteriori) assignments of hidden states. We accomplish MAP inference by performing the optimization st+1 = ^ 2 argmins (s) - µst+1 |hot+1 G . Since the belief state ^ µst+1 |hot+1 is maintained as the feature matrix and ^ a set of weights , the optimization can be expressed using kernels as: st+1 = argmaxs 2 ls - l(s, s), ^ (12) Next we want to marginalize out variable st and condition on the history ht . We follow the same logic from the sum rule in Section 4, and take the expectation of µst+1 |st ot+1 with respect to P (st |ht ot+1 ): µst+1 |ht ot+1 = Est |ht ot+1 µst+1 |st ot+1 = T1 Est |ht ot+1 (st )|ht ot+1 + T2 (ot+1 ) T1 Est |ht (st )|ht + T2 (ot+1 ) = T1 µst |ht + T2 (ot+1 ), (11) Hilbert Space Embeddings of Conditional Distributions Algorithm 1 Learning Input: = ((st )), = ((st+1 )) and = ((ot+1 )) where t = 1 . . . m. 1: Compute K = , U = , G = . ~ 2: Compute T2 := (HK + HU + mI)-1 H, ~ ~ 3: Compute T1 := T2 G . Algorithm 2 Filtering 1 ~ ~ Input: , T1 , T2 from learning, and let = m 1. 1: for each new observation ot+1 do 2: Compute uot+1 = (ot+1 ). ~ ~ 3: Belief update: T1 + T2 uot+1 . 4: Inference: st+1 = argmaxs 2 ls - l(s, s). ^ 5: end for a function W(x) is learned for mapping from x to (y). By solving an optimization, they derive the form, (K + I)-1 , for W. This is exactly the conditional embedding operator UY |X we discussed in this paper. Therefore, our work immediately provides theoretical and statistical support for their approach. Finally, Vishwanathan et al. (2007) considered linear dynamical systems in an RKHS, where a transition operator A and an observation operator B are used to describe the dynamics: (st+1 ) = A(st ) and (ot+1 ) = B(st+1 ). Under our Hilbert space embedding view of dynamical systems, the operators A and B correspond to Ust+1 |st and Uot+1 |st+1 respectively, which enables the entire distribution for the hidden state st to propagate through the dynamics. where ls := (s). Note that this optimization may be a hard problem in general, and is analogous to the argmax operation in structured SVMs (Tsochantaridis et al., 2005). In many cases, however, it is possible to define the feature map (s) in such a way that an efficient algorithm for solving the optimization (Equation (12)) can be obtained. For instance, consider the identity management problem arising in multiobject tracking, where the hidden states are permutations (one-to-one matchings between tracks and identities). Using the kernel l(, ) := tr( ) defined over permutation matrices, we obtain a linear assignment problem which can be solved efficiently using a variety of algorithms. Algorithm. Our method maintains a representation of the state distribution as a weighted sum of feature functions evaluated at the set of training examples. To propagate the state distribution with respect to Markovian dynamics and to condition, our algorithms can simply be implemented using kernel evaluations and matrix operations on the weight vector . We summarize our method using the pseudocode in Algorithms 1 and 2. In the learning phase, we use the training data to form the appropriate kernel matrices and operator estimates. During the filtering phase, we are presented with an online input sequence of test observations and we use the update rule to maintain state distributions at each timestep. The user-specified parameters are kernel choices for the state and observation spaces, and the regularization parameter. 7. Experiments We evaluate our algorithm on two dynamical systems: a linear dynamical systems with known model parameters, and a challenging camera tracking problem. Synthetic data. We simulate a particle rotating around the origin in R2 according to the following cos() sin() t dynamics: st+1 = s + , ot+1 = - sin() cos() st+1 + where = 0.02 radians, and are process noise and observation noise respectively. To evaluate our algorithm, we compare the performance of our method to the performance of a Kalman filter in estimating the position of the particle. The Kalman filter requires the exact knowledge of the dynamics in order to perform filtering, while our method only requires a sequence of training pairs of hidden states and observations. We study how the different methods respond to various noise models. In particular, we use 4 noise configurations arranged in increasing nonlinearity (Let Q1 := N (-0.2, 10-2 I) and Q2 := N (0.2, 10-2 I)): Process noise Observation noise (i) N (0, 10-2 I) N (0, 10-1 I) -1 (ii) N (0, 10 I) N (0, 10-2 I) 1 (iii) Gamma (1, 3 ) N (0, 10-1 I) 1 (iv) Gaussian Mixture 1 Q1 + 2 Q2 N (0, 10-2 I) 2 6. Related Work Gaussian process regression uses kx (K +I)-1 y as the predictive mean (Rasmussen & Williams, 2006). As it turns out, this is equivalent to the conditional embedding of the distribution P (Y |x) restricted to a linear feature map on Y . Cortes et al. (2005) studied kernel dependency estimation for transduction. There, We also study how our method scales in terms of filtering performance as more training data are available. We generate training data by observing the particle rotate for 2n steps where n = 1 . . . 6. The test data always contains observations from a single cycle. For each training set size, we randomly instantiate the experiment 50 times to obtain the standard errors. We use an RBF kernel for both the hidden state and the observations, and we set the regularization parameter to = 10-3 . We perform MAP inference by searching over a predefined grid over [-3 : 0.03 : 3]2 (alternatively, one can also use a gradient descent for MAP). Hilbert Space Embeddings of Conditional Distributions (a) case (i) (b) case (ii) We generate a sequence of image observations as follows: first we uniformly sample 180 random rotation matrices. Then we interpolate between these anchor positions using 20 midpoints. For each generated rotation matrix, we render an image using the rotation matrix as camera parameters. In total we obtain 3600 frames, and we use the first 1800 frames for training and the remaining 1800 frames for testing. The dynamics governing the camera rotation is a piece-wise smooth random walk. This is an unconventional dynamical system in that the hidden state is a rotation matrix R from SO(3); and the observations are images which are high dimensional spaces with correlation between pixel values. We flatten each image to a vector of R1200 , and apply a Gaussian RBF kernel. The bandwidth parameter of the kernel is fixed using the median distance between image vectors, and = 10-6 . We use an ~ inner product kernel between two rotations R and R, ~ ~ := tr(R R). Using this kernel, we can peri.e. k(R, R) form efficient MAP inference for the rotation matrix. The optimization problem then becomes a linear program over SO(3): argmaxR tr(A R) s.t. R SO(3), where A is obtained from and can be efficiently solved using a steepest descent algorithm over Stiefel manifold (Abrudan et al., 2008). We compare our method to a Kalman filter and random guessing. For the Kalman filter, we used the quaternion corresponding to a rotation matix R as the hidden state and the image vectors as the observations. We learn the model parameters of the linear dynamical system using linear regression. We consider ^ two error metrics. The first is tr(R R) which is the equal to 1 + 2 cos() where is the angle between the ^ true rotation matrix R and its estimate R. This metric ranges between [-1, 3]. As a second metric, we use ^ the Frobenius norm R - R F ro . Note that these two ^ ^ measures are related, i.e. R- R 2 ro = 6-2 tr(R R). F The results on the test image sequence are shown in Table 7. We can see that in terms of both error metrics, our method performs significantly better than the Kalman filter and random guessing. To investigate whether incorporating dynamics really helps to better estimate the camera rotations, we compare our RKHS filter to a prediction using conditional embedding3 that completely ignores the dynamics (camera orientations are close for adjacent timepoints). We add zero mean Gaussian white noise to the images and study the performance scaling of the two methods as we increase the noise variance. We ob ¸ ^ We use R := argmaxR R, UY |X (x) where UY |X is the conditional embedding from images X to rotations Y . 3 (c) case (iii) (d) case (iv) Figure 1. Averge RMS and standard error over 50 runs of the experiment for noise configurations i,ii,iii and iv. KF: Kalman filter; RKHS: our method. The average RMS (root mean square error) over 50 runs and the standard errors are shown in Figure 1. The general trend is that, as more training data are available, both the average RMS and the standard error of our method decrease monotonically. In cases when the Kalman filter is optimal (case i, ii), the performance of our method approaches that of the Kalman filter as more training data become available. In nonlinear dynamical systems (case iii, iv), our method outperforms the Kalman filter when training datasets are sufficiently large. This is expected since the Kalman update is specifically designed for Gaussian noise while our nonparametric method is more versatile in handling general noise distributions. Camera rotation. Here we apply our dynamical system tools to a computer vision problem. We try to determine the camera orientation based on the images it observes. In our setting, the camera focal point is fixed at a position and traces out a smooth path of rotations while making observations. To evaluate our algorithm, we use POVRAY (www.povray.org) to render images observed by the camera. The virtual scene is a rectangular-shaped room with a ceiling light and two pieces of furniture. The images exhibit complex lighting effects such as shadows, interreflections, and global illumination, all of which make determining the camera rotation difficult especially for noisy cases. As an illustration, we display sample images in Figure 2. The sizes of the color images are 100 × 100 × 3. We downsample them to 20 × 20 × 3 in order for our competitor, the Kalman filter, to learn its model efficiently. Hilbert Space Embeddings of Conditional Distributions (a) (b) Figure 2. (a) Example rendered image of the well known Cornell box with a rotating camera. (b) RKHS filter compared to a simple prediction that ignores dynamics in estimating camera rotation. Table 2. Errors in estimating the camera rotation matrices by three methods. Note that for the trace measure, the larger the number the better the performance; but for the Frobenius norm, the smaller the number the better. Error Metric RKHS Kalman Random Trace 2.91 ± 0.005 2.18 ± 0.016 0.01 ± 0.023 Frobenius 0.17 ± 0.005 1.18 ± 0.012 2.40 ± 0.023 serve that the performance of the RKHS filter which does take into account the dynamics degrades more gracefully for increasing noise levels (Figure 2(b)). 8. Conclusion We have presented a general framework for embedding conditional distributions into a reproducing kernel Hilbert space. The resulting embedding method greatly improves our ability to manipulate distributions in RKHS, and in particular, we showed how conditional embeddings are useful for modeling and performing nonparametric inference on dynamical systems. The advantage using the Hilbert space embedding lies in its generality and ease of implementation. Our method suffers from several limitations at the moment. For example, we require training data for the hidden states. We believe that extending our embedding approach along these directions will be both fruitful and interesting for future research, and we expect that conditional embedding will be a useful tool for learning problems beyond dynamical systems. Acknowledgments LS is supported by a Ray and Stephenie Lane Fellowship. JH is supported by the ONR under MURI N000140710747, and the NSF under grants DGE0333420, EEEC-540865, NeTS-NOSS 0626151, TF0634803. References Abrudan, T., Eriksson, J., & Koivunen, V. (2008). Steepest descent algorithms for optimization under unitary matrix constraint. IEEE Transactions on Signal Processing, 56 (3). Altun, Y., Smola, A. J., & Hofmann, T. (2004). Exponential families for conditional random fields. In Uncertainty in Artificial Intelligence. Baker, C. (1973). Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186, 273­289. Cortes, C., Mohri, M., & Weston, J. (2005). A general regression technique for learning transductions. In International Conference on Machine Learning. Fukumizu, K., Bach, F. R., & Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5, 73­99. Fukumizu, K., Gretton, A., Sun, X., & Sch¨lkopf, B. o (2008). Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20. Gretton, A., Borgwardt, K., Rasch, M., Sch¨lkopf, B., o & Smola, A. (2007). A kernel method for the twosample-problem. In Advances in Neural Information Processing Systems 19. Quadrianto, N., Smola, A., Caetano, T., & Le, Q. (2008). Estimating labels from label proportions. In International Conference in Machine Learning. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press. Sch¨lkopf, B., Tsuda, K., & Vert, J.-P. (2004). Kero nel Methods in Computational Biology. Cambridge, MA: MIT Press. Smola, A., Gretton, A., Song, L., & Sch¨lkopf, B. o (2007). A hilbert space embedding for distributions. In Algorithmic Learning Theory. Song, L., Smola, A., Borgwardt, K., & Gretton, A. (2008). Colored maximum variance unfolding. In Advances in Neural Information Processing Systems 20. Sriperumbudur, B., Gretton, A., Fukumizu, K., Lanckriet, G., & Sch¨lkopf, B. (2008). Injective o hilbert space embeddings of probability measures. In Conference on Learning Theory. Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453­1484. Vishwanathan, S. V. N., Smola, A. J., & Vidal, R. (2007). Binet-Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. International Journal of Computer Vision, 73 (1), 95­119. Zhang, X., Song, L., Gretton, A., & Smola, A. (2009). Kernel measures of independence for non-iid data. In Advances in Neural Information Processing Systems 21.