Robot Trajectory Optimization using Approximate Inference

Marc Toussaint
TU Berlin, Franklinstr. 28/29 FR6-9, 10587 Berlin, Germany
mtoussai@cs.tu-berlin.de

Abstract

The general stochastic optimal control (SOC) problem in robotics scenarios is often too complex to be solved exactly and in near real time. A classical approximate solution is to first compute an optimal (deterministic) trajectory and then solve a local linear-quadratic-Gaussian (LQG) perturbation model to handle the system stochasticity. We present a new algorithm for this approach which improves upon previous algorithms like iLQG. We consider a probabilistic model for which the maximum likelihood (ML) trajectory coincides with the optimal trajectory and which, in the LQG case, reproduces the classical SOC solution. The algorithm then utilizes approximate inference methods (similar to expectation propagation) that efficiently generalize to non-LQG systems. We demonstrate the algorithm on a simulated 39-DoF humanoid robot.

1. Introduction

Trajectory optimization is a key problem in robotics, in particular when the time needed for optimization is critical. For instance, in interactive scenarios with a human, robots need to react to dynamic changes of the environment, and the optimization time should be no more than the temporal duration of the movement itself so that online sequencing is in principle possible. The aim of this work is to push the performance of trajectory optimization algorithms a little further using approximate inference techniques.

The transfer of inference methods to stochastic control and Reinforcement Learning (RL) problems has a long history. Examples in the context of decision making and RL include the early work on inference in influence diagrams (Cooper, 1988; Shachter, 1988), the relation between expectation-maximization and (immediate reward) RL observed by Dayan and Hinton (1997), Attias' (2003) inference approach to planning, and an EM approach to solving MDPs and POMDPs (Toussaint et al., 2006). In the control literature, the relation between stochastic optimal control (SOC) and inference (Kalman estimation) has long been known for the special LQG case (e.g., Stengel, 1986). However, one of the most interesting and not fully solved questions in the field is how this transfer can be generalized to non-LQG problems; we discuss such work in more detail below.

In this paper we do not try to propose an exact and general reformulation of SOC problems as an inference problem. Instead, in its present form (focusing on Gaussian messages), our approach aims at a local approximate solution to the SOC problem that is fast to compute. This is in the tradition of the classical SOC literature (see, e.g., Bryson & Ho, 1969; Stengel, 1986), where the LQG case is typically introduced as a (variational) model to control perturbations from an optimal trajectory. That is, the general non-linear SOC problem is approximately solved by first computing an optimal trajectory for the noise-free system, and then deriving a local LQG model of the perturbations around this optimal trajectory. As long as the system is not perturbed too far from the optimal trajectory, the corresponding linear quadratic regulator (LQR) is a reasonable solution to the SOC problem. The second stage of computing the local LQR is solved analytically by the Riccati equations (see below); the main difficulty is the non-linear trajectory optimization problem in the first stage.

The trajectory optimization problem can be solved, e.g., by gradient methods (typically with a spline-based trajectory encoding, e.g., Chen, 1991; Zhang & Knoll, 1995; Schlemmer & Gruebel, 1998) or by sequential quadratic programming (SQP) schemes. In every iteration of SQP one devises a quadratic approximation of the objective (cost) function, the optimum of which can be computed using the Riccati equations. Hence, SQP typically involves iterating LQG solution methods. An instance of such an algorithm is iLQG (Todorov & Li, 2005). An aspect of such methods which contrasts with our approach is that in each SQP iteration one computes a global trajectory x_{0:T} (by "global" we mean "over the full time interval 0..T") and a global LQG approximation. Our framework will naturally lead to iterative updates of local messages rather than of a global trajectory.

In the LQG case, the similarity between the Riccati equations and Kalman's equation for state estimation is referred to as "Kalman duality" (Stengel, 1986; Todorov, 2008), but it strictly holds only in the LQG case. The crucial question is how to generalize this duality to the non-LQG case. Todorov (2008) introduces a special class of SOC problems which can be solved via inference (see also Kappen et al., 2009). An alternative to seeking an exact duality is to focus on optimal trajectories and consider a probabilistic model for which the ML solution coincides with the optimal trajectory (Todorov, 2008). We follow this latter approach and utilize approximate inference methods to efficiently find the ML trajectory and the local LQG solution around this trajectory. In previous work (Toussaint & Goerick, 2007) we applied the same idea to a special case of coupled task and joint space planning. In this paper we focus on the general theory and the SOC case.

In the next section we briefly introduce the framework of SOC, the LQG case and the SQP method iLQG. In section 3 we formulate the general probabilistic model. We derive an exact Gaussian message passing algorithm for the special LQG case and show how it is related to the Riccati equations. Then we generalize to the non-LQG case based on Expectation Propagation (Minka, 2001b). Finally, section 4 presents an evaluation on a simulated humanoid reaching problem.
2. Stochastic optimal control

We consider a discrete time stochastic controlled system of the form

    x_{t+1} = f_t(x_t, u_t) + \xi,   \xi \sim \mathcal{N}(0, Q_t),   (1)

with time step t, state x_t \in R^n, control u_t \in R^m, and Gaussian noise \xi of covariance Q_t. We also use the notation P(x_{t+1} | u_t, x_t) = \mathcal{N}(x_{t+1} | f_t(x_t, u_t), Q_t), where

    \mathcal{N}(x | a, A) \propto \exp\{ -\tfrac{1}{2} (x-a)^\top A^{-1} (x-a) \}   (2)

is a Gaussian over x with mean a and covariance A. For a given state-control sequence x_{0:T}, u_{0:T} we define the cost as

    C(x_{0:T}, u_{0:T}) = \sum_{t=0}^{T} c_t(x_t, u_t).   (3)

The optimal value function J_t(x) gives the expected future cost when in state x at time t for the best controls and obeys the Bellman optimality equation

    J_t(x) = \min_u \Big[ c_t(x, u) + \int_y P(y | u, x)\, J_{t+1}(y) \Big].   (4)

There are two versions of stochastic optimal control problems. The open-loop control problem is to find a control sequence u^*_{1:T} that minimizes the expected cost. The closed-loop (feedback) control problem is to find a control policy \pi_t : x_t \mapsto u_t (which exploits the true state observation in each time step and maps it to a feedback control signal) that minimizes the expected cost.

The linear quadratic Gaussian (LQG) case plays an important role as a perturbation model or as an ingredient in iterative solution methods. LQG is a linear control process with Gaussian noise,

    P(x_{t+1} | x_t, u_t) = \mathcal{N}(x_{t+1} | A_t x_t + a_t + B_t u_t, Q_t),

and quadratic costs,

    c_t(x_t, u_t) = x_t^\top R_t x_t - 2 r_t^\top x_t + u_t^\top H_t u_t.   (5)

The LQG process is defined by the matrices and vectors A_{0:T}, a_{0:T}, B_{0:T}, Q_{0:T}, R_{0:T}, r_{0:T}, H_{0:T}. As a convention, in the remainder of this paper we drop the subscript t for A, a, B, Q, R, r, H; wherever a subscript is missing we refer to time t.

The LQG case allows us to derive an exact backward recursion, called the Riccati equation, for the computation of the value function. The value function will always be a quadratic form in the state. Assume that at time t+1 the optimal value function can be expressed as

    J_{t+1}(x) = x^\top V_{t+1} x - 2 v_{t+1}^\top x   (6)

for some matrix V_{t+1} and some vector v_{t+1}. Applying equation (4) in a straightforward manner one can derive that J_t(x) is of the form J_t(x) = x^\top V_t x - 2 x^\top v_t + terms independent of x, with

    V_t = R + A^\top V_{t+1} A - K V_{t+1} A,   K := A^\top V_{t+1} (V_{t+1} + B^{-\top} H B^{-1})^{-1},   (7)
    v_t = r + A^\top (v_{t+1} - V_{t+1} a) - K (v_{t+1} - V_{t+1} a),   (8)

and the minimization in (4) is given by

    u^*_t(x) = -(H + B^\top V_{t+1} B)^{-1} B^\top \big( V_{t+1} (A x + a) - v_{t+1} \big).   (9)

Equations (7)-(9) are the Riccati equations (Stengel, 1986). (Using the Woodbury identity they can be rewritten in other forms.) Initialized with V_T = R_T and v_T = r_T, this gives a backward recursion to compute the value function J_t at each time step. The linear quadratic regulator (LQR) in (9) gives the optimal control policy, i.e., a solution to the closed-loop problem in the LQG case. Note that the optimal control and path are independent of the process noise Q.

As mentioned in the introduction, the LQG case is classically introduced as a perturbation model around an optimal trajectory of the deterministic system (Bryson & Ho, 1969; Stengel, 1986). Alternatively, the LQG case can also be applied as a means to find this optimal trajectory: optimizing the trajectory x_{0:T} can be addressed in the manner of sequential quadratic programming, where one starts with a guess of x_{0:T}, computes a second-order approximation of the cost around this initialization, uses an exact solver (LQG in this case) to jump to the optimum of this second-order approximation, and iterates. This method, termed iLQG by Todorov and Li (2005), computes a (locally) optimal trajectory and, as an aside, also provides the corresponding LQR around this trajectory which can be used to handle perturbations. Algorithm 1 makes this procedure explicit and includes an additional convergence rate parameter \alpha which greatly increases robustness.

Algorithm 1 iLQG
 1: Input: initial trajectory x_{0:T}, convergence rate \alpha, control costs H_{0:T}, functions A_t(x), a_t(x), B_t(x), R_t(x), r_t(x)
 2: Output: trajectory x_{0:T}, LQR V_{0:T}, v_{0:T}
 3: repeat
 4:   access R_T(x_T), r_T(x_T)
 5:   V_T <- R_T, v_T <- r_T
 6:   for t = T-1 down to 0 do   // backward Riccati recursion
 7:     access A_t(x_t), a_t(x_t), B_t(x_t), R_t(x_t), r_t(x_t)
 8:     compute V_t and v_t using (7, 8)
 9:   end for
10:   for t = 0 to T-1 do   // forward control recursion
11:     compute u^*_t(x_t) using (9)
12:     x_{t+1} <- (1 - \alpha) x_{t+1} + \alpha [A_t x_t + a_t + B_t u^*_t(x_t)]
13:   end for
14: until convergence
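For concreteness, the following NumPy sketch transcribes the backward Riccati recursion (7)-(8), the feedback control (9), and the smoothed forward update of Algorithm 1. It is an illustrative sketch, not the authors' implementation: the time-indexed matrices A, a, B, R, r, H are assumed to be supplied as lists, and B H^{-1} B^T is assumed invertible (otherwise the Woodbury form mentioned above would be used).

```python
import numpy as np

def lqr_backward(A, a, B, R, r, H):
    """Backward Riccati recursion, eqs. (7)-(8).
    All arguments are lists of time-indexed arrays for t = 0..T.
    Note that the process noise Q does not enter, as remarked in the text."""
    T = len(R) - 1
    V, v = [None] * (T + 1), [None] * (T + 1)
    V[T], v[T] = R[T], r[T]                          # initialization V_T = R_T, v_T = r_T
    for t in range(T - 1, -1, -1):
        Bbar = B[t] @ np.linalg.inv(H[t]) @ B[t].T   # B H^-1 B^T, assumed invertible here
        K = A[t].T @ V[t + 1] @ np.linalg.inv(V[t + 1] + np.linalg.inv(Bbar))
        V[t] = R[t] + A[t].T @ V[t + 1] @ A[t] - K @ V[t + 1] @ A[t]
        v[t] = r[t] + (A[t].T - K) @ (v[t + 1] - V[t + 1] @ a[t])
    return V, v

def lqr_control(x, t, A, a, B, H, V, v):
    """Optimal feedback control u_t^*(x), eq. (9), for t < T."""
    M = H[t] + B[t].T @ V[t + 1] @ B[t]
    return -np.linalg.solve(M, B[t].T @ (V[t + 1] @ (A[t] @ x + a[t]) - v[t + 1]))

def ilqg_forward_update(x, A, a, B, H, V, v, alpha):
    """Smoothed forward trajectory update, lines 10-13 of Algorithm 1."""
    x_new = [x[0]]
    for t in range(len(x) - 1):
        u = lqr_control(x_new[t], t, A, a, B, H, V, v)
        x_new.append((1 - alpha) * x[t + 1]
                     + alpha * (A[t] @ x_new[t] + a[t] + B[t] @ u))
    return x_new
```

In an iLQG iteration, one would relinearize the system around the current trajectory, call lqr_backward, and then ilqg_forward_update, repeating until convergence.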
3. Probabilistic inference approach

The classical approach to designing a good trajectory is to define a cost function (e.g., penalizing collisions, rewarding goals) and minimize the expected cost given a stochastic control model. An alternative way of defining what a good trajectory is consists in conditioning a probabilistic trajectory model on desired criteria (e.g., conditioning on not observing a collision, or on reaching a goal) and inferring the posterior distribution over trajectories conditioned on these criteria. In general these two ways of defining optimality are not fully equivalent (there is as yet no general version of Kalman's duality known), but both are interesting definitions of optimality. Accordingly, the formulation we present below will not in all respects be equivalent to the problem of expected cost minimization, but we can ensure that the ML trajectory of the conditioned probabilistic model coincides with the optimal trajectory in the classical case.

Let us introduce an additional binary random variable z_t in the process with conditional probability

    P(z_t = 1 | u_t, x_t) = \exp\{ -c_t(x_t, u_t) \}.   (10)

As mentioned in the introduction, this idea has a long history (Cooper, 1988; Shachter, 1988; Dayan & Hinton, 1997). Shachter and Peot (1992) even mention work by Raiffa (1969) and von Neumann and Morgenstern (1947) in this context. In our definition (10), the costs c_t(x_t, u_t) play the role of the neg-log-probability of z_t = 1, in accordance with the typical identification of energy with neg-log-probability; this differs from the approaches cited above, in which a binary random variable with probability proportional to reward or utility is defined. The neg-log approach will lead us directly to an inference version of LQG for which we can exploit approximate inference techniques.

Clearly, for a single state-control trajectory x_{0:T}, u_{0:T}, the log-likelihood

    \log P(z_{0:T} = 1 | x_{0:T}, u_{0:T}) = -C(x_{0:T}, u_{0:T})   (11)

is the negative classical cost (3); hence an ML trajectory coincides with the classical optimal trajectory. However, taking the expectation over trajectories (and using Jensen's inequality),

    \log P(z_{0:T} = 1) = \log E_{u_{0:T}, x_{0:T}} \Big\{ \prod_{t=0}^{T} P(z_t = 1 | u_t, x_t) \Big\}
        \geq E_{u_{0:T}, x_{0:T}} \Big\{ \log \prod_{t=0}^{T} P(z_t = 1 | u_t, x_t) \Big\}   (12)
        = E_{u_{0:T}, x_{0:T}} \Big\{ -\sum_{t=0}^{T} c_t(u_t, x_t) \Big\}   (13)
        = -E_{u_{0:T}, x_{0:T}} \{ C(u_{0:T}, x_{0:T}) \},

we find that likelihood maximization is in general not equivalent to expected cost minimization. In the LQG case there exists a concise relation, as we show in the next section.
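As a toy numerical illustration of the gap between (12) and (13) (not from the paper), one can compare log E[exp(-C)] with -E[C] for an arbitrary cost distribution standing in for the trajectory costs; by Jensen's inequality the log-likelihood upper-bounds the negative expected cost, so maximizing the two is generally not the same objective.

```python
import numpy as np

# Hypothetical cost samples standing in for C(x_{0:T}, u_{0:T}) under the
# stochastic dynamics; any nondegenerate distribution exhibits the gap.
rng = np.random.default_rng(0)
costs = rng.gamma(shape=2.0, scale=1.5, size=100_000)

log_likelihood = np.log(np.mean(np.exp(-costs)))   # log P(z_{0:T} = 1), eq. (12)
neg_expected_cost = -np.mean(costs)                # -E[C], eq. (13)

print(log_likelihood, neg_expected_cost)           # the first is larger (Jensen)
```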
3.1. LQG case and equivalence to Riccati

In the LQG case the costs c_t(x_t, u_t) are quadratic in x_t and u_t as given by (5). In terms of the binary random variable z_t this translates to the Gaussian probabilities

    P(z_t = 1 | x_t) \propto \mathcal{N}[x_t | r_t, R_t],   (14)
    P(u_t) = \mathcal{N}[u_t | 0, H],   (15)

where we interpret the quadratic cost over u_t as a prior and let z_t depend only on x_t (which is equivalent to, but more convenient than, a uniform prior over u_t and P(z_t = 1 | u_t, x_t) \propto \mathcal{N}[x_t | r_t, R_t]\, \mathcal{N}[u_t | 0, H]). The bracket notation

    \mathcal{N}[x | a, A] \propto \exp\{ -\tfrac{1}{2} x^\top A x + x^\top a \}   (16)

denotes a Gaussian over x in canonical form, with precision matrix A and mean A^{-1} a. We can simplify by integrating out the control u_t,

    P(x_{t+1} | x_t) = \int du_t\; \mathcal{N}(x_{t+1} | A x_t + a + B u_t, Q)\, \mathcal{N}[u_t | 0, H]
                     = \mathcal{N}(x_{t+1} | A x_t + a,\; Q + B H^{-1} B^\top).   (17)

Consider the problem of computing the posterior distribution P(x_{0:T} | z_{0:T} = 1) over state trajectories conditioned on the fact that we permanently observe z_t = 1. Inference is a standard forward-backward process. We prefer to derive this inference process in terms of a message passing algorithm because this allows us most directly to generalize the algorithm to the non-linear case. We briefly recap the definition of messages in pair-wise factor graphs; see (Yedidia et al., 2001; Murphy, 2002; Minka, 2001a) for a thorough introduction. Given two random variables X_i and X_j coupled by a pair-potential f_{ij}(X_i, X_j), the message passing equations are

    \mu_{j \to i}(X_i) = \sum_{X_j} f_{ij}(X_i, X_j) \prod_{k : k \neq i} \mu_{k \to j}(X_j),   (18)

where k indexes variables coupled to j other than i. Given all incoming messages to a variable, the posterior marginal belief is their product,

    b_i(X_i) := \prod_j \mu_{j \to i}(X_i).   (19)

Theorem 1. The messages in the LQG process defined by equations (17) and (14) are

    \mu_{x_{t-1} \to x_t}(x_t) = \mathcal{N}(x_t | s_t, S_t),
      s_t = a_{t-1} + A_{t-1} (S_{t-1}^{-1} + R_{t-1})^{-1} (S_{t-1}^{-1} s_{t-1} + r_{t-1}),
      S_t = Q + B_{t-1} H^{-1} B_{t-1}^\top + A_{t-1} (S_{t-1}^{-1} + R_{t-1})^{-1} A_{t-1}^\top,   (20)

    \mu_{x_{t+1} \to x_t}(x_t) = \mathcal{N}(x_t | v_t, V_t),
      v_t = -A_t^{-1} a_t + A_t^{-1} (V_{t+1}^{-1} + R_{t+1})^{-1} (V_{t+1}^{-1} v_{t+1} + r_{t+1}),
      V_t = A_t^{-1} \big[ Q + B_t H^{-1} B_t^\top + (V_{t+1}^{-1} + R_{t+1})^{-1} \big] A_t^{-\top},   (21)

    \mu_{z_t \to x_t}(x_t) = \mathcal{N}[x_t | r_t, R_t].   (22)

The straightforward derivation of these messages is given in the appendix. The backward messages are closely related to the Riccati equation, as we show below. However, note that there is no counterpart of the forward messages in the classical approaches. This is crucial: the forward messages are necessary to estimate a proper posterior belief over a state, and this belief will play an important role in the approximate methods below.

Concerning the backward messages, let us define

    \bar{V}_{t+1} = V_{t+1}^{-1} + R_{t+1},   (23)
    \bar{v}_{t+1} = V_{t+1}^{-1} v_{t+1} + r_{t+1},   (24)

which corresponds to a backward message (in canonical representation) that has the cost message already absorbed. Using a special case of the Woodbury identity, (A^{-1} + B)^{-1} = A - A (A + B^{-1})^{-1} A, the backward messages can be rewritten as

    V_t^{-1} = A^\top \big[ \bar{V}_{t+1}^{-1} + Q + B H^{-1} B^\top \big]^{-1} A
             = A^\top \bar{V}_{t+1} A - K \bar{V}_{t+1} A,   (25)
    K := A^\top \bar{V}_{t+1} \big[ \bar{V}_{t+1} + (Q + B H^{-1} B^\top)^{-1} \big]^{-1},   (26)
    V_t^{-1} v_t = -A^\top \bar{V}_{t+1} a_t + A^\top \bar{v}_{t+1} + K \bar{V}_{t+1} a_t - K \bar{v}_{t+1}
                 = A^\top (\bar{v}_{t+1} - \bar{V}_{t+1} a_t) - K (\bar{v}_{t+1} - \bar{V}_{t+1} a_t),   (27)
    \bar{V}_t = R_t + (A^\top - K) \bar{V}_{t+1} A,   (28)
    \bar{v}_t = r_t + (A^\top - K) (\bar{v}_{t+1} - \bar{V}_{t+1} a_t).   (29)

They correspond exactly to the Riccati equations (7), (8), except for the dependence on the process noise Q, which here interacts directly with the control cost metric H.
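For reference, a direct NumPy transcription of the message equations (20)-(22) might look as follows. This is an illustrative sketch assuming all matrices are given and well-conditioned; it is not the original implementation.

```python
import numpy as np

def fwd_message(s_prev, S_prev, r_prev, R_prev, A, a, B, H, Q):
    """Forward message mu_{x_{t-1} -> x_t} = N(x_t | s_t, S_t), eq. (20).
    A, a, B are the system matrices of time t-1, linearized at x_hat_{t-1}."""
    P = np.linalg.inv(np.linalg.inv(S_prev) + R_prev)           # (S^-1 + R)^-1
    s = a + A @ (P @ (np.linalg.solve(S_prev, s_prev) + r_prev))
    S = Q + B @ np.linalg.solve(H, B.T) + A @ P @ A.T           # B H^-1 B^T term
    return s, S

def bwd_message(v_next, V_next, r_next, R_next, A, a, B, H, Q):
    """Backward message mu_{x_{t+1} -> x_t} = N(x_t | v_t, V_t), eq. (21).
    A, a, B are the system matrices of time t, linearized at x_hat_t."""
    P = np.linalg.inv(np.linalg.inv(V_next) + R_next)
    Ainv = np.linalg.inv(A)
    v = -Ainv @ a + Ainv @ (P @ (np.linalg.solve(V_next, v_next) + r_next))
    V = Ainv @ (Q + B @ np.linalg.solve(H, B.T) + P) @ Ainv.T
    return v, V

# The cost message mu_{z_t -> x_t}(x_t) = N[x_t | r_t, R_t], eq. (22), is already
# given in canonical form by the local quadratic cost approximation (r_t, R_t).
```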
3.2. Approximate inference in the non-LQG case

Let X be a random variable for which we can express the belief

    b(X) = \prod_i t_i(X)   (30)

as a product of incoming messages t_i(X). Expectation Propagation (Minka, 2001b) is a method to iteratively improve the messages (and thereby the belief) when they are bound to be approximate. Let \mathcal{F} be a family of distributions. We constrain every approximate message \hat{t}_i \in \mathcal{F} to be in this family, and hence also the approximate belief \hat{b} \in \mathcal{F}. We update a message \hat{t}_i in such a way that the updated belief \hat{b} is as close as possible to "when we would use the exact message", i.e.,

    \hat{b} = \operatorname*{argmin}_{q \in \mathcal{F}} D\Big( \frac{\hat{b}}{\hat{t}_i}\, t_i \;\Big\|\; q \Big),   (31)

where the left argument of the KL-divergence is the belief when we replace the approximate message \hat{t}_i by the exact message t_i. The updated message \hat{t}_i that corresponds to this updated belief \hat{b} is

    \hat{t}_i = \operatorname*{argmin}_{q \in \mathcal{F}} D\Big( \frac{\hat{b}}{\hat{t}_i}\, t_i \;\Big\|\; \frac{\hat{b}}{\hat{t}_i}\, q \Big),   (32)

where \hat{b}/\hat{t}_i denotes the belief with the old approximate message removed.

In the case of robot trajectory optimization, the exact messages t_i that arise in typical problems are too complex to be expressed analytically. For instance, when conditioning a collision variable on no-collision, the implied message over the configuration space is extremely complex and, of course, discontinuous. In practice, the trajectory optimization process will be based on a (physical) simulator. In this paper we assume that this simulator can only provide us with local approximations of the dynamics and costs around a specific configuration x, in terms of the system matrices (or vectors) A_t(x), a_t(x), B_t(x), R_t(x), r_t(x). Better simulators could potentially provide better, more precise messages that could be employed in an EP framework. Our assumption implies that all messages are Gaussian, as in the LQG case, but the crucial question is which x has been selected, i.e., around which robot configuration we quadratize/linearize the system.

We deviate from iLQG, which iterates between globally (for all times t = 0:T) linearizing around a current trajectory x_{0:T} and then using the Riccati equation to recompute a better trajectory. In our framework we can locally update messages at a certain time step, in analogy to EP, until we reach local convergence before we move on to another time step. In principle, as with EP, we can freely choose the order in which to update messages, including which time steps to focus on.

We first initialize all messages \mu_{x_{t-1} \to x_t}, \mu_{x_{t+1} \to x_t}, \mu_{z_t \to x_t} for all times t to one, except for \mu_{x_{-1} \to x_0}(x) = \delta_{x = x_0}, which encodes conditioning on the known start state. Then, for a certain time step t, we iterate updating all three messages as follows:

(1) Choose a point of approximation \hat{x}_t depending on the mode argmax_{x_t} b(x_t) of the belief and compute the system matrices A_t(\hat{x}_t), a_t(\hat{x}_t), B_t(\hat{x}_t), R_t(\hat{x}_t), r_t(\hat{x}_t) at this point.
(2) Recompute the messages using this local approximation and equations (20-22).
(3) Compute the current belief over x_t, b(x_t) = \mu_{x_{t-1} \to x_t}(x_t)\, \mu_{x_{t+1} \to x_t}(x_t)\, \mu_{z_t \to x_t}(x_t).

Algorithm 2 is an explicit instantiation of this inference scheme, which we call Approximate Inference Control (AICO).

Algorithm 2 Approximate inference control (AICO)
 1: Input: start state x_0, control costs H_{0:T}, functions A_t(x), a_t(x), B_t(x), R_t(x), r_t(x), convergence rate \alpha, threshold \theta
 2: Output: trajectory x_{0:T}
 3: initialize s_0 = x_0, S_0^{-1} = 10^{10}, v_{0:T} = 0, V_{0:T}^{-1} = 0, r_{0:T} = 0, R_{0:T} = 0, k = 0
 4: repeat
 5:   for t = 1 : T do   // forward sweep
 6:     update s_t and S_t using (20)
 7:     if k = 0 then
 8:       \hat{x}_t <- s_t
 9:     else
10:       \hat{x}_t <- (1 - \alpha)\hat{x}_t + \alpha b_t
11:     end if
12:     access A_t(\hat{x}_t), a_t(\hat{x}_t), B_t(\hat{x}_t), R_t(\hat{x}_t), r_t(\hat{x}_t)
13:     update r_t and R_t using (22)
14:     update v_t and V_t using (21)
15:     update the belief b_t (and its covariance) using (19)
16:     if |\hat{x}_t - b_t|^2 > \theta then
17:       t <- t - 1   // repeat this time slice
18:     end if
19:   end for
20:   for t = T-1 : 0 do   // backward sweep
21:     ...same updates as above...
22:   end for
23:   k <- k + 1
24: until convergence
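The following Python skeleton sketches the control flow of Algorithm 2 (forward sweeps only; the backward sweep applies the same per-time-slice updates in reverse order). It relies on a hypothetical callback sys_matrices(x, t) returning A_t, a_t, B_t, R_t, r_t at a linearization point, on the fwd_message/bwd_message helpers sketched above, and on time-invariant H and Q for brevity. It is meant to illustrate the local relinearization, the belief fusion (19), the smoothed update with rate alpha, and the repetition of a time slice when the belief moves by more than theta; it is not a tuned implementation.

```python
import numpy as np

def aico_sweeps(x0, T, n, m, H, Q, sys_matrices, alpha=0.9, theta=0.1, sweeps=10):
    xhat = np.tile(np.asarray(x0, float), (T + 1, 1))        # linearization points
    A = np.tile(np.eye(n), (T + 1, 1, 1)); a = np.zeros((T + 1, n))
    B = np.zeros((T + 1, n, m))
    R = np.zeros((T + 1, n, n)); r = np.zeros((T + 1, n))    # local cost terms
    s = np.zeros((T + 1, n)); S = np.tile(1e-10 * np.eye(n), (T + 1, 1, 1))
    v = np.zeros((T + 1, n)); V = np.tile(1e10 * np.eye(n), (T + 1, 1, 1))
    b = xhat.copy()
    s[0] = xhat[0]                                           # start state is known
    A[0], a[0], B[0], R[0], r[0] = sys_matrices(xhat[0], 0)
    for k in range(sweeps):
        t = 1
        while t <= T:
            # forward message (20), using the time t-1 approximation
            s[t], S[t] = fwd_message(s[t-1], S[t-1], r[t-1], R[t-1],
                                     A[t-1], a[t-1], B[t-1], H, Q)
            # choose linearization point: forward mean on the first sweep,
            # smoothed step towards the belief mean afterwards
            xhat[t] = s[t] if k == 0 else (1 - alpha) * xhat[t] + alpha * b[t]
            A[t], a[t], B[t], R[t], r[t] = sys_matrices(xhat[t], t)
            if t < T:
                v[t], V[t] = bwd_message(v[t+1], V[t+1], r[t+1], R[t+1],
                                         A[t], a[t], B[t], H, Q)
            # belief (19): fuse forward, backward and cost messages
            prec = np.linalg.inv(S[t]) + np.linalg.inv(V[t]) + R[t]
            b[t] = np.linalg.solve(prec, np.linalg.solve(S[t], s[t])
                                          + np.linalg.solve(V[t], v[t]) + r[t])
            # repeat this time slice if the belief jumped by more than theta
            # (a practical implementation would bound the number of repeats)
            if np.sum((xhat[t] - b[t]) ** 2) <= theta:
                t += 1
    return xhat, b
```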
4. Experiments

We test the methods iLQG, AICO, and spline-based gradient optimization on some standard robot motion problems under multiple task constraints. In this section we briefly explain the cost term c_t(x_t) that is implied by these constraints. We assume to have three different kinematic mappings \phi_i : x \mapsto y_i which map the robot state x to different task variables y_i, which are the following:

y_1 \in R^3 is the robot's finger tip position. We constrain this to be close to a goal position for t = T.

y_2 \in R^2 is the robot's center of gravity projected onto the horizontal plane. We constrain this to be close to zero throughout the trajectory, which implies keeping balance over the foot support.

y_3 \in R is a scalar measuring collision danger; more precisely, if d_j is the shortest distance between a pair j of collidable objects, then y_3 = \sum_j \theta(\epsilon - d_j)\,(d_j - \epsilon)^2, with the Heaviside function \theta and margin \epsilon = 0.02 meter, so that only pairs closer than the margin contribute. We constrain this to be close to zero throughout the trajectory.

Our simulator allows us to compute \phi_i(x) and its Jacobian J_i(x) for any state. We assume we are given targets y^*_{i,0:T} in each task space and (time-dependent) precisions \varrho_{i,t} by which we want to follow the task targets. Given a point of approximation \hat{x}_t and \hat{J}_i = J_i(\hat{x}_t) we define

    c_t(x_t, u_t) = \sum_{i=1}^{3} \varrho_{i,t} [y^*_{i,t} - \phi_i(x_t)]^2 + u_t^\top H_t u_t
                 \approx \sum_{i=1}^{3} \varrho_{i,t} [y^*_{i,t} - \phi_i(\hat{x}_t) + \hat{J}_i \hat{x}_t - \hat{J}_i x_t]^2 + u_t^\top H_t u_t
                 = \sum_{i=1}^{3} \varrho_{i,t} \big[ x_t^\top \hat{J}_i^\top \hat{J}_i x_t - 2 (y^*_{i,t} - \phi_i(\hat{x}_t) + \hat{J}_i \hat{x}_t)^\top \hat{J}_i x_t + const \big] + u_t^\top H_t u_t,   (33)

    R_t = \sum_{i=1}^{3} \varrho_{i,t}\, \hat{J}_i^\top \hat{J}_i,   (34)

    r_t = \sum_{i=1}^{3} \varrho_{i,t}\, \hat{J}_i^\top (y^*_{i,t} - \phi_i(\hat{x}_t) + \hat{J}_i \hat{x}_t).   (35)
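As an illustration of how the local cost terms (34)-(35) could be assembled from such task mappings, consider the following sketch; the callbacks for phi_i, their Jacobians, the targets and the precisions are hypothetical stand-ins for the simulator interface described above.

```python
import numpy as np

def task_cost_terms(x_hat, phis, jacobians, targets, precisions):
    """Build R_t, r_t of eqs. (34)-(35) around the linearization point x_hat.

    phis[i](x)      -> task value phi_i(x)      (e.g. fingertip position)
    jacobians[i](x) -> task Jacobian J_i(x)
    targets[i]      -> target y*_{i,t} in task space i
    precisions[i]   -> precision rho_{i,t} of task i at this time step
    (all hypothetical callbacks standing in for the simulator interface)
    """
    n = x_hat.shape[0]
    R = np.zeros((n, n))
    r = np.zeros(n)
    for phi, jac, y_star, rho in zip(phis, jacobians, targets, precisions):
        J = np.atleast_2d(jac(x_hat))
        y = np.atleast_1d(phi(x_hat))
        R += rho * J.T @ J                                   # eq. (34)
        r += rho * J.T @ (y_star - y + J @ x_hat)            # eq. (35)
    return R, r
```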
We investigate simulated humanoid reaching problems in two different scenarios. In the first scenario (Figure 1, left), starting from an upright posture the humanoid needs to reach a target point (black dot) with his right index finger which is far below a big obstacle in front of his head. The scenario is interesting in that the multiple criteria all are important: to reach the target without head collision, the robot should early on move its head to the right; and to keep balance on the left foot, the robot needs to stretch its right leg behind and twist the hip a bit. In the second scenario we investigate a more typical reaching problem: with the hand initially below a table, the robot needs to move in a wide circle around the table to reach the finger target position.

Figure 1. The two reaching problem scenarios. Top: start postures; bottom: example final postures.

In all experiments we consider a trajectory of length 200. We investigated two sets of precision parameters. In case (a) we chose \varrho_{1,T} = 10^5 and \varrho_{1,0:T-1} = 10^{-4} (we require a high precision of 10^5 for the final finger tip position, and practically no endeffector precision during the intermediate trajectory) and \varrho_{2,0:T} = \varrho_{3,0:T} = 10^5 (we require even higher precision in the collision and balance variables throughout the trajectory). To investigate the effect of a weak "attracting force" on the endeffector target we also considered the parameter \varrho_{1,T} = 10^2 in case (b). To collect statistics of the algorithms' performance we randomize the target position in each of 10 trials by adding 3D Gaussian noise with 5 cm standard deviation.

Figure 2 displays the optimization curves (i.e., the cost of the current trajectory after each iteration of iLQG, respectively the cost of the ML trajectory after a forward or backward sweep of AICO) over computation time for both scenarios and parameter settings. We also tried a faster convergence rate \alpha = 0.9 for iLQG, which led to divergence of the algorithm in some cases. Generally, we can observe that the inference approach converges fast; in particular it needs only very few initial iterations to reach good solutions. A trajectory with cost below 0.5 is with high probability collision free. Although each sweep of AICO needs more computation time than an iteration of iLQG, it converges faster measured in computation time. Our interpretation is that the adaptation steps made by AICO when updating a local belief are better estimates of the true local optimum, presumably because the forward messages (including their uncertainty) are also exploited. The gradient-based method, which does not exploit local second-order information, performs much worse, also because the spline encoding restricts the space of trajectories.

Figure 2. Cost over time during the optimization for AICO(0.9, 0.1), iLQG(0.8) and GRAD(spline); axes: computation time in seconds vs. cost above optimum. Top: problems 1a and 1b; bottom: problems 2a and 2b. Costs are given in log scale and scaled so that zero cost corresponds to the best solution found by any method. Standard deviations are computed over 10 randomized problem instances; for better visibility in the log scaling, only upper error bars indicating the standard deviation are drawn.

Table 1 gives results for alternative parameter settings. Here we measure the computation time until a trajectory achieves a cost-above-optimum of less than 0.01. AICO consistently performs better than iLQG. The smoothing (\alpha < 1) and the repeated update of single time slices (\theta < \infty) significantly improve the performance of AICO. iLQG does not converge on some instances for \alpha close to 1.

Table 1. For two reaching problems (1 & 2) and two cost parameter settings (a & b) we instantiated 10 instances by randomizing the target position. We give the convergence time (time to reach a cost less than 0.01 above the optimum) and the final cost, with standard deviations.

  prob. | method             | time (sec)    | cost
  1a    | AICO(a=1,  th=.1)  | 3.52 +/- 0.70 | 0.15 +/- 0.03
        | AICO(a=.9, th=.1)  | 3.24 +/- 0.52 | 0.15 +/- 0.03
        | AICO(a=.9, th=inf) | 3.56 +/- 0.61 | 0.15 +/- 0.03
        | iLQG(a=.8)         | 6.04 +/- 1.14 | 0.15 +/- 0.03
  1b    | AICO(a=1,  th=.1)  | 2.24 +/- 0.51 | 0.09 +/- 0.01
        | AICO(a=.9, th=.1)  | 2.15 +/- 0.51 | 0.09 +/- 0.01
        | AICO(a=.9, th=inf) | 2.29 +/- 0.57 | 0.09 +/- 0.01
        | iLQG(a=.8)         | 4.22 +/- 1.79 | 0.09 +/- 0.01
  2a    | AICO(a=1,  th=.1)  | 4.18 +/- 0.52 | 0.06 +/- 0.01
        | AICO(a=.9, th=.1)  | 2.33 +/- 0.65 | 0.06 +/- 0.01
        | AICO(a=.9, th=inf) | 4.09 +/- 0.97 | 0.06 +/- 0.01
        | iLQG(a=.8)         | 5.68 +/- 0.92 | 0.06 +/- 0.01
  2b    | AICO(a=1,  th=.1)  | 3.26 +/- 0.51 | 0.05 +/- 0.00
        | AICO(a=.9, th=.1)  | 2.32 +/- 1.20 | 0.05 +/- 0.00
        | AICO(a=.9, th=inf) | 2.68 +/- 1.19 | 0.04 +/- 0.00
        | iLQG(a=.8)         | 5.28 +/- 0.88 | 0.05 +/- 0.01

(Here a denotes the convergence rate \alpha and th the threshold \theta of Algorithm 2.)
5. Conclusion

We presented an approach to using approximate inference methods for robot trajectory optimization and stochastic control. Unlike previous approaches that tried to generalize Kalman's duality fully to non-LQG cases, we aimed for a local approximate solution of the SOC problem and formulated a probabilistic trajectory model for which (1) the ML solution coincides with the deterministic optimal trajectory, and (2) the local Gaussian belief approximation is equivalent to the LQG perturbation model around the optimal trajectory. Based on this model we derived an approximate message update algorithm, in analogy to Expectation Propagation, which outperforms previous local approximate SOC algorithms like iLQG.

There are a number of possible extensions of the proposed algorithm. When the simulator returns more detailed information, e.g., on local collision planes, more precise messages (e.g., products of discontinuous boundaries with Gaussians) can be computed in the Expectation Propagation framework. Also non-Gaussian belief approximations in the exponential family can be integrated. Finally, better heuristics for the order of message updates might further increase efficiency.

Acknowledgements

This work was supported by the German Research Foundation, Emmy Noether fellowship TO 409/1-3.

References

Attias, H. (2003). Planning by probabilistic inference. Proc. of the 9th Int. Workshop on Artificial Intelligence and Statistics.
Bryson, A. E., & Ho, Y.-C. (1969). Applied optimal control. Blaisdell Publishing Company.
Chen, Y.-C. (1991). Solving robot trajectory planning problems with uniform cubic B-splines. Optimal Control Applications and Methods, 12, 247-262.
Cooper, G. (1988). A method for using belief networks as influence diagrams. Proc. of the Fourth Workshop on Uncertainty in Artificial Intelligence (pp. 55-63).
Dayan, P., & Hinton, G. E. (1997). Using expectation-maximization for reinforcement learning. Neural Computation, 9, 271-278.
Kappen, B., Gomez, V., & Opper, M. (2009). Optimal control as a graphical model inference problem.
Minka, T. (2001a). A family of algorithms for approximate Bayesian inference. PhD thesis, MIT.
Minka, T. P. (2001b). Expectation propagation for approximate Bayesian inference. Proc. of the 17th Annual Conf. on Uncertainty in AI (UAI 2001) (pp. 362-369).
Murphy, K. (2002). Dynamic Bayesian networks: Representation, inference and learning. PhD thesis, UC Berkeley, Computer Science Division.
Schlemmer, M., & Gruebel, G. (1998). Real-time collision-free trajectory optimization of robot manipulators via semi-infinite parameter optimization. Int. Journal of Robotics Research, 17, 1013-1021.
Shachter, R. D. (1988). Probabilistic inference and influence diagrams. Operations Research, 36, 589-605.
Shachter, R., & Peot, M. (1992). Decision making using probabilistic inference methods. Proc. of the Eighth Conf. on Uncertainty in Artificial Intelligence (pp. 276-283).
Stengel, R. F. (1986). Stochastic optimal control. Wiley.
Todorov, E. (2008). General duality between optimal control and estimation. Proc. of the 47th IEEE Conf. on Decision and Control.
Todorov, E., & Li, W. (2005). A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. Proc. of the American Control Conference, 300-306.
Toussaint, M., & Goerick, C. (2007). Probabilistic inference for structured planning in robotics. Int. Conf. on Intelligent Robots and Systems (IROS 2007) (pp. 3068-3073).
Toussaint, M., Harmeling, S., & Storkey, A. (2006). Probabilistic inference for solving (PO)MDPs (Technical Report EDI-INF-RR-0934). University of Edinburgh, School of Informatics.
Yedidia, J., Freeman, W., & Weiss, Y. (2001). Understanding belief propagation and its generalizations.
Zhang, J., & Knoll, A. (1995). An enhanced optimization approach for generating smooth robot trajectories in the presence of obstacles. Proc. of the 1995 European Chinese Automation Conference (pp. 263-268). London.

A. Message derivations

Proof. From equation (18) we have

    \mu_{x_{t-1} \to x_t}(x_t)
      = \int dx_{t-1}\; P(x_t | x_{t-1})\, \mu_{x_{t-2} \to x_{t-1}}(x_{t-1})\, \mu_{z_{t-1} \to x_{t-1}}(x_{t-1})
      = \int dx_{t-1}\; \mathcal{N}(x_t | A_{t-1} x_{t-1} + a_{t-1},\; Q + B_{t-1} H^{-1} B_{t-1}^\top)\, \mathcal{N}(x_{t-1} | s_{t-1}, S_{t-1})\, \mathcal{N}[x_{t-1} | r_{t-1}, R_{t-1}].

Using the Gaussian product rule, the last two terms give a Gaussian \mathcal{N}(s_{t-1} | R_{t-1}^{-1} r_{t-1}, S_{t-1} + R_{t-1}^{-1}) independent of x_t, which we can subsume in the normalization. What remains is

    \mu_{x_{t-1} \to x_t}(x_t)
      \propto \int dx_{t-1}\; \mathcal{N}(x_t | A_{t-1} x_{t-1} + a_{t-1},\; Q + B_{t-1} H^{-1} B_{t-1}^\top)\, \mathcal{N}[x_{t-1} | S_{t-1}^{-1} s_{t-1} + r_{t-1},\; S_{t-1}^{-1} + R_{t-1}]
      = \int dx_{t-1}\; \mathcal{N}(x_t | A_{t-1} x_{t-1} + a_{t-1},\; Q + B_{t-1} H^{-1} B_{t-1}^\top)\, \mathcal{N}(x_{t-1} | (S_{t-1}^{-1} + R_{t-1})^{-1} (S_{t-1}^{-1} s_{t-1} + r_{t-1}),\; (S_{t-1}^{-1} + R_{t-1})^{-1})
      = \mathcal{N}(x_t | A_{t-1} (S_{t-1}^{-1} + R_{t-1})^{-1} (S_{t-1}^{-1} s_{t-1} + r_{t-1}) + a_{t-1},\; Q + B_{t-1} H^{-1} B_{t-1}^\top + A_{t-1} (S_{t-1}^{-1} + R_{t-1})^{-1} A_{t-1}^\top),

which gives the messages as in (20). Concerning \mu_{x_{t+1} \to x_t}(x_t) we have

    \mu_{x_{t+1} \to x_t}(x_t)
      = \int dx_{t+1}\; P(x_{t+1} | x_t)\, \mu_{x_{t+2} \to x_{t+1}}(x_{t+1})\, \mu_{z_{t+1} \to x_{t+1}}(x_{t+1})
      = \int dx_{t+1}\; \mathcal{N}(x_{t+1} | A_t x_t + a_t,\; Q + B_t H^{-1} B_t^\top)\, \mathcal{N}(x_{t+1} | v_{t+1}, V_{t+1})\, \mathcal{N}[x_{t+1} | r_{t+1}, R_{t+1}]
      \propto \int dx_{t+1}\; \mathcal{N}(x_{t+1} | A_t x_t + a_t,\; Q + B_t H^{-1} B_t^\top)\, \mathcal{N}[x_{t+1} | V_{t+1}^{-1} v_{t+1} + r_{t+1},\; V_{t+1}^{-1} + R_{t+1}]
      = \mathcal{N}(A_t x_t + a_t | (V_{t+1}^{-1} + R_{t+1})^{-1} (V_{t+1}^{-1} v_{t+1} + r_{t+1}),\; Q + B_t H^{-1} B_t^\top + (V_{t+1}^{-1} + R_{t+1})^{-1})
      = \mathcal{N}(x_t | -A_t^{-1} a_t + A_t^{-1} (V_{t+1}^{-1} + R_{t+1})^{-1} (V_{t+1}^{-1} v_{t+1} + r_{t+1}),\; A_t^{-1} [Q + B_t H^{-1} B_t^\top + (V_{t+1}^{-1} + R_{t+1})^{-1}] A_t^{-\top}),

which gives the messages as in (21).
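As a quick numerical sanity check of the forward message (20) (not part of the paper), the closed form can be compared against brute-force sampling of the integrand in the derivation above, here for a scalar example with arbitrary values.

```python
import numpy as np

# Spot-check of the forward message (20) in one dimension (illustrative only).
rng = np.random.default_rng(1)
A, a, B, H, Q = 0.8, 0.3, 1.0, 2.0, 0.1          # hypothetical scalar system
s_prev, S_prev = 0.5, 0.4                        # previous forward message
r_prev, R_prev = 1.2, 3.0                        # previous cost message (canonical)

# closed form (20)
P = 1.0 / (1.0 / S_prev + R_prev)
s = a + A * P * (s_prev / S_prev + r_prev)
S = Q + B / H * B + A * P * A

# brute force: sample x_{t-1} from N(.|s_prev,S_prev) * N[.|r_prev,R_prev], propagate
prec = 1.0 / S_prev + R_prev
mean = (s_prev / S_prev + r_prev) / prec
x_prev = rng.normal(mean, np.sqrt(1.0 / prec), size=200_000)
x_t = A * x_prev + a + rng.normal(0.0, np.sqrt(Q + B / H * B), size=x_prev.size)

print(s, x_t.mean())    # means agree up to Monte Carlo error
print(S, x_t.var())     # variances agree up to Monte Carlo error
```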