Trajectory Prediction: Learning to Map Situations to Robot Trajectories

Nikolay Jetchev (jetchev@cs.tu-berlin.de)
Marc Toussaint (mtoussai@cs.tu-berlin.de)
TU Berlin, Franklinstr. 28/29, 10587 Berlin, Germany

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009.

Abstract

Trajectory planning and optimization is a fundamental problem in articulated robotics. Algorithms typically used for this problem compute optimal trajectories from scratch in each new situation. In effect, extensive data is accumulated containing situations together with the respective optimized trajectories -- but in practice this data is hardly exploited. The aim of this paper is to learn from this data. Given a new situation we want to predict a suitable trajectory which only needs minor refinement by a conventional optimizer. Our approach has two essential ingredients. First, to generalize from previous situations to new ones we need an appropriate situation descriptor -- we propose a sparse feature selection approach to find such well-generalizing features of situations. Second, the transfer of previously optimized trajectories to a new situation should not be made in joint angle space -- we propose a more efficient task space transfer of old trajectories to new situations. Experiments on a simulated humanoid reaching problem show that we can predict reasonable motion prototypes in new situations for which the refinement is much faster than an optimization from scratch.

1. Introduction

The animal and human ability to generate trajectories quickly is amazing. In typical every-day situations humans do not seem to require time for motion planning but execute complex trajectories instantly. This suggests that there exists a "reactive trajectory policy" which maps the situation (or at least motion-relevant features of the situation) to the whole trajectory.[1] Such a mapping (if optimal) is utterly complex: the output is not a single current control signal but a whole trajectory which, traditionally, would be the outcome of a computationally expensive trajectory optimization process accounting for collision avoidance, smoothness and other criteria. The input is the current situation, in particular the positions of relevant objects, for which it is unclear which representation and coordinate systems to use as a descriptor. The goal of this work is to learn such an (approximate) mapping from data of previously optimized trajectories in old situations. We coin this problem trajectory prediction.

[1] This is not to be confused with a reactive controller which maps the current sensor state to the current control signal -- such a (temporally local) reactive controller could not explain trajectories which efficiently circumvent obstacles in an anticipatory way, as humans naturally do in complex situations.

Existing approaches to trajectory optimization from scratch include traditional gradient-based methods, in particular using a spline-based representation (Zhang & Knoll, 1995), or sequential quadratic programming, of which iLQG (Todorov & Li, 2005) is an instance. These methods efficiently converge to local optima, which is appropriate when the cost function implies suitable gradients, e.g. to push out of collisions. In more complex situations, other methods like Rapidly-exploring Random Trees (RRTs) (Bertram et al., 2006) or probabilistic roadmaps (Kavraki et al., 1995) are typically used to first find feasible (e.g. collision-free) paths, which are then refined w.r.t. smoothness and costs using a local optimizer. Such methods are a black-box ingredient to our approach; in the experiments we will mainly use iLQG and also mention RRTs.

Concerning our problem of learning from previous optimization data, there exist multiple branches of related work in the literature. In the context of Reinforcement Learning the transfer problem has been addressed, where the value function (Konidaris & Barto, 2006) or directly the policy (Stolle & Atkeson, 2007; Peshkin & de Jong, 2002) is transferred to a new Markov Decision Process. Konidaris and Barto (2006) also discussed the importance of representations for successful transfer. Although the problem setting is similar, these methods differ in that they do not consider a situation descriptor (or features of the "new" MDP) as the input to a mapping which directly predicts the new policy or value function.

Related work with respect to exploiting databases of previous trajectories has been proposed in the context of RRTs. Branicky et al. (2008) constructed a compact database of collision-free paths that can be reused in future situations to speed up planning, under the assumption that some of the previous paths will not be blocked by future obstacles. Martin et al. (2007) biased RRTs with an environment prior that visits with higher probability the states visited in previous trials, so that after planning in a set of initial environments the obstacles can be rearranged and previous knowledge is used for faster replanning in the new scene, reaching the goal with fewer tree nodes. In both cases, the notion of our situation descriptor and the direct mapping to an appropriate new trajectory is missing. Another interesting way to exploit a database of previous motions is to learn a "capability map", i.e., a representation of the parts of a robot's workspace that can be reached easily, see (Zacharias et al., 2007). While this allows deciding whether a certain task position can be reached quickly, it does not encode a prediction of a trajectory in our sense.

The question of what are suitable representations of a physical configuration, in particular suitable coordinate systems, has previously been considered in a number of works. Wagner et al. (2004) discussed the advantages of egocentric versus allocentric coordinate systems for robot control, and Hiraki et al. (1998) discussed such coordinates in the context of robot and human learning. The choice of the situation descriptor is also crucial when trying to generalize data from old to new situations. Our approach is to use feature selection methods to decide on relevant and well-generalizing situation features. To do this, we redundantly blow up the situation descriptor by defining a vector that contains all kinds of distances in various coordinate systems that can be calculated for the current situation -- namely a 791-dimensional vector we will define later.

The remainder of the paper is organized as follows. We first provide some background, which sets an appropriate framework to formalize our problem. In Section 3 we present our approach to predicting trajectories in new situations and to transferring and refining them. In Section 4 we evaluate the approach on humanoid reaching problems.
2. The Trajectory Prediction Problem

Let q_t \in \mathbb{R}^n be the robot posture as given by all its joint angles at time t. In a given situation x, i.e., for a given initial posture q_0 and the positions of obstacles and targets in this problem instance (we will formally define x below), the problem is to compute a trajectory q = (q_0, .., q_T) for some time horizon T which fulfils different constraints like reaching a task goal and avoiding collisions. We formulate this as an optimization problem by defining a cost function

    C(x, q) = \sum_{t=1}^{T} \big( g(q_t) + h(q_t, q_{t-1}) \big) .    (1)

We will specify such a cost function explicitly in the experiments section. Generally, g will account for task targets and collision avoidance, and h for control costs. A trajectory optimization algorithm essentially tries to map a situation x to a trajectory q^* which is optimal,

    x \mapsto q^* = \text{argmin}_q\, C(x, q) .    (2)

For this we assume to have access to C(x, q) and to local (linear or quadratic) approximations of C(x, q) as provided by a simulator, i.e., we can numerically evaluate C(x, q) for given x and q but we have no analytic model.

The problem we address in this paper is to learn an approximate model of the mapping (2) from a data set of previously optimized trajectories. The dataset D comprises pairs of situations and optimized trajectories,

    D = \{(x^i, q^i)\}_{i=1}^{d} ,   q^i \approx \text{argmin}_q\, C(x^i, q) .    (3)

As an aside, this problem setup is generally reminiscent of structured output regression. However, in a structured output scenario one devises a discriminative function C(x, q) for which argmin_q C(x, q) can be computed efficiently, e.g. by inference methods. Our problem is quite the opposite: we assume argmin_q C(x, q) is very expensive to evaluate and thus learn from a data set of previously optimized solutions. A possibility to bring both problems together is to devise approximate, efficiently computable structured models of trajectories and learn the approximate mapping in a structured regression framework. This is left to future research. In this paper we show the advantages of trajectory prediction by adapting basic classification and regression methods to work with pre-calculated values of C(x, q) from costly simulations.

3. Prediction and Transfer of a Trajectory to a New Situation

The problem of trajectory prediction as defined in the previous section is generally very hard. We decompose the whole problem into a number of steps, which we describe one after another in this section. As an overview, these steps are:

1. Compute the high-dimensional descriptor x for the new situation.
2. Use a learnt mapping to predict a (task space) trajectory for the new situation x.
3. Transfer the trajectory to the new situation using Inverse Kinematics (IK), generating a trajectory q.
4. Use a trajectory optimization algorithm to refine q, i.e., initialize the algorithm with q and compute the optimal trajectory q^*.

We measure the performance of this procedure in terms of how much refinement is needed in the last step (the other steps are computationally cheap). Since the optimizer we use in the refinement (iLQG) is an iterative algorithm, we measure this "refinement cost" as

    F(x, q) = \sum_{j=1}^{J} C(x, q^j) ,    (4)

where q^j is the trajectory found by the optimizer after j iterations, starting from the initialization q^0 = q, and J is a constant (we set J = 40 as a good compromise between giving a good estimate of the convergence and fast evaluation). This measure can be interpreted as the area under the optimization curve.
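To make this performance measure concrete, here is a minimal sketch of computing F. The names `cost` and `optimizer_step` are hypothetical interfaces standing in for the simulator's evaluation of C(x, q) and for one iteration of the refinement optimizer (e.g. iLQG), neither of which is given in code form in the paper.

```python
def refinement_cost(x, q_init, cost, optimizer_step, J=40):
    """Refinement cost F(x, q) of Eq. (4): the area under the
    optimization curve, accumulated over J optimizer iterations."""
    q = q_init                     # q^0 = q, the predicted trajectory as initialization
    F = 0.0
    for _ in range(J):
        q = optimizer_step(x, q)   # q^j, the iterate after j refinement steps
        F += cost(x, q)            # accumulate C(x, q^j)
    return F
```

A good prediction q_init thus yields a small F, since the optimizer starts near a local optimum and every iterate is already cheap.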
3.1. High-dimensional Situation Descriptor

Step 1 is to compute a high-dimensional situation descriptor. A situation (or problem instance) is fully specified by the initial robot posture q_0 and the positions of obstacles and targets in this problem instance. There are many ways to describe a situation by some vector x. For instance, positions of obstacles could be given relative to some world coordinate system, relative to the robot's base, or relative to the endeffector. We should expect that our ability to generalize to new situations crucially depends on how we describe situations. Our approach is to first define a very high-dimensional and redundant situation descriptor which includes distances and relative positions w.r.t. many possible frames of reference. Given this descriptor, we use a feature selection technique to infer from the data which of these dimensions are best for trajectory prediction in new situations.

Concretely, in our reaching task experiments, x \in \mathbb{R}^s is defined as an s = 791-dimensional vector. We have 31 robot joint angles in the posture vector q_0. We have 20 objects for which we measure pairwise distances and rotations: 18 body parts (the upper body of our robot), the obstacle object (a single table in our scenario), and the reach target location. This makes 190 combinations of such object pairs. For each pair i we measure the 3D relative distance p_i = (p_i^x, p_i^y, p_i^z) of the object centers and the 1D cosine o_i of the relative orientation of the objects' z axes. In total,

    x = (q_0, p_1, \ldots, p_{190}, o_1, \ldots, o_{190}) \in \mathbb{R}^{791} .    (5)

Even larger descriptor spaces are possible, for example taking polar coordinates and relative object coordinate frames instead of the world frame, but the choice (5) turned out to be sufficient in our experiments.

3.2. Trajectory Prediction

As in typical kernel machines, at the core of a good predictor is a good choice of similarity measure (kernel) in input space, see (Scholkopf & Smola, 2001). We consider rather basic prediction methods for (2) -- namely nearest neighbour (NN) and locally weighted regression (LWR) -- but spend some effort on training a suitable similarity measure in situation space. Given the data set D = \{(x^i, q^i)\}_{i=1}^d and a new situation x, we want to compute a similarity between x and each x^i in the data set,

    k(x, x^i) = \exp\{-\tfrac{1}{2} (x - x^i)^\top W (x - x^i)\} ,    (6)
    W = \text{diag}(w_1^2, .., w_s^2) ,    (7)

where the entries w_j^2 of the diagonal metric parametrize the weighting of the jth dimension of the situation descriptor. Given this metric, the NN predictor is

    f(x) = T_{x^{\hat i} \to x}\, q^{\hat i} ,   \hat i = \text{argmax}_i\, k(x^i, x) ,    (8)

where T_{x^i \to x} = \phi_x^{-1} \circ \phi_{x^i} is a transfer operator which first projects to a task space in situation x^i and then projects back to a joint trajectory in situation x -- we will explain this operator in detail in the next section. For K nearest neighbours, the LWR predictor is

    f(x) = \phi_x^{-1}\Big( \frac{\sum_{i=1}^{K} k(x, x^i)\, \phi_{x^i}(q^i)}{\sum_{j=1}^{K} k(x, x^j)} \Big) .    (9)

Note that for K = 1 the two methods are identical. For higher values of K, the similarity k(x, x^i) controls how we average between the nearest neighbours in task space, and we always remain in the convex hull of these neighbours, which assures meaningful trajectory outputs. We also tried Kernel Ridge Regression (KRR) but observed worse performance, in particular when the new situation is far from all previously seen ones (all k(x, x^i) are small) and KRR essentially returns the global mean of all trajectories -- this global interpolation typically seemed not useful.
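The following sketch shows Eqs. (6)-(9) in code. It assumes all task-space trajectories have the same length (true here, since T is fixed) and that an inverse-kinematics routine `ik` implementing \phi_x^{-1} is available; `kernel`, `lwr_predict` and `ik` are illustrative names, not the paper's implementation.

```python
import numpy as np

def kernel(x_new, x_i, w):
    """Gaussian similarity with diagonal metric W = diag(w**2), Eqs. (6)-(7)."""
    d = x_new - x_i
    return float(np.exp(-0.5 * np.sum((w * d) ** 2)))

def lwr_predict(x_new, data, w, ik, K=5):
    """LWR predictor of Eq. (9); with K=1 it reduces to the NN predictor (8).

    data -- list of (x_i, y_i) pairs, where y_i = phi_{x_i}(q_i) is the i-th
            optimized trajectory already projected into task space
    ik   -- hypothetical routine implementing phi_x^{-1} in the new situation
    """
    sims = np.array([kernel(x_new, x_i, w) for x_i, _ in data])
    nn = np.argsort(-sims)[:K]              # indices of the K most similar situations
    weights = sims[nn] / np.sum(sims[nn])   # normalized kernel weights
    y_new = sum(wk * data[i][1] for wk, i in zip(weights, nn))
    return ik(x_new, y_new)                 # map the averaged task trajectory to joint space
```

Since the weights are convex, y_new indeed stays inside the convex hull of the neighbouring task-space trajectories.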
We train the parameters w = (w_1, ..., w_s) on the basis of the performance measure F(x, q) defined in (4). Assume we had a probabilistic model f(x) that only outputs (transferred) trajectories q^i from the database and selects them with probability

    P(f(x) = T_{x^i \to x} q^i) = \frac{1}{Z} \exp\{-\tfrac{1}{2}(x - x^i)^\top W (x - x^i)\} ,    (10)

where Z = \sum_i \exp\{-\tfrac{1}{2}(x - x^i)^\top W (x - x^i)\} ensures normalization. The expected cost of this probabilistic mapping is

    E\{F(x, f(x))\} = \sum_{i=1}^{d} P(f(x) = T_{x^i \to x} q^i)\, F(x, T_{x^i \to x} q^i) .    (11)

We choose the weights w such that E\{F(x, f(x))\} is minimal for all x. Our approach has some analogies with other kernel training and feature selection methods, see (Lowe, 1995). The derivative of this expected cost with respect to the vector w can be calculated as

    \frac{\partial E\{F(x, f(x))\}}{\partial w} = \sum_{i=1}^{d} F(x, T_{x^i \to x} q^i)\, \frac{\partial P(f(x) = T_{x^i \to x} q^i)}{\partial w} ,    (12)

    \frac{\partial P(f(x) = T_{x^i \to x} q^i)}{\partial w} = \frac{1}{Z} \frac{\partial \tilde P_i}{\partial w} - \frac{\tilde P_i}{Z^2} \sum_{j=1}^{d} \frac{\partial \tilde P_j}{\partial w} ,    (13)

    \frac{\partial \tilde P_i}{\partial w} = \frac{\partial}{\partial w} \exp\{-\tfrac{1}{2}(x - x^i)^\top W (x - x^i)\} = -\tilde P_i\, \text{diag}(w)\, (x - x^i)^2 ,    (14)

where \tilde P_i = \exp\{-\tfrac{1}{2}(x - x^i)^\top W (x - x^i)\} is the unnormalized probability and (x - x^i)^2 is taken elementwise. Given a second dataset \bar D of situations for which we measured all costs F of transfers from D, we want to minimize

    \sum_{x \in \bar D} E\{F(x, f(x))\} + \lambda |w|_1 ,    (15)

where \lambda controls the trade-off with the regularization. Having the squares of w on the diagonal of W ensures that we get a positive matrix W with an unconstrained minimization algorithm. We used the L1 norm of w to get a sparse solution and took sign(w) as its gradient, see (Schmidt et al., 2007).

3.3. Task Space Transfer

In the previous sections we already used the transfer operator T_{x^i \to x} that transfers a data trajectory from situation x^i to a new situation x. We propose to do this transfer in task space. Assume we have a kinematic mapping \phi_x : q_t \mapsto y_t which maps a joint configuration q_t \in \mathbb{R}^n to a task vector y_t \in \mathbb{R}^m, for instance the endeffector position of the robot. Generally, such a mapping depends on the situation x, e.g. on the positions of other objects. Then we can project a joint space trajectory q into the task space; we write y = \phi_x(q) \in \mathbb{R}^{(T+1) \times m} for this, meaning that \phi_x is applied in every time slice. In a new situation (where we know the initial posture q_0), this task space trajectory can be projected back to joint space using inverse kinematics (IK), see (Nakamura & Hanafusa, 1987). Stretching rigorous mathematical notation a bit, we write this IK projection of a task trajectory as \phi_x^{-1}(y). In this notation, the transfer operator is the concatenation T_{x' \to x} = \phi_x^{-1} \circ \phi_{x'}.

Generally, the task space can be freely defined in this approach. In our experiments we chose the task space to be the 3D endeffector position relative to an obstacle (table) coordinate system. This choice of task space helps to generalize over translations of the obstacle position, see (Berniker & Kording, 2008). We will explain more details on the inverse kinematics used in the experimental section.
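The paper defers the IK details to the experimental section; the sketch below shows one standard way to realize \phi_x^{-1}, assuming a pseudo-inverse differential IK scheme in the spirit of Nakamura & Hanafusa (1987). `fwd_kin` and `jacobian` are hypothetical interfaces to the robot's kinematics in the new situation.

```python
import numpy as np

def transfer_task_trajectory(y, q0, fwd_kin, jacobian):
    """Follow a task-space trajectory y = (y_0, ..., y_T) from the initial
    posture q0, producing a joint trajectory, i.e. one realization of
    the IK projection phi_x^{-1}(y).

    fwd_kin(q)  -- task-space map phi_x(q), e.g. the endeffector position
                   in table-relative coordinates
    jacobian(q) -- its Jacobian d phi_x / dq
    """
    q = np.array(q0, dtype=float)
    traj = [q.copy()]
    for y_t in y[1:]:
        err = y_t - fwd_kin(q)                  # task-space tracking error
        dq = np.linalg.pinv(jacobian(q)) @ err  # least-squares joint update
        q = q + dq
        traj.append(q.copy())
    return np.stack(traj)
```

In a redundant robot one would typically add a nullspace term pulling towards a comfortable posture; this is omitted here for brevity.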
3.4. A Clustering Approach to Prediction

As an alternative to the above prediction scheme, and for empirical evaluation, we also test a simple clustering approach where we choose from a small predefined movement set:

1. Cluster the task space trajectories \phi_{x^i}(q^i) using standard Euclidean distance into c clusters with centroids \{y^i\}_{i=1}^c.
2. Gather training data pairs (x, i^*) where i^* = \text{argmin}_{i \in C}\, F(x, \phi_x^{-1}(y^i)).
3. Train a supervised classification algorithm on this data, so that we can predict f(x) = \phi_x^{-1}(y^{i^*}).

We were motivated for this approach by the good low-dimensional structure we found in D, see Figure 1, where different regions correspond to characteristic movements and situations as in Figure 2; the situations x_a and x_b are visualized there. More elaborate ways to encode generalized movements and primitives have been proposed, like Hidden Markov Models (Calinon & Billard, 2005; Shon et al., 2007), but cluster analysis has the advantage of simplicity. The task space trajectories y^i correspond to prototype avoidance paths around the obstacle in Figure 3, since the mapping \phi_{x^i} projects successfully optimized trajectories into relative endeffector-table coordinates.

[Figure 1: Low-dimensional embedding of D using the Euclidean distance matrix between the \phi_{x^i}(q^i); colours indicate the assigned cluster.]
[Figure 2: Two situations; the goal is to reach the target. (a) Situation x_a: move the hand under the table. (b) Situation x_b: move the hand over the table.]
[Figure 3: Prototype trajectories y^i, tracing good avoidance paths around the obstacle.]

4. Experiments

4.1. Problem Setup

The scenario we examined contains a humanoid robot body with the right hand index finger as endeffector, a reaching target (red point) and an obstacle (the table), as seen in Figure 2. The task for the robot is to reach the target with the endeffector without colliding with the obstacle. Different scenarios are generated by uniformly sampling the position of the table in a box of size (0.9, 0.2, 0.2), the target in (0.5, 0.2, 0.6), and the initial endeffector position in (0.3, 0.3, 0.9). Situations with initial collisions, and too-easy situations where the endeffector was closer than 0.3 to the target, were discarded in order to avoid trivial instances and to put a greater focus on more challenging scenarios where the endeffector must move to the other side of the table to reach the target. With this generative model we gathered a database D of 1000 situations for use in our experiments.

We set T = 200 to allow for a smooth and detailed movement description. We chose the term h in the cost function (1) to enforce a trajectory of short length with smooth transitions between the trajectory steps:

    h(q_t, q_{t-1}) = (q_t - q_{t-1})^2 .    (16)

The cost term g in (1) is defined as

    g(q_t) = g_1(q_t) + g_2(q_t) ,    (17)

where g_1 penalizes collisions while executing the grasp movement. The value of this collision cost is the sum of the squared pairwise penetration depths c_i of colliding objects; minimizing it moves the robot body parts away from obstacles:

    g_1(q_t) = 10^5 \sum_i c_i^2 .    (18)

The task of reaching the target position with the endeffector is represented by g_2. We want the target to be reached at the end of the movement, so we define this cost to have a higher value for t = T:

    g_2(q_t) = \begin{cases} 10^4\, d_t^2 & t = T \\ d_t^2 & t < T \end{cases}    (19)

where d_t denotes the distance between the endeffector and the target at time t.
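A compact sketch of these cost terms follows. The penetration depths and the endeffector-target distance d_t are assumed to come from a collision checker and the forward kinematics respectively, both outside the scope of this sketch, and the case form of g_2 follows the reconstruction in Eq. (19).

```python
import numpy as np

def control_cost(q_t, q_prev):
    """h(q_t, q_{t-1}) of Eq. (16): squared step length, enforcing short
    trajectories with smooth transitions."""
    return float(np.sum((np.asarray(q_t) - np.asarray(q_prev)) ** 2))

def collision_cost(penetrations):
    """g_1 of Eq. (18): 1e5 times the sum of squared pairwise penetration
    depths c_i of colliding objects."""
    c = np.asarray(penetrations, dtype=float)
    return 1e5 * float(np.sum(c ** 2))

def target_cost(d_t, t, T):
    """g_2 of Eq. (19, reconstructed): squared endeffector-target distance,
    upweighted by 1e4 at the final time step so the target is reached at
    the end of the movement."""
    return (1e4 if t == T else 1.0) * d_t ** 2
```

Summing control_cost, collision_cost and target_cost over all time slices reproduces the total trajectory cost C(x, q) of Eq. (1).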