Learning Complex Motions by Sequencing Simpler Motion Templates

Gerhard Neumann (gerhard@igi.tugraz.at), Wolfgang Maass (maass@igi.tugraz.at)
Institute for Theoretical Computer Science, Graz University of Technology, A-8010 Graz, Austria

Jan Peters (mail@jan-peters.net)
Max Planck Institute for Biological Cybernetics, D-72076 Tübingen, Germany

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

Abstraction of complex, longer motor tasks into simpler elemental movements enables humans and animals to exhibit motor skills which have not yet been matched by robots. Humans intuitively decompose complex motions into smaller, simpler segments. For example, when describing a simple movement like drawing a triangle with a pen, we can easily name the basic steps of this movement. Surprisingly, such abstractions have rarely been used in artificial motor skill learning algorithms. These algorithms typically choose a new action (such as a torque or a force) at a very fast time-scale. As a result, both the policy and the temporal credit assignment problem become unnecessarily complex, often beyond the reach of current machine learning methods. We introduce a new framework for temporal abstraction in reinforcement learning (RL), i.e. RL with motion templates. We present a new algorithm for this framework which can learn high-quality policies by making only a few abstract decisions.

1. Introduction

Humans use abstractions to simplify the motor tasks occurring during their daily life. For example, when describing a simple movement like drawing a triangle with a pen, we can easily name the basic steps of this movement. In a similar manner, many complex movements can be decomposed into smaller, simpler segments. This sort of abstraction is, for example, often used by engineers for designing hybrid control solutions (Xu & Antsaklis, 2002), where the single segments are implemented as local, linear continuous controllers. We will call these building blocks motion templates. Other names that can be found in the literature are "motion primitives", "movement schemas", "basis behaviors" or "options" (Ijspeert et al., 2002; Arbib, 1981; Dautenhahn & Nehaniv, 2002; Sutton et al., 1999).

Motor skill learning is a challenging problem for machine learning and, in particular, for the subfield of reinforcement learning (RL). Motor skill learning is primarily treated in the flat RL setting, without the use of abstractions. In this setting the agent has to choose a new action (typically a motor force or torque) at a very high sampling frequency. While this allows the representation of arbitrary policies, this flexibility makes the learning problem so complex that it is often beyond the reach of current methods.

A common approach for limiting the potential complexity of the policy in the flat RL setting is to use a parametrized policy. Ijspeert et al. (2002) introduced a special kind of parametrized policy, called motion primitives, which are based on dynamical systems. In most applications to date, only a single motion primitive is used for the whole movement. Parametrized policy search methods such as policy gradient descent and EM-like policy updates (Kober & Peters, 2009) have been used in order to improve single-stroke motor primitives.
Abstractions are currently rarely used in RL algorithms for continuous environments, with few exceptions such as (Huber & Grupen, 1998; Ghavamzadeh & Mahadevan, 2003). In (Huber & Grupen, 1998) the policy acquisition problem is reduced to learning to coordinate a set of closed-loop control strategies. In (Ghavamzadeh & Mahadevan, 2003) the given task is manually decomposed into a set of subtasks. Both the lower-level subtasks and the higher-level subtask-selection policies are learned. In all these approaches the structure of the hierarchy of abstraction is manually designed and fixed during learning, which limits the generality of these approaches. In our approach, an arbitrary parametrization of the abstracted level can be learned.

In this paper, we introduce a new framework for abstraction in RL, i.e. RL with motion templates. Motion templates are our building blocks of motion. A template m_p is represented as a parametrized policy and executed until its termination condition is fulfilled. We assume that the functional forms of the motion templates remain fixed, and thus our task is to learn the correct order and parameters of the motion templates by reinforcement learning. As motion templates are temporally extended actions, they can be seen as parametrized options in continuous time. There are a few well-established learning algorithms for the options framework (Sutton et al., 1999); however, these algorithms are designed for discrete environments.

Choosing the parameters of a motion template is a continuous-valued decision. However, a single decision now has much more influence on the outcome of the whole motion than in flat RL. Thus, the decisions have to be made more precisely; the overall learning problem, though, is simplified because far fewer decisions are needed to fulfill a task. As RL in continuous action spaces is already challenging in the flat RL setting, the requirement of learning highly precise policies has limited the use of this sort of abstraction for motor control learning. This paper introduces a new algorithm which satisfies this requirement and therefore permits learning at an abstract level.

The algorithm is based on the Locally-Advantage-WEighted Regression (LAWER) algorithm. LAWER is a fitted Q-iteration (Ernst et al., 2005) based algorithm which has been shown to learn high-quality continuous-valued policies in many flat RL settings (Neumann & Peters, 2009). However, two substantial extensions are needed to render motion template learning possible. Firstly, we propose an improved estimation of the goodness of a state-action pair. Secondly, we introduce an adaptive kernel which is based on randomized regression trees (Ernst et al., 2005). We conduct experiments on 3 different tasks: a 1-link and a 2-link pendulum swing-up task, and a 2-link balancing task.

2. Motion Templates

A motion template m_p is defined by its k_p-dimensional parameter space Θ_p ⊂ R^{k_p}, its parametrized policy u_p(s, t; θ_p) (s is the current state, t represents the time spent executing the template and θ_p ∈ Θ_p is the parameter vector) and its termination condition c_p(s, t; θ_p). At each decision-time point t_k, the agent has to choose a motion template m_p from the set A(s_k) and also the parametrization θ_p of m_p. Subsequently the agent follows the policy u_p(s, t; θ_p) until the termination condition c_p(s, t; θ_p) is fulfilled. Afterwards, we obtain a new decision-time point t_{k+1}. The functional forms of the policy u_p(s, t; θ_p) and the termination condition c_p(s, t; θ_p) are defined beforehand and can be arbitrary functions.

For example, consider again the task of drawing a triangle. We can define a motion template m_line for drawing a line, with the endpoint of the line and the velocity of moving the pen as parameters. The policy u_line moves the pen from the current position with the specified velocity in the direction of the endpoint of the line. The template is terminated when the pen has reached a certain neighborhood of the endpoint. In our experiments, sigmoidal functions and linear controllers are used to model the motion templates.
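To make this definition concrete, the sketch below shows one way such templates could be represented in code. The class layout, the assumed state convention (the first two state entries are the pen position) and the termination threshold are illustrative assumptions, not part of the paper's formulation.

```python
import numpy as np

class MotionTemplate:
    """Illustrative interface for a motion template m_p: a parametrized
    policy u_p(s, t; theta) and a termination condition c_p(s, t; theta)."""

    def policy(self, s, t, theta):
        raise NotImplementedError

    def terminated(self, s, t, theta):
        raise NotImplementedError


class LineTemplate(MotionTemplate):
    """Hypothetical m_line template for the triangle example:
    theta = (endpoint_x, endpoint_y, speed)."""

    def policy(self, s, t, theta):
        pos = np.asarray(s[:2])              # assumed state layout: pen position first
        endpoint, speed = np.asarray(theta[:2]), theta[2]
        direction = endpoint - pos
        dist = np.linalg.norm(direction)
        if dist < 1e-8:
            return np.zeros(2)
        return speed * direction / dist      # commanded pen velocity

    def terminated(self, s, t, theta):
        # terminate once the pen is within a small neighborhood of the endpoint
        return np.linalg.norm(np.asarray(s[:2]) - np.asarray(theta[:2])) < 0.01
```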
2.1. Reinforcement Learning with Motion Templates

Each motion template is a temporally extended, continuous-valued action. Thus, we deal with a continuous-time Semi-Markov Decision Process (SMDP). We will review only the relevant concepts from the continuous-time SMDP framework; for a detailed definition, please refer to (Bradtke & Duff, 1995). Unlike in standard Markov Decision Processes (MDPs), the transition probability function P(s', d | s, a) is extended by the duration d of an action. The Bellman equation for the value function V^π(s) of policy π is given by

V^\pi(s) = \int_a \pi(a|s) \left( r(s,a) + \int_{s'} \int_{t=0}^{\infty} \exp(-\beta t) \, P(s', t \mid s, a) \, V^\pi(s') \, dt \, ds' \right) da,    (1)

where β is the discount factor. (In order to achieve the same discounting rate as in a flat MDP, β can be calculated from the relation γ = exp(-β Δt), where γ is the discount factor and Δt is the time step of the flat MDP.) The action value function Q^π(s, a) is given by

Q^\pi(s,a) = r(s,a) + \int_{s'} \int_{t=0}^{\infty} \exp(-\beta t) \, P(s', t \mid s, a) \, V^\pi(s') \, dt \, ds'.    (2)

A policy is now defined as π(m_p, θ_p | s_k). It can be decomposed into π(m_p | s_k) π_p(θ_p | s_k), where π(m_p | s_k) is the template selection policy and π_p(θ_p | s_k) is the policy for selecting the parameters of template m_p.
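The following sketch illustrates how an episode unfolds at this level of abstraction: at each decision-time point a template and its parameters are chosen, the template runs until termination, and one SMDP transition (state, abstract action, accumulated reward, duration, next state) is recorded. The environment interface (reset/step/done), the time step dt and the within-template reward discounting are assumptions made for illustration; they are not prescribed by the paper.

```python
import numpy as np

def rollout(env, templates, select, beta=0.2, dt=0.05):
    """Collect SMDP transitions <s, (m_p, theta_p), r, d, s'> for one episode.
    'select(s)' returns a template index p and a parameter vector theta,
    i.e. a sample from the decomposed policy pi(m_p|s) pi_p(theta_p|s)."""
    transitions = []
    s = env.reset()
    while not env.done():
        s0 = s
        p, theta = select(s0)                        # one abstract decision
        template, t, r_acc = templates[p], 0.0, 0.0
        while not template.terminated(s, t, theta) and not env.done():
            a = template.policy(s, t, theta)         # low-level torque/force
            s, r = env.step(a, dt)
            r_acc += np.exp(-beta * t) * r           # discounted reward inside the template
            t += dt
        transitions.append((s0, (p, theta), r_acc, t, s))
    return transitions
```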
3. Fitted Q-Iteration

As LAWER is a fitted Q-iteration (FQI) (Ernst et al., 2005; Riedmiller, 2005) based algorithm, we quickly review the relevant concepts. FQI is a batch-mode reinforcement learning (BMRL) algorithm. In BMRL algorithms we assume that all the experience of the agent up to the current time is given in the form H = {<s_i, a_i, r_i, s'_i>}_{1 ≤ i ≤ N}. FQI estimates an optimal control policy from this historical data. To this end it approximates the state-action value function Q(s, a) by iteratively using supervised regression techniques. New target values for the regression are generated by

\tilde{Q}_{k+1}(i) = r_i + V_k(s'_i) = r_i + \max_{a'} Q_k(s'_i, a'),    (3)

which are subsequently used to learn the Q-function Q_{k+1}(s, a). For more details please refer to (Neumann & Peters, 2009).

3.1. Fitted Q-Iteration for SMDPs

For SMDPs we have to include the duration d_i of each action in our historical data H = {<s_i, a_i, r_i, d_i, s'_i>}_{1 ≤ i ≤ N}. Instead of using Equation 3, new Q-values can now be calculated by

\tilde{Q}_{k+1}(i) = r_i + \exp(-\beta d_i) \max_{a'} Q_k(s'_i, a').    (4)

3.2. Locally-Advantage-WEighted Regression (LAWER)

A severe problem when using fitted Q-iteration for continuous action spaces is the use of the greedy operation V_k(s) = max_{a'} Q_k(s, a'), which is hard to perform. LAWER (Neumann & Peters, 2009) is a variant of FQI which avoids this max operator and is therefore well suited for continuous action spaces. The algorithm has been shown to learn high-quality policies in many flat RL settings.

Instead of using the max operator, a soft-max operator is used which can be efficiently approximated by an advantage-weighted regression. The advantage-weighted regression solely uses the given state-action pairs (s_i, a_i) to estimate the V-function and therefore avoids an exhaustive search in the action space. State-action pairs with a higher expected advantage (the advantage function is given by A(s_i, a_i) = Q(s_i, a_i) - V(s_i)) have a higher influence on the regression. The regression uses the state vectors s_i as input dataset, the Q-values \tilde{Q}_{k+1}(i) as target values and an additional weighting u_i for each data point. The authors proved that the result of the advantage-weighted regression is an approximation of the V-function V(s) = max_{a'} Q_k(s, a'). The weighting u_i can be seen as the goodness of using action a_i in state s_i. It is estimated by u_i = exp(τ Ā(s_i, a_i)), where Ā(s_i, a_i) denotes the normalized advantage function and the parameter τ sets the greediness of the soft-max operator. We skip the description of the normalization of the advantage function because, for this paper, it is enough to know that the normalization, and also the proof of the algorithm, assume normally distributed advantage values. For a more detailed description of Ā(s_i, a_i) please refer to (Neumann & Peters, 2009).

LAWER uses Locally Weighted Regression (LWR, by Atkeson et al., 1997) for approximating the Q and the V-function. It therefore needs to be able to calculate the similarity w_i(s) between a state s_i in the dataset H and a state s. The state similarities w_i(s) can be calculated by a Gaussian kernel, w_i(s) = exp(-(s_i - s)^T D (s_i - s)). In this paper we also introduce an adaptive kernel in Section 4.1. For simplicity, we will denote w_i(s_j) as w_ij for all s_j ∈ H. Standard LWR is used to estimate the Q-function. The V-function is approximated by a combination of LWR and advantage-weighted regression. In order to do so, the advantage weighting u_i is multiplicatively combined with the state similarity weighting, resulting again in a standard weighted linear regression. For the exact equations, please refer to (Neumann & Peters, 2009).

The optimal policy π(a|s) = N(a | µ(s), Σ(s)) is modelled as a stochastic policy with Gaussian exploration. The mean µ(s) can be determined by a similar locally and advantage-weighted regression; just the actions a_i are used as targets instead of the Q-values. The covariance matrix Σ(s) is given by calculating the advantage-weighted covariance of locally neighbored actions. Intuitively speaking, the V-function is calculated by interpolating between the Q-values of locally neighbored state-action pairs, but only examples with a high goodness u_i (i.e. a high normalized advantage value) are used. The same is true for the policy; we just interpolate between the action vectors.
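A compact sketch of these two ingredients is given below: the duration-discounted FQI targets of Equation 4 and a locally weighted, advantage-weighted estimate of V(s) with a Gaussian state kernel. This is a simplified illustration (plain weighted least squares on a linear-plus-bias feature map), not the exact implementation of Neumann & Peters (2009).

```python
import numpy as np

def smdp_targets(batch, V_next, beta=0.2):
    """FQI targets for SMDP data (Eq. 4): Q~(i) = r_i + exp(-beta d_i) V(s'_i)."""
    return np.array([r + np.exp(-beta * d) * V_next(s_next)
                     for (_s, _a, r, d, s_next) in batch])

def gaussian_similarities(S, s, D):
    """State similarities w_i(s) = exp(-(s_i - s)^T D (s_i - s))."""
    diff = S - s
    return np.exp(-np.einsum('ij,jk,ik->i', diff, D, diff))

def weighted_linear_fit(X, y, w, ridge=1e-6):
    """Weighted least squares on [x, 1] features."""
    Phi = np.hstack([X, np.ones((len(X), 1))])
    A = Phi.T @ (w[:, None] * Phi) + ridge * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ (w * y))

def local_value_estimate(s, S, Q_targets, u, D):
    """V(s) from an advantage-weighted local regression: the advantage
    weighting u_i is multiplied with the state similarity w_i(s)."""
    w = gaussian_similarities(S, s, D) * u
    coef = weighted_linear_fit(S, Q_targets, w)
    return np.append(s, 1.0) @ coef
```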
4. Fitted Q-Iteration for Motion Templates

In order to apply the LAWER algorithm to the motion template framework, we use a separate dataset H^p and individual estimates Q^p and V^p of the Q and V-function for each motion template m_p. The functions V^p and Q^p represent the state and state-action value function when choosing motion template m_p for the first decision and subsequently following the optimal policy. We implement the template selection policy π(m_p | s_k) as a soft-max policy. The overall value function is determined by V(s_k) = max_{m_p ∈ A(s_k)} V^p(s_k). LAWER is used to learn the individual Q and V-function estimates Q^p and V^p.

In this section we present two extensions which improve the accuracy of LAWER and render learning with motion templates possible. Firstly, adaptive tree-based kernels are used to improve the estimation of the state similarities w_ij. This kernel also adapts to spatially varying curvatures of the regression surface and therefore needs an estimate of the V-function. Secondly, we show how to improve the estimate of the goodness u_i by the use of an additional optimization. Based on the current estimate of the state similarities w_ij, new u_i values, and subsequently also new estimates of the V-function, are calculated. Both algorithms are applied in an intertwined manner to get improved estimates of w_ij and u_i.

4.1. Adaptive Tree-based Kernels

The use of a uniform weighting kernel is often problematic in the case of high-dimensional input spaces ('curse of dimensionality'), spatially varying data densities or spatially varying curvatures of the regression surface. This problem can be alleviated by varying the 'shape' of the weighting kernel. We use the Extremely Randomized Tree (Extra-Tree) algorithm (Ernst et al., 2005) to obtain a varying kernel function. This algorithm has been particularly successful for approximating the Q-function in FQI. We modify this approach to calculate the weighting kernel. The resulting kernel has the same properties as the Extra-Trees, and therefore adapts to the local state density as well as to the local curvature of the V-function.

The standard Extra-Tree algorithm builds an ensemble of regression trees. It has 3 parameters: the number M of regression trees, the number K of randomized splits to evaluate per node, and the maximum number of samples per leaf n_min. For more details about the algorithm please refer to (Ernst et al., 2005). We use the trees for calculating the state similarities w_ij instead of approximating the Q-function. In order to do so, we learn the mapping from the states s_i to the V-values V(s_i) with the Extra-Tree algorithm. The kernel is then given by the fraction of trees in which two states s_i and s_j are located in the same leaf,

w_{ij} = \frac{1}{M} \sum_{k=1}^{M} \mathrm{isSameLeaf}(T_k, s_i, s_j),    (5)

where T_k is the k-th tree in the ensemble and isSameLeaf is a function returning 1 if both examples are located in the same leaf and 0 otherwise. In our experiments we will show the superiority of the tree-based kernels over the Gaussian kernels.
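A possible realization of this kernel with an off-the-shelf ensemble implementation is sketched below. Using scikit-learn's ExtraTreesRegressor is an assumption here (the paper builds on the algorithm of Ernst et al., 2005), and its min_samples_leaf argument only roughly plays the role of n_min; the number K of randomized splits has no direct counterpart in this interface.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def tree_based_kernel(S, V, M=80, n_min=7, seed=0):
    """Fit an Extra-Trees ensemble on the mapping s_i -> V(s_i) and return
    the kernel matrix w_ij = fraction of trees in which s_i and s_j fall
    into the same leaf (Eq. 5)."""
    forest = ExtraTreesRegressor(n_estimators=M, min_samples_leaf=n_min,
                                 random_state=seed)
    forest.fit(S, V)
    leaves = forest.apply(S)                  # shape (N, M): leaf index per tree
    W = np.zeros((len(S), len(S)))
    for k in range(leaves.shape[1]):
        W += (leaves[:, k][:, None] == leaves[None, :, k])
    return W / leaves.shape[1]
```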
4.2. Optimized LAWER

As already pointed out in Section 3.2, LAWER assumes normally distributed advantage values. Often this assumption does not hold, or the normalization of the advantages is imprecise due to too few data points in the neighborhood. This effect is even more drastic if high values of τ are used, because the inaccuracies may result in low activations in areas with a low sample density and therefore also in inaccurate regressions. This restriction on the parameter τ also limits the quality of the estimated policy.

But how can we improve the estimation of the weightings u_i? Let us first consider a greedy policy π_D in a discrete environment. We formulate π_D as a stochastic policy u_ij = π_D(a_j | s_i). The u_ij can be found by solving the following constrained optimization problem:

u^* = \arg\max_u \sum_{i,j} u_{ij} A(s_i, a_j) \quad \text{subject to} \quad \sum_j u_{ij} = 1 \text{ for all states } s_i, \quad 0 \le u_{ij} \le 1 \text{ for all } i, j,    (6)

where u is the vector of all u_ij and A is again the advantage function. In our setting, we also have a finite number of state-action pairs (s_i, a_i), but typically all the states are different. However, the states are linked by the state similarities w_ij. The first constraint of the optimization problem can therefore be reformulated as

\sum_j w_{ij} u_j = 1 \text{ for all states } s_i,    (7)

while the remaining formulation of the optimization is unchanged. We also drop the second index of u_ij because there is only one action for each state s_i. Due to this optimization we only use the u_i with the highest advantage values while ensuring that the summed activation \sum_j w_{ij} u_j is high enough at each state s_i for applying an accurate weighted linear regression.

We solve the constrained optimization problem by maximizing the performance function C,

C = \frac{1}{Z} \sum_j u_j \left( Q(s_j, a_j) - V(s_j) \right) - \lambda \sum_i \frac{\left( \sum_j w_{ij} u_j - \eta \right)^2}{\sum_j w_{ij}},    (8)

with η = 1, where Z is a normalization constant for the advantage values given by Z = \sum_i |Q(s_i, a_i)| / N. The second term of Equation 8 specifies the squared summed activation error for each state s_i. It is normalized by the summed state similarity of this state (i.e. \sum_j w_{ij}). This ensures that the activation error is equally weighted throughout the state space, independent of the local state density. We also introduced a new parameter λ which sets the tradeoff between maximizing the greediness of u_i and minimizing the summed activation error. It replaces the greediness parameter τ of the LAWER algorithm. The function C can be maximized with respect to u_i using gradient ascent; the derivative of C is given by

\frac{dC}{du_k} = \frac{1}{Z} \left( Q(s_k, a_k) - V(s_k) \right) - \lambda \sum_i \frac{2 \left( \sum_j w_{ij} u_j - \eta \right) w_{ik}}{\sum_j w_{ij}}.    (9)

The learning rate for the gradient ascent algorithm is always chosen such that the maximum change of an activation u_i is fixed to 0.01. After each gradient update the weights u_i are restricted to the interval [0, 1]. The gradient ascent update is repeated for N_opt iterations; every M_opt << N_opt iterations the value estimates V(s_i) are recalculated using the current weights u_i. When using the tree-based kernels, we also recalculate the state similarities w_ij with the new estimate of V(s_i). Typical values for N_opt and M_opt are 1000 and 100.
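The procedure just described can be sketched as follows. The initialization of the activations and the matrix form of the gradient of Equation 9 are the only details added here, and the optional recompute_V callback stands in for the re-estimation of the V-function (and, for tree-based kernels, of w_ij) every M_opt iterations.

```python
import numpy as np

def optimize_activations(Q, V, W, lam=0.025, eta=1.0, n_opt=1000, m_opt=100,
                         max_step=0.01, recompute_V=None):
    """Gradient ascent on the performance function C (Eq. 8).
    Q, V: arrays of Q(s_i, a_i) and V(s_i); W: state-similarity matrix w_ij."""
    N = len(Q)
    Z = np.sum(np.abs(Q)) / N                      # normalization of the advantages
    w_sum = W.sum(axis=1)
    u = np.full(N, 0.5)                            # initial activations (an assumption)
    for it in range(n_opt):
        act_err = W @ u - eta                      # summed activation error per state
        grad = (Q - V) / Z - 2.0 * lam * (W.T @ (act_err / w_sum))   # Eq. 9
        step = max_step / (np.max(np.abs(grad)) + 1e-12)             # max change of 0.01
        u = np.clip(u + step * grad, 0.0, 1.0)
        if recompute_V is not None and (it + 1) % m_opt == 0:
            V = recompute_V(u)                     # refresh V(s_i) with the current u
    return u
```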
The covariance matrix of the exploration policy is also calculated slightly differently from the original LAWER algorithm. We require that always the best η_exp locally neighbored actions are used. We therefore use a separate set of advantage weightings u_exp for the covariance calculation, which can be obtained by the same optimization defined in Equation 8; we just have to set η to η_exp. With η_exp we can scale the exploration rate of the algorithm.

5. Results

We evaluated the motion template approach on a 1-link and a 2-link pendulum swing-up task and a 2-link balancing task. For each task the immediate reward function was quadratic in the distance to the goal position x_G and in the applied torque/force, i.e., r = -c_1 |x - x_G|^2 - c_2 |a|^2. For all our experiments we assume that the goal position x_G is known. We collect L new episodes with the currently estimated exploration policy and one episode with the greedy policy (without exploration). After estimating the optimal policy, its performance is evaluated (without exploration) and the data collection is repeated. The initial distributions of the motion template parameters were set intuitively and were by no means optimal. We compared the motion template approach to flat RL with the standard LAWER algorithm.

5.1. Swing-Up Tasks

In this task a pendulum needs to be swung up from the position at the bottom to the top position.

5.1.1. 1-link Pendulum

The link of the pendulum had a length of 1m and a mass of 1kg; no friction was used. The motion templates used represent positive (m_1 and m_2) and negative (m_3 and m_4) peaks in the torque trajectory. There is also an individual template m_5 for balancing the robot at the top position. One peak consists of 2 successive motion templates, one for the ascending and one for the descending part of the peak. The parametrization of the motion templates is given in Table 1. In order to form a proper peak, templates m_2 and m_4 always start with the last torque u_t taken at the end of the previous template. Therefore parameter a_2 of these templates is already determined by u_t, and consequently the outcome of templates m_2 and m_4 depends on u_t. For this reason, the state space of templates m_2 and m_4 was extended by u_t. The balancing template m_5 is implemented as a linear PD-controller (see Table 1). The duration of the peak templates is an individual parameter of the templates (d_i); m_5 is always the final template and runs for 20s, after which the episode is ended.

Table 1. MTs for the swing-up motion. The functional forms resemble sigmoid functions. Parameter a_i corresponds to the height of the peak, o_i to the initial time offset and d_i to the duration of the motion template. k_1 and k_2 are the PD-controller constants of the balancer template. m_3 and m_4 resemble m_1 and m_2 except for a negative sign. The sketches illustrate the torque trajectories of these templates (x-axis: time, y-axis: acceleration).

MT    | Functional Form                              | Parameters
m_0   | a_0 (1 - 2 / (1 + exp(o_0 - 6t/d_0)))        | a_0, o_0, d_0
m_1,3 | a_1 (2 / (1 + exp(o_1 - 6t/d_1)) - 1)        | a_1, o_1, d_1
m_2,4 | a_2 (1 - 2 / (1 + exp(o_2 - 6t/d_2)))        | o_2, d_2
m_5   | -k_1 θ - k_2 θ̇                               | k_1, k_2

The agent always started from the bottom position with motion template m_0. Afterwards it could either choose to use the peak templates in the predefined order (m_3, m_4, m_1, m_2, m_3, ...) or use the balancing template m_5. Thus, the agent had to learn the correct parametrization of the motion templates and the number of swing-up motions. For all experiments a discount factor of β = 0.2 was used, λ was set to 0.025 and η_exp to 20. For the Gaussian kernel we used a bandwidth matrix of D = diag(30, 3) for m_1, m_3 and m_5 and D = diag(30, 3, 1) for the extended state space of templates m_2 and m_4. For the tree-based kernels we used the parameters n_min = 7, M = 80 and K = 20. For the comparison with the flat LAWER algorithm, τ was set to 4 and a time step of 50ms was used. We used L = 50 episodes per data collection. We carried out 3 experiments with different torque punishment factors (c_2 = 0.005, c_2 = 0.025 and c_2 = 0.075).

Figure 1. (a) Torque trajectories and motion templates learned for different action punishment factors c_2. (b) Torque trajectories learned with flat RL.

Figure 2. Learning curves for the Gaussian kernel (MT Gauss) and the tree-based kernel (MT Tree) for (a) c_2 = 0.025 and (b) c_2 = 0.075.

Figure 3. (a) Evaluation of the influence of λ for the Gaussian (MT Gauss) and the tree-based kernel (MT Tree, n_min = 5). (b) Evaluation of the n_min parameter for λ = 0.025. c_2 was set to 0.025 for both evaluations.
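To illustrate the functional forms in Table 1, the torque profile of the ascending peak template m_1 can be evaluated as below. The parameter values are hypothetical, chosen only to show the shape of the profile, and the exact scaling inside the sigmoid follows the form listed in the table.

```python
import numpy as np

def peak_ascending(t, a1, o1, d1):
    """Torque profile of template m_1 (Table 1): a sigmoid that rises
    from roughly -a1 to +a1 over the template duration d1."""
    return a1 * (2.0 / (1.0 + np.exp(o1 - 6.0 * t / d1)) - 1.0)

# Hypothetical parameter values, only to show the shape of the profile.
ts = np.linspace(0.0, 1.0, 5)
print(np.round(peak_ascending(ts, a1=5.0, o1=3.0, d1=1.0), 2))
```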
We compared the learning process of flat RL, motion template learning with Gaussian state similarities (MT Gauss) and motion template learning with adaptive tree-based state similarities (MT Tree); see Figure 2. In the initial learning phase, the flat RL approach is superior to motion template learning, probably due to the larger number of produced training examples. However, RL with motion templates is able to produce policies of significantly higher quality and quickly outperforms the flat RL approach. This can also be seen in Figure 1(a) and (b), where the resulting torque trajectories are compared. Flat RL has difficulties particularly with the hardest setting (c_2 = 0.075), where we obtained a maximum average reward of -48.6 for flat RL and -38.5 for the motion template approach. From Figure 2 we can also see that the tree-based kernel is much more sample efficient than the Gaussian kernel. An evaluation of the influence of the parameter λ can be seen in Figure 3(a) and of the parameter n_min of the tree-based kernel in Figure 3(b). The approach works robustly for a wide range of parameters.

5.1.2. 2-link Pendulum

We also conducted experiments with a 2-link pendulum. The lengths of the links were set to 1m; each link had a mass of 1kg (located at the center of the link). We use the same templates as for the 1-dimensional task; the peak templates now have 2 additional parameters, the height of the peak a_i and the time offset o_i for the second control dimension u_2. Including the duration parameter, this results in 5 parameters for m_0, m_1 and m_3 and 3 parameters for m_2 and m_4. The parameters of the balancer template m_5 consist of two 2 × 2 matrices for the controller gains. Experiments were done for the tree-based kernels with n_min = 8, λ = 0.025 and η_exp = 25. At each data collection, 50 new episodes were collected. For comparison to the flat RL approach we used a bandwidth matrix of D = diag(6.36, 2.38, 3.18, 1.06) and τ = 4. The evaluation of the learning process can be seen in Figure 5(a) and the learned motion and torque trajectories are shown in Figure 4. Also for this challenging task, the motion template approach was able to learn high-quality policies. While the flat RL approach stagnates at an average reward of -28.7, the motion template approach reaches an average reward of -15.6.

Figure 4. (a) Torque trajectories and decomposition into the motion templates for the 2-link pendulum swing-up task. (b) Illustration of the motion. The bold postures represent the switching time points of the motion templates.

Figure 5. Learning curves for motion template learning with tree-based kernels for the (a) 2-link swing-up task and the (b) 2-link balancing task.
5.2. 2-link Balancing

In this task a 2-link pendulum needs to be balanced at the top position after being pushed. The model parameters were chosen to loosely match the characteristics of a human, i.e. l_i = 1m and m_i = 35kg. The hip joint was limited to [-0.1, 1.5]rad and the ankle joint to [-0.8, 0.4]rad. Whenever the robot left this area of the state space, we assumed that the robot had fallen, i.e. a negative reward of -10000 was given. The hip torque was limited to ±500Nm and the ankle torque to ±70Nm. At the beginning of an episode, the robot stands upright and gets pushed with a certain force F. This results in an immediate jump of the joint velocities. The agent has to learn to keep its balance for different perturbations. In (Atkeson & Stephens, 2007) this problem was solved exactly using Dynamic Programming techniques. The authors found that two different balancing strategies emerge. For small perturbations, the ankle strategy, which uses almost only the ankle joint, is optimal. For larger perturbations (F > 17.5Ns), the ankle-hip strategy, which results in a fast bending movement, is optimal. In this experiment we want to reproduce both strategies by motion template learning.

We use two motion templates to model the balancing behavior; both resemble linear controllers. The first motion template (m_0) keeps the robot at the upright position and is similar to m_5 from the previous experiment. The second template m_1 additionally defines a set-point of the linear controller for each joint and a duration parameter d_1. In addition to the 8 controller gains, this results in 11 parameters. The agent can now choose to use m_0 directly in the beginning, or to use m_1 and subsequently m_0. We used 4 different perturbations, i.e., F = 10, 15, 20 and 25Ns. For each perturbation, we collected L = 20 episodes. We again used the tree-based approach with the same parameter setting as in the previous experiment. The learning curve can be seen in Figure 5(b). The resulting torque trajectories are shown in Figure 6(a) and (b). We can clearly identify the ankle strategy for the two smaller perturbations and the ankle-hip strategy for the larger perturbations, which use both motion templates.

Figure 6. Learned solutions for the 2-link balancing problem for (a) F = 10Ns and F = 15Ns (ankle strategy) and (b) F = 20Ns and F = 25Ns (ankle-hip strategy). The sketches below illustrate the temporal course of the balancing movement for the ankle strategy (a) and the ankle-hip strategy (b).

6. Conclusion and Future Work

In this paper we proposed a new framework for temporal abstraction for RL in continuous environments, i.e. RL with motion templates. Learning the overall control task is decomposed into learning a sequence of simpler controllers. Because of the abstractions used, the agent has to make fewer decisions, which simplifies the learning task. We strongly believe that this kind of abstraction may help scaling RL algorithms to more complex domains.

The motion template approach also raises several interesting research questions to which we will dedicate our future work. For example, how can we efficiently add feedback to the motion templates? Which functional forms of the templates can facilitate learning? When do we terminate a motion template, in particular in the case of unforeseen events? Future work will also concentrate on applying the approach to more complex environments such as planar walking robots.
7. Acknowledgment

Written under partial support by the Austrian Science Fund FWF project # P17229-N04 and the projects # FP7-216593 (SECO) and # FP7-506778 (PASCAL2) of the European Union.

References

Arbib, M. A. (1981). Perceptual structures and distributed motor control. Handbook of Physiology, Section 2: The Nervous System, Vol. II, Motor Control, Part 1, 1449-1480.

Atkeson, C., & Stephens, B. (2007). Multiple balance strategies from one optimization criterion. 7th IEEE-RAS International Conference on Humanoid Robots.

Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11, 11-73.

Bradtke, S. J., & Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov decision problems. Advances in Neural Information Processing Systems 7, 393-400.

Dautenhahn, K., & Nehaniv, C. L. (2002). Imitation in Animals and Artifacts. Cambridge: MIT Press.

Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 503-556.

Ghavamzadeh, M., & Mahadevan, S. (2003). Hierarchical policy gradient algorithms. Twentieth International Conference on Machine Learning (ICML 2003) (pp. 226-233).

Huber, M., & Grupen, R. A. (1998). Learning robot control - using control policies as abstract actions. In NIPS'98 Workshop: Abstraction and Hierarchy in Reinforcement Learning.

Ijspeert, A., Nakanishi, J., & Schaal, S. (2002). Learning attractor landscapes for learning motor primitives. Advances in Neural Information Processing Systems 15 (NIPS 2002) (pp. 1523-1530).

Kober, J., & Peters, J. (2009). Policy search for motor primitives in robotics. Advances in Neural Information Processing Systems 22 (NIPS 2008) (pp. 849-856). MA: MIT Press.

Neumann, G., & Peters, J. (2009). Fitted Q-iteration by Advantage Weighted Regression. Advances in Neural Information Processing Systems 22 (NIPS 2008) (pp. 1177-1184). MA: MIT Press.

Riedmiller, M. (2005). Neural fitted Q-iteration - first experiences with a data efficient neural reinforcement learning method. Proceedings of the European Conference on Machine Learning (ECML) (pp. 317-328).

Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and Semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181-211.

Xu, X., & Antsaklis, P. (2002). An approach to optimal control of switched systems with internally forced switchings. Proceedings of the American Control Conference (pp. 148-153). Anchorage, USA.