Receding Horizon Differential Dynamic Programming

Yuval Tassa, Tom Erez & Bill Smart

(Preliminary version, accepted at NIPS 2007. Y. Tassa is with the Hebrew University, Jerusalem, Israel. T. Erez and W. D. Smart are with Washington University in St. Louis, MO, USA.)

Abstract

The control of high-dimensional, continuous, non-linear dynamical systems is a key problem in reinforcement learning and control. Local, trajectory-based methods, using techniques such as Differential Dynamic Programming (DDP), are not directly subject to the curse of dimensionality, but generate only local controllers. In this paper, we introduce Receding Horizon DDP (RH-DDP), an extension to the classic DDP algorithm, which allows us to construct stable and robust controllers based on a library of local-control trajectories. We demonstrate the effectiveness of our approach on a series of high-dimensional problems using a simulated multi-link swimming robot. These experiments show that our approach effectively circumvents dimensionality issues, and is capable of dealing with problems of (at least) 34 state and 14 action dimensions.

1 Introduction

We are interested in learning controllers for high-dimensional, highly non-linear dynamical systems, continuous in state, action, and time. Local, trajectory-based methods, using techniques such as Differential Dynamic Programming (DDP), are an active field of research in the Reinforcement Learning and Control communities. Local methods avoid having to model the value function or policy over the entire state space by focusing computational effort along likely trajectories. Featuring algorithmic complexity polynomial in the dimension, local methods are not as directly affected by dimensionality as space-filling methods, and may thus allow us to circumvent the curse.

In this paper, we introduce Receding Horizon DDP (RH-DDP), a set of modifications to the classic DDP algorithm, which allow us to construct stable and robust controllers based on local-control trajectories in highly non-linear, high-dimensional domains. Our new algorithm, RH-DDP, is reminiscent of Model Predictive Control, and enables us to form a time-independent value function approximation along a trajectory. We aggregate several such trajectories into a library of locally-optimal linear controllers which we then select from, using a nearest-neighbor rule.

Another new aspect of our approach is that we fit the successive value function approximations to a set of local samples around the trajectory. This explicitly takes into account the underlying non-linearity of the dynamics and also allows us to easily implement an arbitrary transformation to an internal coordinate space.

Although we present several algorithmic contributions, a main aspect of this paper is a conceptual one. Unlike much of the recent work in this area, we are not interested in learning to follow a pre-supplied reference trajectory. We define a reward function which represents a global measure of performance relative to some goal, such as swimming towards a target. Unlike a reward based on the distance from a desired configuration, a notion which has its roots in the control community's definition of the problem, this form of reward imposes no restrictions on the final learned solution.

We demonstrate the utility of our approach by learning controllers for a high-dimensional simulation of a planar, multi-link swimming robot.
The swimmer is a model of an actuated chain of links in a viscous medium, with two pairs of position and velocity coordinates, and an angle and angular velocity for each link. The controller must determine the applied torque, one action dimension for each articulated joint. We reward controllers that cause the swimmer to swim towards a target point, specified in the system's internal coordinate frame, and make contact with it. We demonstrate our approach by synthesizing controllers for several swimmers, with state dimensions ranging from 14 to 34. The controllers are shown to exhibit complex locomotive behavior in response to real-time simulated interaction with a user-controlled target.

1.1 Related work

Optimal control of continuous non-linear dynamical systems is a central research goal of the RL community. Even when important ingredients such as stochasticity and on-line learning are removed, the exponential dependence of computational complexity on the dimensionality of the domain remains a major computational obstacle. Methods designed to alleviate the curse of dimensionality include adaptive discretizations of the state space [1] and various domain-specific manipulations [2] which reduce the effective dimensionality.

Local trajectory-based methods such as DDP were introduced to the NIPS community in [3], where a local-global hybrid method is employed. Although DDP is used there, it is considered an aid to the global approximator, and the local controllers are constant rather than locally-linear. In this decade DDP was reintroduced by several authors. In [4] the idea of using the second-order local DDP models to construct locally-linear controllers is introduced. In [5] DDP was applied to the challenging high-dimensional domain of autonomous helicopter control, using a reference trajectory. In [6] a minimax variant of DDP is used to learn a controller for bipedal walking, again by designing a reference trajectory and rewarding the walker for tracking it. In [7], trajectory-based methods including DDP are examined as possible models for biological nervous systems. Local methods have also been used for purely policy-based algorithms [8, 9], without explicit representation of the value function.

The best known work regarding the swimming domain is that by Ijspeert and colleagues (e.g. [10]) using Central Pattern Generators. While the inherently stable domain of swimming allows for such open-loop control schemes, more complex behaviors such as turning and tracking necessitate full feedback control, which CPGs do not provide.

2 Methods

2.1 Definition of the problem

We consider the discrete-time dynamics x^{k+1} = F(x^k, u^k) with states x \in R^n and actions u \in R^m. In this context we assume

F(x^k, u^k) = x^k + \int_0^{\Delta t} f(x(t), u^k)\,dt

for a continuous f and a small \Delta t, approximating the continuous problem and identifying with it in the \Delta t \to 0 limit. Given some scalar reward function r(x, u) and a fixed initial state x^1 (superscripts indicating the time index), we wish to find the policy which maximizes the total reward acquired over a finite temporal horizon:

\pi(x^k, k) = \operatorname{argmax}_{\pi(\cdot,\cdot)} \Big[ \sum_{i=k}^{N} r\big(x^i, \pi(x^i, i)\big) \Big].

(Our choice of reward-maximization, rather than cost-minimization, is of course immaterial.) The quantity maximized on the RHS is the value function, which solves Bellman's equation:

V(x, k) = \max_u \big[ r(x, u) + V(F(x, u), k+1) \big].    (1)

Each of the functions in the sequence {V(x, k)}_{k=1}^{N} describes the optimal reward-to-go of the optimization subproblem from k to N. This is a manifestation of the dynamic programming principle. If N = \infty, or if a receding horizon scheme is used, essentially eliminating the difference between different time-steps, the sequence collapses to a global, time-independent value function V(x).
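As a minimal illustration of this setup, the sketch below (the function names are ours, and a single Euler step stands in for the integral defining F) builds the discrete dynamics from a continuous f and accumulates the finite-horizon return that the policy is chosen to maximize.

    import numpy as np

    def discretize(f, dt):
        """Discrete dynamics F(x, u) ~ x + integral_0^dt f(x(t), u) dt, here a single Euler step."""
        return lambda x, u: x + dt * np.asarray(f(x, u))

    def finite_horizon_return(F, r, policy, x1, N):
        """Total reward sum_{k=1}^{N} r(x^k, u^k) accrued from x1 by a time-indexed policy."""
        x, total = np.asarray(x1, dtype=float), 0.0
        for k in range(1, N + 1):
            u = policy(x, k)        # u^k = pi(x^k, k)
            total += r(x, u)
            x = F(x, u)             # x^{k+1} = F(x^k, u^k)
        return total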
2.2 Standard DDP

Differential Dynamic Programming [11, 12] is an iterative improvement scheme which finds a locally-optimal trajectory emanating from a fixed starting point x^1. At every iteration, an approximation to the time-dependent value function is constructed along the current trajectory {x^k}_{k=1}^{N}, which is formed by iterative application of F using the current control sequence {u^k}_{k=1}^{N}. Every iteration is comprised of two sweeps of the trajectory: a backward and a forward sweep.

In the backward sweep, we proceed backwards in time to generate local models of V in the following manner. Given quadratic models of V(x^{k+1}, k+1), F(x^k, u^k) and r(x^k, u^k), we can approximate the unmaximised value function, or Q-function,

Q(x^k, u^k) = r(x^k, u^k) + V^{k+1}(F(x^k, u^k))    (2)

as a quadratic model around the present state-action pair (x^k, u^k):

Q(x^k + \delta x,\, u^k + \delta u) \approx Q_0 + Q_x \delta x + Q_u \delta u + \tfrac{1}{2}\begin{bmatrix}\delta x\\ \delta u\end{bmatrix}^T \begin{bmatrix}Q_{xx} & Q_{xu}\\ Q_{ux} & Q_{uu}\end{bmatrix}\begin{bmatrix}\delta x\\ \delta u\end{bmatrix}    (3)

where the coefficients Q_{..} are computed by equating coefficients of similar powers in the second-order expansion of (2):

Q_x = r_x + V_x^{k+1} F_x
Q_u = r_u + V_x^{k+1} F_u
Q_{xx} = r_{xx} + F_x^T V_{xx}^{k+1} F_x + V_x^{k+1} F_{xx}
Q_{uu} = r_{uu} + F_u^T V_{xx}^{k+1} F_u + V_x^{k+1} F_{uu}
Q_{xu} = r_{xu} + F_x^T V_{xx}^{k+1} F_u + V_x^{k+1} F_{xu}    (4)

Once the local model of Q is obtained, the maximizing \delta u is solved for,

\delta u^* = \operatorname{argmax}_{\delta u}\big[Q(x^k + \delta x,\, u^k + \delta u)\big] = -Q_{uu}^{-1}(Q_u + Q_{ux}\,\delta x)    (5)

and plugged back into (3) to obtain a quadratic approximation of V^k:

V_0^k = V_0^{k+1} - Q_u Q_{uu}^{-1} Q_u    (6a)
V_x^k = Q_x - Q_u Q_{uu}^{-1} Q_{ux}    (6b)
V_{xx}^k = Q_{xx} - Q_{xu} Q_{uu}^{-1} Q_{ux}    (6c)

This quadratic model can now serve to propagate the approximation to V. Thus, equations (4), (5) and (6) iterate in the backward sweep, computing a local model of the value function along with a modification to the policy in the form of an open-loop term -Q_{uu}^{-1} Q_u and a feedback term -Q_{uu}^{-1} Q_{ux}\,\delta x, essentially solving a local linear-quadratic problem in each step. In some sense, DDP can be viewed as dual to the Extended Kalman Filter (though employing a higher-order expansion of F).

In the forward sweep of the DDP iteration, both the open-loop and feedback terms are combined to create a new control sequence {\hat{u}^k}_{k=1}^{N} which results in a new nominal trajectory {\hat{x}^k}_{k=1}^{N}:

\hat{x}^1 = x^1    (7a)
\hat{u}^k = u^k - Q_{uu}^{-1} Q_u - Q_{uu}^{-1} Q_{ux}(\hat{x}^k - x^k)    (7b)
\hat{x}^{k+1} = F(\hat{x}^k, \hat{u}^k)    (7c)

We note that in practice the inversion in (5) must be conditioned. We use a Levenberg-Marquardt-like scheme similar to the ones proposed in [13]. Similarly, the u-update in (7b) is performed with an adaptive line-search scheme similar to the ones described in [14].

2.2.1 Complexity and convergence

The leading complexity term of one iteration of DDP itself, assuming the model of F as required for (4) is given, is O(N m^{\alpha_1}) for computing (6) N times, with 2 < \alpha_1 < 3 the complexity-exponent of inverting Q_{uu}. In practice, the greater part of the computational effort is devoted to the measurement of the dynamical quantities in (4) or to the propagation of co-location vectors as described below. DDP is a second-order algorithm with convergence properties similar to, or better than, Newton's method performed on the full vectorial u^k with an exact Nm x Nm Hessian [15]. In practice, convergence can be expected after 10-100 iterations, with the stopping criterion easily determined as the size of the improvement plummets near the minimum.
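For concreteness, the following is a minimal NumPy sketch of a single backward-sweep step. It implements (4)-(6) with the second-order dynamics terms F_xx, F_uu, F_xu dropped (the common iLQR simplification), and a simple diagonal regularizer stands in for the Levenberg-Marquardt conditioning of Q_uu mentioned above; all names are illustrative rather than the authors' implementation.

    import numpy as np

    def backward_step(r_x, r_u, r_xx, r_uu, r_xu, F_x, F_u, V_x, V_xx, reg=1e-6):
        """One backward-sweep step of DDP, eqs. (4)-(6), with F_xx, F_uu, F_xu dropped."""
        Q_x  = r_x  + F_x.T @ V_x
        Q_u  = r_u  + F_u.T @ V_x
        Q_xx = r_xx + F_x.T @ V_xx @ F_x
        Q_uu = r_uu + F_u.T @ V_xx @ F_u
        Q_ux = r_xu.T + F_u.T @ V_xx @ F_x
        # Condition the inversion: for a maximization problem Q_uu should be negative
        # definite, so pushing its eigenvalues away from zero plays the role of the
        # Levenberg-Marquardt scheme of [13] in this simplified sketch.
        Q_uu_reg = Q_uu - reg * np.eye(Q_uu.shape[0])
        k_open = -np.linalg.solve(Q_uu_reg, Q_u)     # open-loop term  -Q_uu^{-1} Q_u
        K_fb   = -np.linalg.solve(Q_uu_reg, Q_ux)    # feedback gain   -Q_uu^{-1} Q_ux
        # Value model at step k, eq. (6b) and (6c)
        V_x_new  = Q_x  + K_fb.T @ Q_u
        V_xx_new = Q_xx + Q_ux.T @ K_fb
        return k_open, K_fb, V_x_new, V_xx_new

The returned open-loop and feedback terms are exactly the quantities applied in the forward sweep (7b).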
2.2.2 Co-location Vectors

Here we propose a new method of obtaining the quadratic model of Q (Eq. (2)), inspired by [16] (our method is a specific instantiation of a more general algorithm described therein). Instead of using (4), we fit this quadratic model to samples of the value function at a cloud of co-location vectors {x_i^k, u_i^k}_{i=1..p}, spanning the neighborhood of every state-action pair along the trajectory. We can directly measure r(x_i^k, u_i^k) and F(x_i^k, u_i^k) for each point in the cloud, and by using the approximation of the value function at the next time step, we can estimate the value of (2) at every point:

q(x_i^k, u_i^k) = r(x_i^k, u_i^k) + V^{k+1}(F(x_i^k, u_i^k))

Then we can insert the values of q(x_i^k, u_i^k) and (x_i^k, u_i^k) on the LHS and RHS of (3), respectively, and solve this set of p linear equations for the Q_{..} terms. If p > (3(n+m) + (n+m)^2)/2 and the cloud is in general configuration, the equations are non-singular and can be easily solved by a generic linear algebra package.

There are several advantages to using such a scheme. The full nonlinear model of F is used to construct Q, rather than only a second-order approximation. F_{xx}, which is an n x n x n tensor, need not be stored. The addition of more vectors can allow the modeling of noise, as suggested in [16]. In addition, this method allows us to more easily apply general coordinate transformations in order to represent V in some internal space, perhaps of lower dimension.

The main drawback of this scheme is the additional complexity of an O(N p^{\alpha_2}) term for solving the p-equation linear system. Because we can choose {x_i^k, u_i^k} in a way which makes the linear system sparse, we can enjoy the smaller exponent \alpha_2 < \alpha_1 of sparse methods and, at least for the experiments performed here, increase the running time only by a small factor. In the same manner that DDP is dually reminiscent of the Extended Kalman Filter, this method bears a resemblance to the test vectors propagated in the Unscented Kalman Filter [17], although we use a quadratic, rather than linear, number of co-location vectors.
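The fit itself is ordinary least squares: one row per co-location vector, with the monomials of (\delta x, \delta u) up to second order as regressors and the sampled q values as targets. The sketch below is our own minimal (dense, non-sparse) formulation, with the state and action offsets concatenated into a single vector \delta z = (\delta x, \delta u); the gradient and Hessian it returns contain Q_x, Q_u and the blocks Q_xx, Q_xu, Q_uu of (3).

    import numpy as np

    def fit_quadratic_q(dZ, q):
        """Least-squares fit of q_i ~ Q0 + g'dz_i + 0.5 dz_i' H dz_i over co-location samples.

        dZ : (p, d) offsets dz_i = (dx_i, du_i) from the nominal state-action pair
        q  : (p,)   sampled values r(x_i, u_i) + V^{k+1}(F(x_i, u_i))
        Returns the constant Q0, the gradient g (length d) and the symmetric Hessian H (d x d).
        """
        p, d = dZ.shape
        iu = np.triu_indices(d)
        # Regressors: 1, dz, and the upper-triangular entries of dz dz' (1/2 factor folded in;
        # off-diagonal pairs appear twice in the quadratic form, hence the factor 2).
        quad = 0.5 * dZ[:, :, None] * dZ[:, None, :]
        quad[:, iu[0], iu[1]] *= np.where(iu[0] == iu[1], 1.0, 2.0)
        A = np.hstack([np.ones((p, 1)), dZ, quad[:, iu[0], iu[1]]])
        coef, *_ = np.linalg.lstsq(A, q, rcond=None)
        Q0, g = coef[0], coef[1:1 + d]
        H = np.zeros((d, d))
        H[iu] = coef[1 + d:]
        H = H + H.T - np.diag(np.diag(H))   # symmetrize from the upper triangle
        return Q0, g, H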
2.3 Receding Horizon DDP

When seeking to synthesize a global controller from many local controllers, it is essential that the different local components operate synergistically. In our context this means that the local models of the value function must all model the same function, which is not the case for the standard DDP solution. The local quadratic models which DDP computes around the trajectory are approximations to V(x, k), the time-dependent value function. The standard method in RL for creating a global value function is to use an exponentially discounted horizon. Here we propose a fixed-length, non-discounted receding-horizon scheme in the spirit of Model Predictive Control [18].

Having computed a DDP solution to some problem starting from many different starting points x^1, we can discard all the models computed for points x^{k>1} and save only the ones around the x^1's. Although in this way we could accumulate a time-independent approximation to V(x, N) only, starting each run of N-step DDP from scratch would be prohibitively expensive. We therefore propose the following. After obtaining the solution starting from x^1, we save the local model at k = 1 and proceed to solve a new N-step problem starting at x^2, this time initialized with the policy obtained on the previous run, shifted by one time-step, and appended with the last control: u_new = [u^2, u^3, ..., u^N, u^N]. Because this control sequence is very close to the optimal solution, the second-order convergence of DDP is in full effect and the algorithm converges in 1 or 2 sweeps. Again saving the model at the first time step, we iterate. We stress that without the fast and exact convergence properties of DDP near the maximum, this algorithm would be far less effective.

2.4 Nearest Neighbor control with Trajectory Library

Along with the local models of V and the open-loop control sequence u^k, a run of DDP also provides the feedback gain term {-Q_{uu}^{-1} Q_{ux}}^k. This term in effect generalizes the policy to a tube around the trajectory, inside of which a basin-of-attraction is formed. Having lost the dependency on the time k with the receding-horizon scheme, we need some space-based method of determining which local gain model we select at a given state. The simplest choice, which we use here, is to select the nearest Euclidean neighbor. Outside of the basin-of-attraction of a single trajectory, we can expect the policy to perform very poorly and lead to numerical divergence if no constraint on the size of u is enforced. A possible solution to this problem is to fill some volume of the state space with a library of local-control trajectories [19], and consider all of them when selecting the nearest linear gain model.
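As a concrete illustration of this nearest-neighbor rule, the sketch below (the class name and flat-array library layout are our own assumptions) stores, for every point of every trajectory in the library, the nominal state, the nominal open-loop control and the feedback gain -Q_uu^{-1} Q_ux, and at run time applies the locally-linear policy of the Euclidean-nearest stored point.

    import numpy as np

    class TrajectoryLibraryController:
        """Nearest-neighbor selection of a locally-linear gain model (illustrative sketch)."""

        def __init__(self, states, controls, gains):
            self.X = np.asarray(states)    # (P, n) nominal states from all library trajectories
            self.U = np.asarray(controls)  # (P, m) corresponding nominal (open-loop) controls
            self.K = np.asarray(gains)     # (P, m, n) feedback gains -Q_uu^{-1} Q_ux

        def __call__(self, x):
            # Nearest Euclidean neighbor among all stored points.
            i = np.argmin(np.sum((self.X - x) ** 2, axis=1))
            # Locally-linear policy of that point: u = u_i + K_i (x - x_i).
            return self.U[i] + self.K[i] @ (x - self.X[i])

In practice the resulting u would also be clipped or otherwise bounded, since far from any trajectory the linear policy can diverge, as noted above.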
3 Experiments

3.1 The swimmer dynamical system

In this section we describe our variation of the d-link swimmer dynamical system, first presented in [20] (our modification being the addition of the laminar friction term k_t). A stick or link of length l, lying on a horizontal plane at an angle \theta to some direction, parallel to \hat{t} = (\cos\theta, \sin\theta)^T and perpendicular to \hat{n} = (-\sin\theta, \cos\theta)^T, moving with velocity \dot{x} in a viscous fluid, is postulated to admit a normal frictional force f_{rn} = -k_n \hat{n}(\dot{x}\cdot\hat{n}) and a tangential frictional force f_{rt} = -k_t \hat{t}(\dot{x}\cdot\hat{t}), with k_n > k_t > 0.

The swimmer is modeled as a chain of d such links of lengths l_i and masses m_i, its configuration described by the generalized coordinates q = (x_{cm}, \theta_1, ..., \theta_d), consisting of two center-of-mass coordinates and d angles. Defining the positions of the link centers relative to the center of mass, \bar{x}_i = x_i - x_{cm}, the Lagrangian is

L = \tfrac{1}{2}\dot{x}_{cm}^2 \sum_i m_i + \tfrac{1}{2}\sum_i m_i \dot{\bar{x}}_i^2 + \tfrac{1}{2}\sum_i I_i \dot{\theta}_i^2

with the moments of inertia I_i = \tfrac{1}{12} m_i l_i^2.

The relationship between the relative position vectors and the angles of the links is given by the d-1 equations

\bar{x}_{i+1} - \bar{x}_i = \tfrac{1}{2} l_{i+1}\hat{t}_{i+1} + \tfrac{1}{2} l_i \hat{t}_i,

which express the joining of successive links, and by the equation \sum_i m_i \bar{x}_i = 0, which comes from the definition of the \bar{x}_i's relative to the center of mass. The function

\mathcal{F} = -\tfrac{1}{2} k_n \sum_i \big[ l_i(\dot{x}_i\cdot\hat{n}_i)^2 + \tfrac{1}{12} l_i^3 \dot{\theta}_i^2 \big] - \tfrac{1}{2} k_t \sum_i l_i(\dot{x}_i\cdot\hat{t}_i)^2,

known as the dissipation function, is that function whose derivatives with respect to the \dot{q}_i's provide the postulated frictional forces. With these in place, we can obtain \ddot{q} from the 2+d Euler-Lagrange equations

\frac{d}{dt}\Big(\frac{\partial L}{\partial \dot{q}_i}\Big) = \frac{\partial \mathcal{F}}{\partial \dot{q}_i} + u_i,

with u being the external forces and torques applied to the system. By applying the d-1 torques \tau_j in action-reaction pairs at the joints, u_i = \tau_i - \tau_{i-1}, the isolated nature of the dynamical system is preserved. Performing the differentiations, solving for \ddot{q}, and setting the (4+2d)-dimensional state variable x = (q, \dot{q}) finally gives the full state dynamics \dot{x} = f(x, u). (A MATLAB file which calculates this function is available upon request from the authors.)

3.1.1 Internal coordinates

The two coordinates specifying the position of the center-of-mass and the d angles are defined relative to an external coordinate system, which the controller is denied access to. We make a coordinate transformation into internal coordinates, where only the d-1 relative angles \{\bar{\theta}_j = \theta_{j+1} - \theta_j\}_{j=1}^{d-1} are given, and the location of the target is given relative to a coordinate system fixed on one of the links. This makes the learning isotropic and independent of a specific location on the plane. The co-location method allows us to perform this transformation directly on the vector cloud without having to explicitly differentiate it, as we would have had to using classical DDP. Note also that this transformation reduces the dimension of the state (one angle less), suggesting the possibility of further dimensionality reduction.

Figure 1: The function r(x) = -||x||^2 / \sqrt{||x||^2 + 1}. The task of the swimmer is to reach the target with its "nose", an arbitrarily-designated joint. The figure shows the reward (y-axis) as a function of the distance from the target (x-axis): it is effectively linear far from the target and flattens only as the distance approaches zero. This form ensures that the swimming speed will be fixed at large distances, and will slow down only close to the goal.

3.2 The reward function

The reward function we used was

r(x, u) = -c_x \frac{\|x_{nose}\|^2}{\sqrt{\|x_{nose}\|^2 + 1}} - c_u \|u\|^2    (8)

where x_{nose} is the vector from some designated point on the swimmer's body to the target, and c_x and c_u are positive constants. This reward is maximized when the nose is brought to rest on the target under a quadratic action-cost penalty. The functional form of the target-reward term is designed to be linear in \|x_{nose}\| when far from the target and quadratic when close to it (Figure 1). Because of the differentiation in Eq. (5), the solution is independent of V_0, the constant part of the value. Therefore, in the linear regime of the reward function, the solution is independent of the distance from the target, and all the trajectories are quickly compelled to converge to a one-dimensional manifold in state-space which describes steady-state swimming. Upon nearing the target, the swimmer must initiate a braking maneuver and bring the nose to a standstill over the target. For targets that are near the swimmer, the behavior must also include various turns and jerks, quite different from steady-state swimming, which maneuver the nose into contact with the target. We believe the behavioral variety that would be exhibited by a hypothetical optimal controller for this system to be extremely large.
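For reference, here is a direct transcription of (8) as a Python function; the constants c_x and c_u and the computation of x_nose are left to the caller, since their values are implementation details not given above.

    import numpy as np

    def reward(x_nose, u, c_x=1.0, c_u=1.0):
        """Eq. (8): r(x, u) = -c_x ||x_nose||^2 / sqrt(||x_nose||^2 + 1) - c_u ||u||^2.

        Approximately -c_x ||x_nose|| far from the target (fixed swimming speed)
        and quadratic near it (braking), with a quadratic action cost.
        The default constants are illustrative, not the values used in the paper.
        """
        d2 = float(np.dot(x_nose, x_nose))
        return -c_x * d2 / np.sqrt(d2 + 1.0) - c_u * float(np.dot(u, u))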
4 Results

The demonstration of the results for this paper presents a particular challenge. Because we are concerned with learning effective behavior from a high-level reward signal, traditional performance metrics are inapplicable. We used no predesigned reference trajectory from which the deviation might be measured. We do not have access to an analytical closed form of the solution against which results may be compared. At the points along the RH-DDP trajectories, we know that the learned policy is locally optimal. The feedback gain model, which provides us with a locally-linear generalization of the policy in a volume around these points, becomes less accurate with distance. Though measurable, this discrepancy does not provide a useful performance metric, since it is not clear how it correlates with the swimmer's ability to meet its goals. The ideal performance metric would be a measurement of the fraction of the state-space volume in which the controller is able to produce an effective policy. Since we do not have the analytical or numerical tools to perform such an exhaustive measurement, we present a limited exploration of this global basin-of-attraction in the form of a movie, documenting the real-time interaction between the learned controller and a user moving the target with the mouse.

The movie shows interaction with four different learned controllers, all of them controlling swimmers with link lengths and masses equal to 1. The normal and tangential frictional force coefficients were k_n = 7.5 and k_t = 0.3. The function F computes a single 4th-order Runge-Kutta integration step of the continuous dynamics, F(x^k, u^k) = x^k + \int_t^{t+\Delta t} f(x(t), u^k)\,dt, with \Delta t = 0.05 s. The receding-horizon window was 40 time-steps, or 2 s.

4.0.1 The movie

The movie is a screen-capture of a real-time interaction of a user with the system, implemented in MATLAB (the movie and interaction package are available at the authors' web pages). The screen consists of two panes. In the left pane we draw the target and the swimmer, to observe its behavior. In the right pane, we draw the trajectories which constitute the controller, projected on the three largest eigenvectors of the normalized state covariance matrix. We also mark the current state of the swimmer, projected onto the same subspace, along with the point closest to it on any trajectory. This is the nearest-neighbor point, whose local gain model provides the instantaneous policy.

The first sequence shows a 5-link swimmer whose nose is at the head of the first link. The controller consists of a single RH-DDP trajectory, with the initial state lying several body-lengths away from the target. The movement along this trajectory involves swimming towards the target, coasting for a short time and then braking in order to bring the nose to rest on the target. The projection of this trajectory in the right pane is manifested as a spiral (swimming) which turns inward (coasting) and then turns out sharply in a direction orthogonal to the spiral (braking). If the user moves the target, the controller is able to cope only with states that are similar to the ones on the trajectory. When the user moves the target behind the swimmer, the generalization offered by the controller is no longer valid, and numerical divergence results.

In the second sequence we see the same 5-link swimmer, now controlled with a library of 30 trajectories. Of these 30 trajectories, 27 had a close initial target distance, corresponding to a small jerk of the body, while 3 trajectories had their initial target distance further off. The behavior displayed by this controller is clearly sub-optimal at times, but the goal of bringing the nose to the target is nevertheless consistently obtained. In our experiments with this controller we could not drive it to numerical divergence with the real-time mouse-based interaction. The controller was able to produce a meaningful, effective policy over the entire volume of the state space we explored.

In the third sequence we see a different 5-link swimmer, whose nose is at the point joining the second and third links.
This new configuration generates a qualitatively different behavior, both in the swimming phase and in the close-range maneuvering phase, thereby illustrating the richness of the swimmer domain. The parameters of the projection in the right pane were tuned to visually highlight the one-dimensional ring which corresponds to the optimal steady-state swimming gait (section 3.2). Again, we were unable to cause numerical divergence when interacting with this controller.

The final sequence shows a 15-link swimmer of 34 state dimensions and 14 action dimensions. Problems of such high dimensionality are prohibitively hard for algorithms with complexity exponential in the dimension. Like the 5-link swimmer of the first sequence, this controller consists of a single trajectory and can be driven to numerical divergence by user interaction. However, the basin of attraction is large enough to produce effective target-following behavior.

Figure 2: One snapshot from the movie attached to this submission. In the movie, the swimmer (left) propels itself and brings its "nose" (red dot) to the target (green dot in the middle), and then brakes its motion in order to keep its nose on the target. We then show the interactive mode of our simulation, where the user moves the target with the mouse cursor, with the swimmer following. On the right, we present all 30 trajectories that constitute the library for the nearest-neighbor controller, projected on the three largest eigenvectors of the spherized covariance matrix of all the points.

References

[1] R. Munos and A. W. Moore. Variable resolution discretization for high-accuracy solutions of optimal control problems. In International Joint Conference on Artificial Intelligence, pages 1348-1355, 1999.
[2] M. Stilman, C. G. Atkeson, J. J. Kuffner, and G. Zeglin. Dynamic programming in reduced dimensional spaces: Dynamic planning for robust biped locomotion. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation (ICRA 2005), pages 2399-2404, 2005.
[3] C. G. Atkeson. Using local trajectory optimizers to speed up global optimization in dynamic programming. In NIPS, pages 663-670, 1993.
[4] C. G. Atkeson and J. Morimoto. Non-parametric representation of policies and value functions: A trajectory-based approach. In Advances in Neural Information Processing Systems 15, 2003.
[5] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems 19, 2007.
[6] J. Morimoto and C. G. Atkeson. Minimax differential dynamic programming: An application to robust biped walking. In Advances in Neural Information Processing Systems 14, 2002.
[7] E. Todorov and W. Li. Optimal control methods suitable for biomechanical systems. In 25th Annual Int. Conf. of the IEEE Engineering in Medicine and Biology Society, 2003.
[8] R. Munos. Policy gradient in continuous time. Journal of Machine Learning Research, 7:771-791, 2006.
[9] J. Peters and S. Schaal. Reinforcement learning for parameterized motor primitives. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2006), 2006.
[10] A. Crespi and A. Ijspeert. AmphiBot II: An amphibious snake robot that crawls and swims using a central pattern generator. In Proceedings of the 9th International Conference on Climbing and Walking Robots (CLAWAR 2006), pages 19-27, 2006.
[11] D. Q. Mayne. A second-order gradient method for determining optimal trajectories for non-linear discrete-time systems. International Journal of Control, 3:85-95, 1966.
[12] D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. Elsevier, 1970.
[13] L.-Z. Liao and C. A. Shoemaker. Convergence in unconstrained discrete-time differential dynamic programming. IEEE Transactions on Automatic Control, 36(6):692-706, 1991.
[14] S. Yakowitz. Algorithms and computational techniques in differential dynamic programming. Control and Dynamic Systems: Advances in Theory and Applications, 31:75-91, 1989.
[15] L.-Z. Liao and C. A. Shoemaker. Advantages of differential dynamic programming over Newton's method for discrete-time optimal control problems. Technical Report 92-097, Cornell Theory Center, 1992.
[16] E. Todorov. Iterative local dynamic programming. Manuscript under review, available at www.cogsci.ucsd.edu/todorov/papers/ildp.pdf, 2007.
[17] S. J. Julier and J. K. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In Proceedings of AeroSense: The 11th Int. Symp. on Aerospace/Defence Sensing, Simulation and Controls, 1997.
[18] C. E. Garcia, D. M. Prett, and M. Morari. Model predictive control: theory and practice. Automatica, 25:335-348, 1989.
[19] M. Stolle and C. G. Atkeson. Policies based on trajectory libraries. In Proceedings of the International Conference on Robotics and Automation (ICRA 2006), 2006.
[20] R. Coulom. Reinforcement Learning Using Neural Networks, with Applications to Motor Control. PhD thesis, Institut National Polytechnique de Grenoble, 2002.