Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Alessandro Lazaric, Marcello Restelli, Andrea Bonarini
Department of Electronics and Information, Politecnico di Milano
piazza Leonardo da Vinci 32, I-20133 Milan, Italy
{bonarini,lazaric,restelli}@elet.polimi.it

Abstract

Learning in real-world domains often requires dealing with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can hardly be extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor's policy. The proposed approach has been empirically compared to other learning algorithms in several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river.

1 Introduction

Most of the research on Reinforcement Learning (RL) [13] has studied solutions to finite Markov Decision Processes (MDPs). On the other hand, learning in real-world environments requires dealing with continuous state and action spaces. While several studies have focused on problems with continuous states, little attention has been devoted to tasks involving continuous actions. Although several tasks may be (suboptimally) solved by coarsely discretizing the action variables (for instance, using the tile coding approach [11, 12]), a different approach is required for problems in which high-precision control is needed and actions slightly different from the optimal one lead to very low utility values. In fact, since RL algorithms need to experience each available action several times to estimate its utility, using very fine discretizations may be too expensive for the learning process. Some approaches, although using a finite set of target actions, deal with this problem by selecting real-valued actions obtained by interpolating the available discrete actions on the basis of their utility values [9, 14]. Despite this capability, the learning performance of these algorithms relies on strong assumptions about the shape of the value function that are not always satisfied in highly non-linear control problems. The wire fitting algorithm [2] (later adopted also in [4]) tries to solve this problem by implementing an adaptive interpolation scheme in which a finite set of (action, value) pairs is modified in order to better approximate the action-value function.

Besides having the capability of selecting any real-valued action, RL algorithms for continuous action problems should be able to efficiently find the greedy action, i.e., the action associated with the highest estimated value. Differently from the finite MDP case, a full search over a continuous action space to find the optimal action is often unfeasible. To overcome this problem, several approaches limit their search to a finite number of points. In order to keep this number low, many algorithms (e.g., tile coding and interpolation-based methods) need to make (often implicit) assumptions about the shape of the value function.
To overcome these difficulties, several approaches have adopted the actor-critic architecture [7, 10]. The key idea of actor-critic methods is to explicitly represent the policy (stored by the actor) with a memory structure independent of the one used for the value function (stored by the critic). In a given state, the policy followed by the agent is a probability distribution over the action space, usually represented by parametric functions (e.g., Gaussians [6], neural networks [14], fuzzy systems [5]). The role of the critic is, on the basis of the estimated value function, to criticize the actions taken by the actor, which consequently modifies its policy through a stochastic gradient on its parameter space. In this way, starting from a fully exploratory policy, the actor progressively changes its policy so that actions that yield higher utility values are more frequently selected, until the learning process converges to the optimal policy. By explicitly representing the policy, actor-critic approaches can efficiently implement the action selection step even in problems with continuous action spaces.

In this paper, we propose to use a Sequential Monte Carlo (SMC) method [8] to approximate the sequence of probability distributions implemented by the actor, thus obtaining a novel actor-critic algorithm called SMC-learning. Instead of a parametric function, the actor represents its stochastic policy by means of a finite set of random samples (i.e., actions) that, using simple resampling and moving mechanisms, is evolved over time according to the values stored by the critic. Actions are initially drawn from a prior distribution, and then they are resampled according to an importance sampling estimate which depends on the utility values learned by the critic. By means of the resampling and moving steps, the set of available actions becomes more and more dense around actions with larger utilities, thus encouraging a detailed exploration of the most promising action-space regions and allowing SMC-learning to find truly continuous actions. It is worth pointing out that the main goal here is not an accurate approximation of the action-value function over the whole action space, but to provide an efficient way to converge to the continuous optimal policy. The main characteristics of the proposed approach are: the agent may learn to execute any continuous action, the action selection phase and the search for the action with the best estimated value are computationally efficient, no assumption on the shape of the value function is required, the algorithm is model-free, and it may also learn to follow stochastic policies (needed in multi-agent problems).

In the next section, we introduce basic RL notation and briefly discuss issues about learning with continuous actions. Section 3 details the proposed learning approach (SMC-learning), explaining how SMC methods can be used to learn in continuous action spaces. Experimental results are discussed in Section 4, and Section 5 draws conclusions and contains directions for future research.

2 Reinforcement Learning

In reinforcement learning problems, an agent interacts with an unknown environment. At each time step, the agent observes the state, takes an action, and receives a reward. The goal of the agent is to learn a policy (i.e., a mapping from states to actions) that maximizes the long-term return.
An RL problem can be modeled as a Markov Decision Process (MDP) defined by a quadruple $\langle S, A, T, R, \gamma \rangle$, where S is the set of states, A(s) is the set of actions available in state s, $T : S \times A \times S \rightarrow [0, 1]$ is a transition distribution that specifies the probability of observing a certain state after taking a given action in a given state, $R : S \times A \rightarrow \mathbb{R}$ is a reward function that specifies the instantaneous reward for taking a given action in a given state, and $\gamma \in [0, 1)$ is a discount factor. The policy of an agent is characterized by a probability distribution $\pi(a|s)$ that specifies the probability of taking action a in state s. The utility of taking action a in state s and following policy $\pi$ thereafter is formalized by the action-value function $Q^{\pi}(s, a) = E\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_1 = s, a_1 = a\right]$, where $r_1 = R(s, a)$. RL approaches aim at learning the policy that maximizes the action-value function in each state. The optimal action-value function can be computed by solving the Bellman equation: $Q^*(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q^*(s', a')$. The optimal policy can be defined as the greedy action in each state: $\pi^*(a|s)$ is equal to $1/|\arg\max_a Q^*(s, a)|$ if $a \in \arg\max_a Q^*(s, a)$, and 0 otherwise.

Temporal Difference (TD) algorithms [13] allow the computation of $Q^{\pi}(s, a)$ by direct interaction with the environment. Given the tuple $\langle s_t, a_t, r_t, s_{t+1}, a_{t+1} \rangle$ (i.e., the experience performed by the agent), at each step, action values may be estimated by online algorithms, such as SARSA, whose update rule is:

    $Q(s_t, a_t) \leftarrow (1 - \alpha) Q(s_t, a_t) + \alpha\, u(r_t, a_{t+1}, s_{t+1})$,    (1)

where $\alpha \in [0, 1]$ is a learning rate and $u(r_t, a_{t+1}, s_{t+1}) = r_t + \gamma Q(s_{t+1}, a_{t+1})$ is the target utility.

Although value-function approaches have theoretical guarantees about convergence to the optimal policy and have been proved to be effective in many applications, they have several limitations: algorithms that maximize the value function cannot solve problems whose solutions are stochastic policies (e.g., multi-agent learning problems), and small errors in the estimated value of an action may lead to discontinuous changes in the policy [3], thus leading to convergence problems when function approximators are considered. These problems may be overcome by adopting actor-critic methods [7], in which the action-value function and the policy are stored in two distinct representations. The actor typically represents the distribution density over the action space through a parametric function $\pi(a|s, \theta)$, whose parameters are updated in the direction of performance improvement, as established by the critic on the basis of its approximation of the value function, which is usually computed through an on-policy TD algorithm.
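As a concrete illustration of the TD update used by the critic later on, the following Python sketch (illustrative names, tabular storage assumed) implements the SARSA rule of Eq. (1):

    from collections import defaultdict

    class TabularSarsaCritic:
        """Minimal sketch of the SARSA update of Eq. (1); names are illustrative."""

        def __init__(self, alpha=0.1, gamma=0.99):
            self.alpha = alpha            # learning rate
            self.gamma = gamma            # discount factor
            self.q = defaultdict(float)   # Q(s, a), lazily initialized to 0

        def update(self, s, a, r, s_next, a_next):
            # target utility u(r_t, a_{t+1}, s_{t+1}) = r_t + gamma * Q(s_{t+1}, a_{t+1})
            target = r + self.gamma * self.q[(s_next, a_next)]
            # Eq. (1): blend the old estimate with the target utility
            self.q[(s, a)] = (1 - self.alpha) * self.q[(s, a)] + self.alpha * target

The same rule, applied to one stored value per sampled action, is what the critic of SMC-learning uses in Section 3.2.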
3 SMC-Learning for Continuous Action Spaces

SMC-learning is based on an actor-critic architecture, in which the actor stores and updates, for each state s, a density distribution $\pi_t(a|s)$ that specifies the agent's policy at time instant t. At the beginning of the learning process, without any prior information about the problem, the actor usually considers a uniform distribution over the action space, thus implementing a fully exploratory policy. As the learning process progresses, the critic collects data for the estimation of the value function (in this paper, the critic estimates the action-value function) and provides the actor with information about which actions are the most promising. On the other hand, the actor changes its policy to improve its performance and to progressively reduce the exploration in order to converge to the optimal deterministic policy.

Instead of using parametric functions, in SMC-learning the actor represents its evolving stochastic policy by means of Monte Carlo sampling. The idea is the following: for each state s, the set of available actions A(s) is initialized with N samples drawn from a proposal distribution $\pi_0(a|s)$: $A(s) = \{a_1, a_2, \cdots, a_N\}$, $a_i \sim \pi_0(a|s)$. Each sampled action $a_i$ is associated with an importance weight $w_i \in W(s)$ whose value is initialized to 1/N, so that the prior density can be approximated as

    $\pi_0(a|s) \approx \sum_{i=1}^{N} w_i \cdot \delta(a - a_i)$,

where $a_i \in A(s)$, $w_i \in W(s)$, and $\delta$ is the Dirac delta measure. As the number of samples goes to infinity, this representation becomes equivalent to the functional description of the original probability density function. This means that the actor can approximately follow the policy specified by the density $\pi_0(a|s)$ by simply choosing actions at random from A(s), where the (normalized) weights are the selection probabilities. Given the continuous action-value function estimated by the critic and a suitable exploration strategy (e.g., Boltzmann exploration), it is possible to define the desired probability distribution over the continuous action space, usually referred to as the target distribution. As the learning process goes on, the action values estimated by the critic become more and more reliable, and the policy followed by the agent should change in order to choose actions with higher utilities more frequently. This means that, in each state, the target distribution changes according to the information collected during the learning process, and the actor must consequently adapt its approximation.

In general, when no information is available about the shape of the target distribution, SMC methods can be effectively employed to approximate sequences of probability distributions by means of random samples, which are evolved over time exploiting importance sampling and resampling techniques. The idea behind importance sampling is to modify the weights of the samples to account for the differences between the target distribution p(x) and the proposal distribution q(x) used to generate the samples. By setting each weight $w_i$ proportional to the ratio $p(x_i)/q(x_i)$, the discrete weighted distribution $\sum_{i=1}^{N} w_i \cdot \delta(x - x_i)$ better approximates the target distribution. In our context, the importance sampling step is performed by the actor, which modifies the weights of the actions according to the utility values estimated by the critic. When some samples have very small or very large normalized weights, the target density significantly differs from the proposal density used to draw the samples. From a learning perspective, this means that the set of available actions contains a number of samples whose estimated utility is very low. To avoid this, the actor has to modify the set of available actions by resampling new actions from the current weighted approximation of the target distribution.
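A minimal sketch of this weighted-sample policy representation, assuming a uniform proposal over a bounded action interval and illustrative function names:

    import random

    def init_action_samples(n, a_min, a_max, rng=None):
        """Draw N actions from a uniform proposal pi_0(a|s) and assign uniform weights 1/N,
        mirroring the discrete approximation sum_i w_i * delta(a - a_i)."""
        rng = rng or random.Random(0)
        actions = [rng.uniform(a_min, a_max) for _ in range(n)]
        weights = [1.0 / n] * n
        return actions, weights

    def select_action(actions, weights, rng=None):
        """Action selection: draw one action from the discrete weighted distribution."""
        rng = rng or random.Random()
        return rng.choices(actions, weights=weights, k=1)[0]

Drawing by bisecting the cumulative weights keeps the selection step logarithmic in the number of samples.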
In SMC-learning, SMC methods are embedded into a learning algorithm that iterates through three main steps (see Algorithm 1): the action selection performed by the actor, the update of the action-value function managed by the critic, and finally the update of the policy of the actor.

Algorithm 1: The SMC-learning algorithm
    for all s in S do
        Initialize A(s) by drawing N samples from $\pi_0(a|s)$
        Initialize W(s) with uniform values: $w_i = 1/N$
    end for
    for each time step t do
        Action Selection: given the current state $s_t$, the actor selects action $a_t$ from $A(s_t)$ according to $\pi_t(a|s) = \sum_{i=1}^{N} w_i \cdot \delta(a - a_i)$
        Critic Update: given the reward $r_t$ and the utility of the next state $s_{t+1}$, the critic updates the action value $Q(s_t, a_t)$
        Actor Update: given the action-value function, the actor updates the importance weights
        if the weights have a high variance then
            resample the set $A(s_t)$
        end if
    end for

3.1 Action Selection

One of the main issues of learning in continuous action spaces is to determine which is the best action in the current state, given the (approximated) action-value function. Actor-critic methods effectively solve this problem by explicitly storing the current policy. As previously described, in SMC-learning the actor performs the action selection step by taking one action at random among those available in the current state. The probability of extracting each action is equal to its normalized weight, $\Pr(a_i|s) = w_i$. The time complexity of the action selection phase of SMC-learning is logarithmic in the number of action samples.

3.2 Critic Update

While the actor determines the policy, the critic, on the basis of the collected rewards, computes an approximation of the action-value function. Although several function approximation schemes could be adopted for this task (e.g., neural networks, regression trees, support vector machines), we use a simple solution: the critic stores an action value, $Q(s, a_i)$, for each action available in state s (as in tabular approaches) and modifies it according to TD update rules (see Equation 1). Using on-policy algorithms, such as SARSA, the time complexity of the critic update is constant (i.e., it does not depend on the number of available actions).

3.3 Actor Update

The core of SMC-learning is the update of the policy distribution performed by the actor. Using the importance sampling principle, the actor modifies the weights $w_i$, thus performing a policy improvement step based on the action values computed by the critic. In this way, actions with higher estimates get more weight. Several RL schemes could be adopted to update the weights. In this paper, we focus on the Boltzmann exploration strategy [13], which privileges the execution of actions with higher estimated utility values. The probabilities computed by the Boltzmann exploration can be used as weights for the available actions. At time instant t, the weight of action $a_i$ in state s is updated as follows:

    $w_i^{t+1} = \frac{w_i^t\, e^{\Delta Q_{t+1}(s, a_i)/\tau}}{\sum_{j=1}^{N} w_j^t\, e^{\Delta Q_{t+1}(s, a_j)/\tau}}$,    (2)

where $\Delta Q_{t+1}(s, a_i) = Q_{t+1}(s, a_i) - Q_t(s, a_i)$, and the parameter $\tau$ (usually referred to as the temperature) specifies the exploration degree: the higher $\tau$, the higher the exploration.
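The weight update of Eq. (2) can be sketched as follows (illustrative names; delta_q collects the increments $\Delta Q_{t+1}(s, a_i)$ for the samples of the current state):

    import math

    def actor_update(weights, delta_q, tau):
        """Eq. (2): scale each weight by exp(DeltaQ / tau) and renormalize.
        tau is the Boltzmann temperature; larger tau means more exploration."""
        scaled = [w * math.exp(dq / tau) for w, dq in zip(weights, delta_q)]
        total = sum(scaled)
        return [s / total for s in scaled]

Because the update multiplies the current weights, actions whose estimated value keeps increasing accumulate probability mass over time.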
Once the weights have been modified, the agent's policy has changed. Unfortunately, it is not possible to optimally solve continuous action MDPs by exploring only a finite set of actions sampled from a prior distribution, since the optimal action may not be available. Since the prior distribution used to initialize the set of available actions significantly differs from the optimal policy distribution, after a few iterations several actions will have negligible weights: this problem is known as the weight degeneracy phenomenon [1]. Since the number of samples should be kept low for efficiency reasons, having actions associated with very small weights means wasting learning parameters for approximating both the policy and the value function in regions of the action space that are not relevant with respect to the optimal policy. Furthermore, long learning time is spent executing and updating the utility values of actions that are not likely to be optimal. Therefore, following the SMC approach, after the importance sampling phase a resampling step may be needed in order to improve the distribution of the samples over the action domain. The degeneracy phenomenon can be measured through the effective sample size [8], which, for each state s, can be estimated by

    $N_{eff}(s) = \frac{1}{\sum_{w_i \in W(s)} \tilde{w}_i^2}$,    (3)

where $\tilde{w}_i$ is the normalized weight. $N_{eff}(s)$ is always less than the number of actions contained in A(s), and low values of $N_{eff}(s)$ reveal high degeneracy. In order to avoid high degeneracy, the actions are resampled whenever the ratio between the effective sample size $N_{eff}(s)$ and the number of samples N falls below a given threshold. The goal of resampling methods is to replace samples with small weights with new samples close to samples with large weights, so that the discrepancy between the resampled weights is reduced. The new set of samples is generated by resampling (with replacement) N times from the following discrete distribution

    $\hat{\pi}(a|s) = \sum_{i=1}^{N} w_i \cdot \delta(a - a_i)$,    (4)

so that samples with high weights are selected many times. Among the several resampling approaches that have been proposed, here we consider the systematic resampling scheme, since it can be easily implemented, takes O(N) time, and minimizes the Monte Carlo variance (refer to [1] for more details). The new samples inherit the action values of their parents, and the sample weights are initialized using the Boltzmann distribution.
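A sketch of the degeneracy test of Eq. (3) and of one common formulation of systematic resampling (a single offset drawn in [0, 1/N) and N evenly spaced points matched against the cumulative weights); names are illustrative and normalized weights are assumed:

    import random

    def effective_sample_size(weights):
        """Eq. (3): N_eff = 1 / sum_i w_i^2; low values reveal high degeneracy."""
        return 1.0 / sum(w * w for w in weights)

    def systematic_resample(actions, weights, rng=None):
        """Resample N actions (with replacement) from the weighted set in O(N) time."""
        rng = rng or random.Random(0)
        n = len(actions)
        u0 = rng.random() / n                      # single random offset in [0, 1/N)
        points = [u0 + i / n for i in range(n)]    # evenly spaced sampling points
        cumulative, c = [], 0.0
        for w in weights:
            c += w
            cumulative.append(c)
        resampled, j = [], 0
        for u in points:
            while j < n - 1 and u > cumulative[j]:
                j += 1
            resampled.append(actions[j])
        return resampled

Resampling is triggered whenever $N_{eff}(s)/N$ falls below the chosen threshold, and the resampled actions inherit their parents' action values as described above.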
Although the resampling step reduces the degeneracy, it introduces another problem known as sample impoverishment. Since samples with large weights are replicated several times, after a few resampling steps a significant number of samples could be identical. Furthermore, we need to learn over a continuous space, and this cannot be carried out using a discrete set of fixed samples; in fact, the learning agent would not be able to achieve the optimal policy whenever the initial set of available actions in state s (A(s)) does not contain the optimal action for that state. This limitation may be overcome by means of a smoothing step, which consists of moving the samples according to a continuous approximation $\hat{\pi}(a|s, w)$ of the posterior distribution. The approximation is obtained by using a weighted mean of kernel densities:

    $\hat{\pi}(a|s, w) = \sum_{i=1}^{N} \frac{w_i}{h} K\left(\frac{a - a_i}{h}\right)$,    (5)

where h > 0 is the kernel bandwidth. Typical choices for the kernel densities are Gaussian kernels and Epanechnikov kernels. However, these kernels produce over-dispersed posterior distributions, and this negatively affects the convergence speed of the learning process, especially when few samples are used. Here, we propose to use uniform kernels:

    $K_i(a) = U\left[\frac{a_{i-1} - a_i}{2}; \frac{a_{i+1} - a_i}{2}\right]$.    (6)

As far as boundary samples are concerned (i.e., $a_1$ and $a_N$), their kernels are set to $K_1(a) = U\left[(a_1 - a_2); (a_2 - a_1)/2\right]$ and $K_N(a) = U\left[(a_{N-1} - a_N)/2; (a_N - a_{N-1})\right]$, respectively, thus preserving the possibility to cover the whole action domain. Using these (non-overlapping) kernel densities, each sample is moved locally within an interval determined by its distances from the adjacent samples, thus achieving fast convergence.

Besides reducing the dispersion of the samples, this resampling scheme implements, from the critic perspective, a variable resolution generalization approach. Since the resampled actions inherit the action value associated with their parent, the learned values are generalized over a region whose width depends on the distance between samples. As a result, at the beginning of the learning process, when the actions are approximately uniformly distributed, SMC-learning performs broad generalization, thus boosting the performance. On the other hand, when the learning is near convergence, the available actions tend to group around the optimal action, thus automatically reducing the generalization that might otherwise prevent the learning of the optimal policy (see [12]).

4 Experiments

In this section, we show experimental results with the aim of analyzing the properties of SMC-learning and comparing its performance with other RL approaches. Additional experiments on a mini-golf task and on the swing-up pendulum problem are reported in the Appendix.

4.1 The Boat Problem

To illustrate how the SMC-learning algorithm works and to assess its effectiveness with respect to approaches based on discretization, we used a variant of the boat problem introduced in [5]. The problem is to learn a controller to drive a boat from the left bank to the right-bank quay of a river with a strong non-linear current (see Figure 1). The boat's bow coordinates, x and y, are defined in the range [0, 200], and the controller sets the desired direction U over the range $[-90°, 90°]$.

Figure 1: The boat problem (the 200 x 200 river area, the current, the quay on the right bank, and the surrounding viability zone).

Table 1: The dynamics parameters.
    $f_c$ = 1.25, I = 0.1, $s_{MAX}$ = 2.5, $s_D$ = 1.75, p = 0.9, quay = (200, 110), $Z_s$ width = 0.2, $Z_v$ width = 20.

Table 2: The learning parameters (entries with a slash are initial value/decreasing rate pairs).
    All algorithms: 0.5/0.01 and 0.99; SARSA: 3.0/0.0001; SMC: 0.95 and 25.0/0.0005; Cont.-QL: 0.4/0.005.

The dynamics of the boat's bow coordinates is described by the following equations:

    $x_{t+1} = \min(200, \max(0, x_t + s_{t+1} \cos(\theta_{t+1})))$
    $y_{t+1} = \min(200, \max(0, y_t - s_{t+1} \sin(\theta_{t+1}) - E(x_{t+1})))$

where the effect of the current is defined by $E(x) = f_c\left(\frac{x}{50} - \left(\frac{x}{100}\right)^2\right)$, with $f_c$ the force of the current, and the boat angle $\theta_t$ and speed $s_t$ are updated according to the desired direction $U_{t+1}$ as:

    $\omega_{t+1} = \omega_t + I \delta_{t+1}$
    $\theta_{t+1} = \theta_t + (\omega_{t+1} - \omega_t)(s_{t+1}/s_{MAX})$
    $s_{t+1} = s_t + (s_D - s_t) I$
    $\delta_{t+1} = \min(\max(p(U_{t+1} - \theta_t), -45°), 45°)$

where I is the system inertia, $s_{MAX}$ is the maximum speed allowed for the boat, $s_D$ is the speed goal, $\delta$ is the rudder angle, and p is a proportional coefficient used to compute the rudder angle in order to reach the desired direction $U_{t+1}$.
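A minimal sketch of one simulation step of these dynamics, with the Table 1 values as defaults; angles are handled in degrees and the helper names are illustrative:

    import math

    def clamp(v, lo, hi):
        return min(hi, max(lo, v))

    def current_effect(x, f_c=1.25):
        # E(x) = f_c * (x/50 - (x/100)^2): the non-linear current of the river
        return f_c * (x / 50.0 - (x / 100.0) ** 2)

    def boat_step(x, y, theta, omega, s, u_desired,
                  f_c=1.25, inertia=0.1, s_max=2.5, s_d=1.75, p=0.9):
        """One transition of the boat dynamics (angles in degrees, Table 1 defaults)."""
        delta = clamp(p * (u_desired - theta), -45.0, 45.0)    # rudder angle
        s_next = s + (s_d - s) * inertia                       # speed tracks the speed goal
        omega_next = omega + inertia * delta
        theta_next = theta + (omega_next - omega) * (s_next / s_max)
        x_next = clamp(x + s_next * math.cos(math.radians(theta_next)), 0.0, 200.0)
        y_next = clamp(y - s_next * math.sin(math.radians(theta_next))
                       - current_effect(x_next, f_c), 0.0, 200.0)
        return x_next, y_next, theta_next, omega_next, s_next

A trial can then be simulated by repeatedly feeding the controller's desired direction U into boat_step until the boat reaches the right bank.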
The reward function is defined over three bank zones. The success zone $Z_s$ corresponds to the quay, the viability zone $Z_v$ is defined around the quay, and the failure zone $Z_f$ covers all the other bank points. Therefore, the reward function is defined as:

    $R(x, y) = \begin{cases} +10 & (x, y) \in Z_s \\ D(x, y) & (x, y) \in Z_v \\ -10 & (x, y) \in Z_f \\ 0 & \text{otherwise} \end{cases}$    (7)

where D is a function that gives a reward decreasing linearly from 10 to -10 relative to the distance from the success zone.

Figure 2: Performance comparison between SMC-learning and SARSA (left) and between SMC-learning, tile coding, and Continuous Q-learning (right); both plots report the total reward per episode as a function of learning episodes (x1000).

In the experiment, each state variable is discretized into 10 intervals, and the parameters of the dynamics are those listed in Table 1. At each trial, the boat is positioned at random along the left bank in one of the points shown in Figure 1. In the following, we compare the results obtained with four different algorithms: SARSA with Boltzmann exploration and different discretizations of the action space, SARSA with tile coding (or CMAC) [12], Continuous Q-learning [9], and SMC-learning. The learning parameters of each algorithm are listed in Table 2 (here $\epsilon_x$ denotes the decreasing rate for parameter x, whose value after N trials is computed as $x(N) = x(0)/(1 + \epsilon_x N)$).

Figure 2-left compares the learning performance (in terms of total reward per episode) of SARSA with 5, 10, 20, and 40 evenly distributed actions to the results obtained by SMC-learning with 5 and 10 samples. As can be noticed, the more actions are available, the better the performance of SARSA. With only 5 actions (one action every 36°), the paths that the controller can follow are quite limited and the quay is not reachable from any of the starting points. As a result, the controller learned by SARSA achieves very poor performance. On the other hand, a finer discretization allows the boat to reach the quay more frequently, even if it takes about three times the number of episodes to converge with respect to the case with 5 actions. As can be noticed, SMC-learning with 5 samples outperforms SARSA with 5 and 10 actions both in terms of performance and in convergence time. In fact, after a few trials, SMC-learning succeeds in removing the lowest-valued samples and in adding new samples in regions of the action space where higher rewards can be obtained. As a result, not only does it achieve better performance than SARSA, but it also does not spend time exploring useless actions, thus improving the convergence time as well. Nonetheless, with only 5 samples the actor stores a very roughly approximated policy, which, as a consequence of resampling, may converge to actions that do not achieve performance as good as that of SARSA with 20 and 40 actions. By increasing the number of samples from 5 to 10, SMC-learning achieves a better coverage of the action space and obtains performance equivalent to that of SARSA with 40 actions. At the same time, while SARSA takes longer to converge as more actions become available, the convergence time of SMC-learning, as in the case with 5 samples, benefits from the initial resampling, taking less than one sixth of the trials needed by SARSA to converge.
Figure 2-right shows the comparison of the performance of SMC-learning, SARSA with tile coding using two tilings and a resolution of 2.25° (equivalent to 80 actions), and Continuous Q-learning with 40 actions. We omit the results with fewer actions because both tile coding and Continuous Q-learning obtain poor performance. As can be noticed, SMC-learning outperforms both of the compared algorithms. In particular, the generalization over the action space performed by tile coding negatively affects the learning performance because of the non-linearity of the dynamics of the system. In fact, when only a few actions are available, two adjacent actions may have completely different effects on the dynamics and, thus, receive different rewards. Generalizing over these actions prevents the agent from learning which is the best action among those available. On the other hand, as the samples get closer, SMC-learning dynamically reduces its generalization over the action space, so that their utility can be more accurately estimated. Similarly, Continuous Q-learning is strictly tied to the actions provided by the designer and to the implicit assumption of linearity of the action-value function. As a result, although it could learn any real-valued action, it does not succeed in obtaining the same performance as SMC-learning even with four times as many actions. In fact, the capability of SMC-learning to move samples towards more rewarding regions of the action space allows the agent to learn more effective policies even with a very limited number of samples.

5 Conclusions

In this paper, we have described a novel actor-critic algorithm to solve continuous action problems. The algorithm is based on a Sequential Monte Carlo approach that allows the actor to represent the current policy through a finite set of available actions associated with weights, which are updated using the utility values computed by the critic. Experimental results show that SMC-learning is able to identify the highest-valued actions through a process of importance sampling and resampling. This allows SMC-learning to obtain better performance than static solutions such as Continuous Q-learning and tile coding, even with a very limited number of samples, while also improving the convergence time. Future research activity will follow two main directions: extending SMC-learning to problems in which no good discretization of the state space is known a priori, and experimenting in continuous-action multi-agent problems.

References

[1] M. Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.
[2] Leemon C. Baird and A. Harry Klopf. Reinforcement learning with high-dimensional, continuous actions. Technical Report WL-TR-93-117, Wright-Patterson Air Force Base Ohio: Wright Laboratory, 1993.
[3] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[4] Chris Gaskett, David Wettergreen, and Alexander Zelinsky. Q-learning in continuous state and action spaces. In Australian Joint Conference on Artificial Intelligence, pages 417–428, 2003.
[5] L. Jouffe. Fuzzy inference system learning by reinforcement methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 28(3):338–355, 1998.
[6] H. Kimura and S. Kobayashi. Reinforcement learning for continuous action using stochastic gradient ascent. In 5th Intl. Conf. on Intelligent Autonomous Systems, pages 288–295, 1998.
[7] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.
[8] J. S. Liu and R. Chen. Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93:1032–1044, 1998.
[9] Jose Del R. Millan, Daniele Posenato, and Eric Dedieu. Continuous-action Q-learning. Machine Learning, 49:247–265, 2002.
[10] Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), pages 2219–2225, 2006.
[11] J. C. Santamaria, R. S. Sutton, and A. Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6:163–217, 1998.
[12] Alexander A. Sherstov and Peter Stone. Function approximation via tile coding: Automating parameter choice. In SARA 2005, LNAI, pages 194–205. Springer-Verlag, 2005.
[13] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[14] Hado van Hasselt and Marco Wiering. Reinforcement learning in continuous action spaces. In 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 272–279, 2007.