Nonparametric Return Distribution Approximation for Reinforcement Learning

Tetsuro Morimura, IBM Research - Tokyo, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan (tetsuro@jp.ibm.com)
Masashi Sugiyama, Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro-ku, Tokyo 152-8552, Japan (sugi@cs.titech.ac.jp)
Hisashi Kashima, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan (kashima@mist.i.u-tokyo.ac.jp)
Hirotaka Hachiya, Tokyo Institute of Technology (hachiya@sg.cs.titech.ac.jp)
Toshiyuki Tanaka, Kyoto University, 36-1 Yoshida-Honmachi, Sakyo-ku, Kyoto-shi, Kyoto 606-8501, Japan (tt@i.kyoto.ac.jp)

Abstract

Standard reinforcement learning (RL) aims to optimize decision-making rules in terms of the expected return. However, especially for risk-management purposes, other criteria such as the expected shortfall are sometimes preferred. Here, we describe a method of approximating the distribution of returns, which allows us to derive various kinds of information about the returns. We first show that the Bellman equation, which is a recursive formula for the expected return, can be extended to the cumulative return distribution. Then we derive a nonparametric return distribution estimator with particle smoothing based on this extended Bellman equation. A key aspect of the proposed algorithm is to represent the recursion relation in the extended Bellman equation by a simple procedure that replaces the particles associated with a state with those of the successor state. We show that our algorithm leads to a risk-sensitive RL paradigm. The usefulness of the proposed approach is demonstrated through numerical experiments.
1. Introduction

Most reinforcement learning (RL) methods attempt to find decision-making rules that maximize the expected return, where the return is defined as the cumulative (discounted) total of immediate rewards. Most of the theories in RL have been developed for working with the expected return as the objective function. However, users are sometimes interested in controlling the risk. For example, since maximizing the expected return may accept rare occurrences of large negative outcomes, some users, especially those who plan to apply an RL algorithm to practical applications, prefer a policy that avoids (or constrains) small chances of suffering a large loss (Geibel & Wysotzki, 2005; Defourny et al., 2008). On another front, some users will prefer a criterion that is robust against outlier events, as in cases where an estimate of the expected return would be severely affected by rare occurrences of such events (Sugiyama et al., 2009).

In this paper, we describe an approach to handling various risk-sensitive and/or robust criteria in a unified manner, in which the distribution of the possible returns is approximated and the criteria are then evaluated based on the approximated distribution. To achieve this, we first extend the Bellman equation for conditional expectations of the returns to conditional cumulative distributions of the returns, as shown in Section 3. We next describe in Section 4 a nonparametric approach with particles for solving the Bellman equations for distributions, which can be regarded as an extension of conventional temporal-difference learning. In Section 5, we demonstrate how to apply our approach to a risk-sensitive RL scenario, focusing on the expected shortfall, also called conditional value-at-risk (CVaR), of the returns as an example of risk-sensitive criteria. In Section 6, numerical experiments show that the proposed algorithms are promising for return distribution approximation and also in a risk-sensitive RL scenario.

2. Background of Value-based RL

We briefly review the framework of value-based RL and our motivation for approximating the return distributions.

2.1. Markov Decision Process (MDP)

An RL problem is usually defined on a discrete-time Markov decision process (MDP) (Bertsekas, 1995; Sutton & Barto, 1998). The MDP is defined by the quintuplet (S, A, p_T, P^R, γ), where S ∋ s and A ∋ a are finite sets of states and actions, respectively. The state transition probability p_T : S × A × S → [0, 1] is a function of a state s, an action a, and the successor state s_{+1}, i.e., p_T(s_{+1}|s, a) ≜ Pr(s_{+1}|s, a).¹ Here, s_{+k} and a_{+k} denote the state and the action k time-steps after the state s and the action a, respectively. An immediate reward r ∈ R is distributed according to the reward probability P^R : R × S × A × S → [0, 1], which is a function of r, s, a, and s_{+1}, i.e., P^R(r|s, a, s_{+1}) ≜ Pr(R ≤ r|s, a, s_{+1}).² The policy π : A × S → [0, 1] is a conditional probability function of a given s, which defines the decision-making rule of the learning agent, i.e., π(a|s) ≜ Pr(a|s). There are various policy models based on a value function, such as the greedy, ε-greedy, or soft-max action selection models (Sutton & Barto, 1998).

¹ Although to be precise it should be Pr(S_{+1} = s_{+1} | S = s, A = a) for the random variables S_{+1}, S, and A, we often write Pr(s_{+1}|s, a) for simplicity. The same simplification is used for other distributions.
² While r(s, a, s_{+1}) may often be deterministic, we assume r to be stochastic in order to consider a more general case of RL. All results presented here are applicable to the deterministic reward case as well.

2.2. From expected returns to return distributions

Let us define the return as the following time-discounted cumulative reward with a discount rate γ ∈ [0, 1),

    η ≜ lim_{T→∞} Σ_{t=0}^{T} γ^t r_{+t},    (1)

where r_{+0} is the original r. Alternatively, the return could be defined as Σ_{t=0}^{T} γ^t r_{+t} with a bounded T and γ ∈ [0, 1] in an episodic problem. The return is observed by the learning agent with (infinite) time delay and is usually a random variable, which we denote by ℰ, reflecting randomness due to p_T, P^R, and π. Once the policy is fixed, the MDP is regarded as a Markov chain M(π) ≜ {S, A, p_T, P^R, π, γ}. We write the (conditional) cumulative distribution functions of the returns as P^η : R × S × A × M → [0, 1] and P̄^η : R × S × M → [0, 1], where M is a family of Markov chains,

    P^η(η | s, a) ≜ Pr(ℰ ≤ η | s, a, M(π)),
    P̄^η(η | s) ≜ Σ_{a∈A} π(a|s) Pr(ℰ ≤ η | s, a, M(π)).

A statistic of the return from a state (or a state-action pair) is called a value function of the state (or the state-action pair). The objective of RL is formalized as maximizing the value function over the states. Although various return statistics could be considered for the value function, in most value-based RL problems the expected return conditioned on a state s or a state-action pair (s, a) is used as the value function,

    V^π(s) ≜ E_π[ℰ | s],    Q^π(s, a) ≜ E_π[ℰ | s, a],    (2)

where E_π[·] denotes the expectation over the Markov chain M(π). The function Q^π(s, a) is called the state-action value or Q-value function and often serves as a basis of the policy. However, decision making that relies only on information about the expected return is insufficient for controlling risk. These value functions are often inappropriate for risk-sensitive criteria, as described in Section 1. In addition, the expected return is not robust against outliers, i.e., these value functions can be severely affected by rare occurrences of large noises that could be included in the reward or state observations (Sugiyama et al., 2009). Altogether, the major drawback of the ordinary RL approach follows from the fact that it ignores all information about the return except for its expectation. In actuality, if we have an approximation of the return distribution, it gives us access to much more information about the return and allows us to handle various risk-sensitivity or robustness criteria for the return. This is why we focus on approximating the return distribution.
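As a toy illustration of this point (our own sketch, not from the paper; the reward model below is a made-up stand-in for rollouts of a fixed policy on some MDP), the expected return and a tail statistic of the same return distribution can tell very different stories:

import random

def discounted_return(rewards, gamma=0.95):
    """Truncated version of Eq. (1): sum_t gamma^t * r_{+t}."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

def sample_rollout(horizon=200):
    """Stand-in for an MDP rollout: rewards are usually small and positive,
    but a rare large loss occurs, so the mean hides the risk."""
    return [(-20.0 if random.random() < 0.01 else 0.1) for _ in range(horizon)]

returns = sorted(discounted_return(sample_rollout()) for _ in range(1000))
mean_return = sum(returns) / len(returns)
worst_5pct = returns[: len(returns) // 20]
print("expected return ~", round(mean_return, 3))
print("mean of the worst 5% of returns ~", round(sum(worst_5pct) / len(worst_5pct), 3))

Two policies could have the same expected return while differing drastically in the lower tail; only the distribution itself exposes this difference.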
2.3. Related work involving the return distribution

There are several approaches to estimating statistics of the return distribution beyond the expected return. In the seminal paper of Dearden et al. (1998), a recursive formula for the second moment of the return is shown for the case where the reward is a deterministic function of the state-action pair,

    E{ℰ² | s} = r² + 2γ r V^π(s_{+1}) + γ² E{ℰ² | s_{+1}}.

They developed a Bayesian learning method in which the return distribution is estimated with the normal-gamma distribution. However, this approach sometimes requires numerical integration and extensive computation time, whereas our proposed algorithm does not require any numerical integration. Along a similar line, mean-variance model-free RL algorithms were developed by Sato et al. (2001), Engel et al. (2005), and Ghavamzadeh & Engel (2007), and were successful in variance-penalized MDPs. However, these mean-variance-based algorithms assume that the return distribution is Gaussian, which may not be true in practice. In contrast, our nonparametric approach using particles has many degrees of freedom as regards the return distribution model.

3. Distributional Bellman Equation for the Cumulative Return Distribution

We derive a Bellman-type recursive formula for the return distribution, comparing it with the ordinary Bellman equation for the Q-value function. From the definitions of the return (Eq. (1)) and the Q-value function (Eq. (2)), the following equation is derived (Sutton & Barto, 1998),

    Q^π(s, a) = Σ_{s_{+1}∈S} p_T(s_{+1}|s, a) [ ∫_{r∈R} r dP^R(r|s, a, s_{+1}) + γ Σ_{a_{+1}∈A} π(a_{+1}|s_{+1}) Q^π(s_{+1}, a_{+1}) ].

This equation is well known as the Bellman equation for the Q-value function. Similarly, we can derive a Bellman-type recursive formula for the conditional cumulative distribution of the returns given a state-action pair, which we call the (cumulative) distributional Bellman equation for the returns.

Proposition 1  The (cumulative) distributional Bellman equation for the return is given as

    P^η(η|s, a) = Σ_{s_{+1}∈S} Σ_{a_{+1}∈A} p_T(s_{+1}|s, a) π(a_{+1}|s_{+1}) ∫_{r∈R} P^η((η − r)/γ | s_{+1}, a_{+1}) dP^R(r|s, a, s_{+1}).    (3)

Proof: Here we use a strict notation only for the return distribution, such as Pr(ℰ ≤ η | s, a) := P^η(η|s, a). From the definition of the return in Eq. (1), the recursive form with respect to the return is η = r + γη_{+1}. The random variables R = r and ℰ_{+1} = η_{+1} in this equation are conditionally independent given the successor state s_{+1}, since they are distributed as r ~ P^R(r|s, a, s_{+1}) and η_{+1} ~ P^η(η_{+1}|s_{+1}, a_{+1}), respectively, where "x ~ P_X(x)" means that x is distributed according to a probability function P_X(x). Therefore, by considering the convolution integral with respect to r, the proposition (Eq. (3)) is proved as

    Pr(ℰ ≤ η | s, a) = Pr(R + γℰ_{+1} ≤ η | s, a)
                     = Σ_{s_{+1}∈S} Σ_{a_{+1}∈A} p_T(s_{+1}|s, a) π(a_{+1}|s_{+1}) ∫_{r∈R} Pr(ℰ_{+1} ≤ (η − r)/γ | s_{+1}, a_{+1}) dP^R(r|s, a, s_{+1})
                     = Σ_{s_{+1}∈S} Σ_{a_{+1}∈A} p_T(s_{+1}|s, a) π(a_{+1}|s_{+1}) ∫_{r∈R} P^η((η − r)/γ | s_{+1}, a_{+1}) dP^R(r|s, a, s_{+1}).

The distributional Bellman equation for the conditional return distribution given a state-action pair (Eq. (3)) is directly extended to that given a state as

    P̄^η(η|s) = Σ_{a∈A} Σ_{s_{+1}∈S} π(a|s) p_T(s_{+1}|s, a) ∫_{r∈R} P̄^η((η − r)/γ | s_{+1}) dP^R(r|s, a, s_{+1}).    (4)

These conditional return distributions are in principle evaluated by solving the distributional Bellman equations, just as the (ordinary) Bellman equation gives the Q-value function (i.e., the conditional expected return given a state-action pair) as its solution.
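To make the recursion concrete, the following Python sketch (our own illustration; the two-state chain, its rewards, and the discount rate are made-up assumptions, not taken from the paper) checks Eq. (4) on a chain whose return distribution is known in closed form: the empirical CDF of Monte Carlo returns from state 0 should match the right-hand side of Eq. (4) evaluated with the exact CDF at the successor state 1.

import random
from bisect import bisect_right

GAMMA = 0.9  # discount rate for this toy example only

# Hypothetical two-state episodic chain: leaving state 0 yields reward 0 or +2
# (probability 1/2 each) and leads to state 1; leaving state 1 yields -1 and terminates.
def cdf_s1(eta):
    """Exact return CDF at state 1: the return equals -1 with probability one."""
    return 1.0 if eta >= -1.0 else 0.0

def bellman_rhs_s0(eta):
    """Right-hand side of Eq. (4) at state 0 with the exact successor CDF:
    the sum over rewards r of Pr(r) * P((eta - r) / gamma | state 1)."""
    return 0.5 * cdf_s1((eta - 0.0) / GAMMA) + 0.5 * cdf_s1((eta - 2.0) / GAMMA)

def sampled_return_s0():
    """Monte Carlo rollout of the return from state 0: r_0 + gamma * (-1)."""
    return random.choice([0.0, 2.0]) + GAMMA * (-1.0)

samples = sorted(sampled_return_s0() for _ in range(100000))
for eta in (-1.0, 0.0, 1.0, 1.5):
    lhs = bisect_right(samples, eta) / len(samples)  # empirical return CDF at state 0
    print(eta, round(lhs, 3), round(bellman_rhs_s0(eta), 3))

The two columns printed for each threshold agree up to Monte Carlo error, which is exactly the consistency that Eq. (4) expresses.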
4. Return Distribution Approximation with Nonparametric Model

To approximately solve the distributional Bellman equations, we propose a nonparametric method based on particle smoothing (Doucet et al., 2001), called the return distribution particle smoother (RDPS), in which particles are used to approximate the return distributions. An eligibility trace technique for the RDPS is also discussed. Note that, in this section, we discuss an approximation only for the conditional return distribution given a state. However, these results are directly generalized and applied to the case of the conditional return distribution given a state-action pair.

4.1. General view of return distribution approximation with the distributional Bellman equation

A possible approach to approximating the return distribution would be to use a Monte Carlo sampling technique. However, this would require an enormous number of samples and have high computational costs, since each observation of the return involves multiple time delays. As an alternative, the distributional Bellman equation given in Section 3 describes the connection between the return distributions at neighboring time-steps and gives the return distributions as its solution. For simplicity, the right-hand side of Eq. (4) is denoted as ℬP̄^η with the distributional Bellman operator ℬ,

    ℬP̄^η(η|s) ≜ Σ_{a∈A} Σ_{s_{+1}∈S} π(a|s) p_T(s_{+1}|s, a) ∫_{r∈R} P̄^η((η − r)/γ | s_{+1}) dP^R(r|s, a, s_{+1}).

The distributional Bellman equations are then expressed as P̄^η = ℬP̄^η. When a probability function F(η|s) satisfies the distributional Bellman equations for all of the states, the function F is the solution of the equations and is equal to the conditional cumulative distribution function of the return given a state. However, it is hard to deal with the distributional Bellman equation in practice because of its numerous functional degrees of freedom. To address this problem, we approximate a solution in a nonparametric way using particles.
4.2. Return distribution particle smoother (RDPS) algorithm

We suppose the use of a look-up table for the states (or the state-action pairs). Each entry of the table has a number of particles³, v_s = {v_{s,1}, ..., v_{s,K}}, where the subscripts s and k of v_{s,k} denote the state and the identifier, respectively. Each particle v_{s,k} ∈ R represents a return from the state s. For simplicity, we also suppose that each state has the same number K of particles. The set V_K ∈ R^{|S|×K} is the complete set of particles {v_s}, where |S| denotes the number of states. The return distribution P̄^η(η|s) is approximated by the distribution of the particles v_s, i.e., an estimate of the cumulative probability distribution function of the return is defined as

    P̂^η(η|s; V_K) ≜ (1/K) Σ_{k=1}^{K} I(v_{s,k} ≤ η),    (5)

where I is the indicator function, equal to 1 if v_{s,k} ≤ η and otherwise 0.

³ However, our approach can be extended to continuous state spaces or feature vector spaces with a neighborhood or discretization approach, i.e., as long as particles around a state can be gathered, our approach will be applicable.

A key aspect of the proposed algorithm is to represent the recursion relation in the distributional Bellman equation by extending conventional temporal-difference learning to a particle smoothing approach, in which a simple procedure replaces particles associated with a state by those of the successor state. We call this approach the return distribution particle smoother (RDPS). Given a state s, one can generate (a, s_{+1}, r) according to the policy π, the state transition probability p_T, and the reward probability P^R. The quantity r + γη_{+1}, defined from a pair (r, η_{+1}) with η_{+1} sampled from P̄^η(η_{+1}|s_{+1}), can be regarded as following (or being drawn from) the distribution ℬP̄^η(η|s). Thus, with independent paired samples (r^{(1)}, η^{(1)}_{+1}), ..., (r^{(N)}, η^{(N)}_{+1}) given a state s, the following limit holds:

    lim_{N→∞} (1/N) Σ_{n=1}^{N} I(r^{(n)} + γη^{(n)}_{+1} ≤ η) = ℬP̄^η(η|s).

Therefore, to satisfy the distributional Bellman equation P̄^η = ℬP̄^η (Eq. (4)), the value r + γη_{+1} of a paired sample (r, η_{+1}) also has to be distributed according to the distribution P̄^η(η|s),

    r + γη_{+1} ~ P̄^η(η|s).

This result suggests that the approximated return distribution should be adjusted so that P̂^η(η|s) explains the samples (r, η_{+1}) appropriately. More specifically, in the particle case, (some of) the particles v_s need to explain or contain {r + γv_{s_{+1}}} = {(r + γv_{s_{+1},1}), ..., (r + γv_{s_{+1},K})} to some extent. One straightforward approach is to replace a particle v_{s,k}, randomly chosen from v_s, by a value r + γv_{s_{+1},l} defined by the observed reward and a particle of the successor state. This approach has the desirable property stated below in terms of the Kolmogorov–Smirnov statistic (distance) D_KS{P(x), Q(x)}, a measure of the difference between two cumulative distribution functions P(x) and Q(x) (Feller, 1948),

    D_KS{P(x), Q(x)} ≜ sup_x |P(x) − Q(x)|.
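As a small self-contained illustration of Eq. (5), the replacement rule, and the Kolmogorov–Smirnov distance (our own sketch; the particle values, the single observed reward, and the fixed successor state are made-up simplifications of one transition), the following Python snippet shows the particle CDF being pulled toward the target set {r + γ v_{s_{+1},l}}:

import random
from bisect import bisect_right

def empirical_cdf(particles, eta):
    """Eq. (5): the fraction of particles v_{s,k} that do not exceed eta."""
    return bisect_right(sorted(particles), eta) / len(particles)

def ks_distance(p_particles, q_particles):
    """Kolmogorov-Smirnov distance between two particle-based CDFs; for step
    functions the supremum is attained at one of the pooled particle values."""
    p_sorted, q_sorted = sorted(p_particles), sorted(q_particles)
    pooled = sorted(set(p_sorted) | set(q_sorted))
    return max(abs(bisect_right(p_sorted, x) / len(p_sorted)
                   - bisect_right(q_sorted, x) / len(q_sorted)) for x in pooled)

# Made-up illustration: K particles at state s (initialized to zero) and at its
# successor s+1, one observed reward r, and a fixed discount rate.
random.seed(0)
gamma, r, K = 0.98, -1.0, 200
v_s = [0.0] * K
v_next = [random.gauss(6.0, 2.0) for _ in range(K)]
target = [r + gamma * v for v in v_next]          # the set {r + gamma * v_{s+1,l}}

print("KS distance before replacements:", round(ks_distance(v_s, target), 3))
# Repeated random replacements (the rule described above, with r and s+1 held
# fixed for simplicity) pull the particle CDF toward the target CDF.
for _ in range(2000):
    v_s[random.randrange(K)] = r + gamma * v_next[random.randrange(K)]
print("KS distance after replacements: ", round(ks_distance(v_s, target), 3))
print("estimated Pr(return <= 0 | s):   ", empirical_cdf(v_s, 0.0))

The residual distance after many replacements behaves like the sampling error of K draws from the target, which is what the following proposition quantifies.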
Proposition 2  Let V_K = {v_{s,k}} be a complete set of particles and let the values of all particles except those of the state s be fixed. Also let D_KS(s, V_K) be the Kolmogorov–Smirnov statistic D_KS{P̂^η(η|s; V_K), ℬP̂^η(η|s; V_K)}, where the following replacement is iterated a sufficient number of times with the right-to-left substitution operator :=,

    v_{s,k} := r + γ v_{s_{+1},l},    (6)

where k and l are integers independently drawn from the uniform distribution U(1, K) generating an integer from 1 to K, and r and s_{+1} are drawn from the model distributions P^R and p_T, respectively. Then the asymptotic mean and variance of D_KS(s, V_K) are bounded in terms of the particle number K as

    E[ lim_{K→∞} √K D_KS(s, V_K) ] ≤ √(π/2) ln 2,
    V[ lim_{K→∞} √K D_KS(s, V_K) ] ≤ (1/12)(π² − 6π (ln 2)²),

where E[·] and V[·] denote the expectation and the variance, respectively.

Proof Sketch: It can be proved that, after a sufficient number of iterations of the replacement of Eq. (6), each particle value of the state s is a return drawn from ℬP̂^η(η|s; V_K). Therefore, D_KS(s, V_K) can be regarded as the Kolmogorov–Smirnov statistic between an empirical distribution of K samples independently drawn from ℬP̂^η(η|s; V_K) and the (hypothesized) distribution ℬP̂^η(η|s; V_K). With the results in Kolmogoroff (1941) and Feller (1948), the limit lim_{K→∞} Pr(D_KS(s, V_K) ≤ z/√K) = Φ(z) holds for z ≥ 0, and the limit is equal to zero for z < 0, where

    Φ(z) ≜ 1 − 2 Σ_{ℓ=1}^{∞} (−1)^{ℓ−1} exp(−2ℓ²z²).

The first and second order moments of z ~ Φ(z) are given as

    ∫_{z∈[0,∞)} z dΦ(z) = Σ_{ℓ=1}^{∞} ∫_{z∈[0,∞)} 8(−1)^{ℓ−1} ℓ² z² exp(−2ℓ²z²) dz = √(π/2) Σ_{ℓ=1}^{∞} (−1)^{ℓ−1}/ℓ = √(π/2) ln 2,
    ∫_{z∈[0,∞)} z² dΦ(z) = Σ_{ℓ=1}^{∞} (−1)^{ℓ−1}/ℓ² = π²/12,

so that the mean and the variance of Φ are √(π/2) ln 2 and π²/12 − (π/2)(ln 2)² = (1/12)(π² − 6π (ln 2)²), respectively. Thus, Proposition 2 is proved.

Proposition 2 indicates that the Kolmogorov–Smirnov statistic D_KS{P̂^η, ℬP̂^η} becomes smaller as the number of particles is increased. While the number of particles replaced at each iteration in Proposition 2 is assumed to be one, it is possible to accelerate the convergence of the particle learning by increasing the number of particles replaced in each iteration. Here we introduce a learning rate parameter α ∈ [0, 1], which defines the number of replaced particles N as N = ⌈αK⌉, where ⌈x⌉ ≜ min{n ∈ Z | n ≥ x}. The resulting online algorithm, termed RDPS, is described in Table 1.

Table 1. Online algorithm for nonparametrically approximating the conditional return distribution given a state.

Return Distribution Particle Smoother (RDPS) algorithm
Given
  · a policy: π(a|s)
  · a number of particles for each state: K
  · a particle updating rate: α ∈ [0, 1]
  · a discount rate for the return: γ ∈ [0, 1)
Set
  · initial values of all particles: V_K ∈ R^{|S|×K}
  · an initial state: s_0 ∈ {1, ..., |S|} (~ Pr(s_0))
For t = 0 to T do
  (Interaction with environment)
  · choose and execute action a_t ~ π(a|s_t)
  · observe the following state s_{t+1} and reward r_t
  (Update particles)
  For n = 1 to ⌈αK⌉ do
    · choose the index of the updated particle: p ~ U(1, K)
    · choose the index of the targeted particle: q ~ U(1, K)
    · update a particle of s_t: v_{s_t,p} := r_t + γ v_{s_{t+1},q}
  End
End

4.3. Eligibility trace technique for the RDPS

Here, we extend the one-step distributional Bellman equation (Eq. (4)) to the multi-step (t-step) version,

    P̄^η(η|s) = Σ_{a,s_{+1},...,a_{+t−1},s_{+t}} π(a|s) p_T(s_{+1}|s, a) ··· π(a_{+t−1}|s_{+t−1}) p_T(s_{+t}|s_{+t−1}, a_{+t−1}) ∫_{r,...,r_{+t−1}} P̄^η( (η − Σ_{k=0}^{t−1} γ^k r_{+k}) / γ^t | s_{+t} ) dP^R(r|s, a, s_{+1}) ··· dP^R(r_{+t−1}|s_{+t−1}, a_{+t−1}, s_{+t}) ≜ ℬ_t P̄^η(η|s).    (7)

Based on Eq. (7), we can use the particles of the various successor states, v_{s_{+1}}, ..., v_{s_{+t}}, to update the particles of the current state, v_s. We simply employ the eligibility trace technique of RL with an eligibility decay rate λ ∈ [0, 1) (Sutton & Barto, 1998) to use a set of multi-time-step particles instead of the one-time-step particles v_{s_{+1}}, where the target distribution for updating the particles v_s is

    lim_{T→∞} (1 − λ) Σ_{t=1}^{T} λ^{t−1} ℬ_t P̂^η(η | s_{+t}; V_K)

for a nonepisodic task, where ℬ_t P̂^η(η|s_{+t}; V_K) denotes the t-step backup of Eq. (7) evaluated along the observed trajectory, i.e., the cumulative distribution of Σ_{k=0}^{t−1} γ^k r_{+k} + γ^t v with v drawn from the particles v_{s_{+t}}. In the episodic task case, with λ ∈ [0, 1] and T the termination time, the target distribution is

    (1 − λ) Σ_{t=1}^{T−1} λ^{t−1} ℬ_t P̂^η(η|s_{+t}; V_K) + λ^{T−1} ℬ_T P̂^η(η|s_{+T}; V_K).

This can be implemented easily by carrying over some of the particles chosen for the update to the next successor state; the ratio of particles carried over is λ. This carrying-over process is repeated until there is no particle left to be carried over or, in the episodic task, an absorbing state is reached. The proposed RDPS algorithm can thus be regarded as an extension of temporal-difference learning with eligibility traces, TD(λ), to the approximation of the return distribution.
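A minimal Python transcription of Table 1 might look as follows (a sketch, not the authors' code: `env_step(s, a)` returning `(s_next, r)` and `policy(s)` returning an action are hypothetical interfaces, states are assumed to be integers 0, ..., n_states − 1, and the eligibility-trace carry-over of Section 4.3 as well as episode restarts are omitted).

import math
import random

def rdps(env_step, policy, n_states, K=100, alpha=0.1, gamma=0.98,
         s0=0, n_steps=10000, init_value=0.0):
    """One-step Return Distribution Particle Smoother, following Table 1.
    V[s] holds the K particles whose empirical CDF (Eq. (5)) approximates the
    return distribution at state s."""
    V = [[init_value] * K for _ in range(n_states)]
    n_replace = math.ceil(alpha * K)         # number of particles replaced per step
    s = s0
    for _ in range(n_steps):
        a = policy(s)                        # choose and execute a ~ pi(.|s)
        s_next, r = env_step(s, a)           # observe the successor state and reward
        for _ in range(n_replace):
            p = random.randrange(K)          # index of the particle to update
            q = random.randrange(K)          # index of the targeted successor particle
            V[s][p] = r + gamma * V[s_next][q]   # the replacement rule, Eq. (6)
        s = s_next
    return V

The return-distribution estimate at any state is then read from V[s] through Eq. (5); the state-action variant used in the next section is obtained by indexing the particle table with (s, a) pairs instead of states.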
5. Using the Approximated Return Distribution for Risk-Sensitive RL

Now that we have a means to approximately evaluate the return distributions, we can formulate RL algorithms with any criterion defined on the basis of the return distribution. As a representative example, in this section we develop an explicit formulation of a SARSA-learning-type approach with the conditional value at risk (CVaR) (Rockafellar & Uryasev, 2000; Kashima, 2007) as the evaluation criterion, using the particle-based approximation of the distributions. Since the upper-tail CVaR (CVaR+) of the return for the state-action value function is defined with c ∈ [0, 1] as

    Q^π_{CVaR+}(s, a; c) ≜ E_π[ ℰ | P^η(ℰ|s, a) ≥ 1 − c ],

a policy based on this criterion will take risks. In contrast, a policy based on the lower-tail CVaR,

    Q^π_{CVaR−}(s, a; c) ≜ E_π[ ℰ | lim_{ε→0−} P^η(ℰ + ε|s, a) ≤ c ],

will be risk averse. It is known that these risk-taking and risk-averse strategies lead to exploration and exploitation (robust) behaviors, respectively (Bagnell, 2004). From Eq. (5), an estimate of CVaR+ with the RDPS can easily be computed as, for c ∈ (0, 1],

    Q̂_{CVaR+}(s, a; c) = (1/(cK)) Σ_{k=1}^{K} v_{s,k} I( P̂^η(v_{s,k}|s, a) ≥ 1 − c ),

or Q̂_{CVaR+}(s, a; c = 0) = max_k {v_{s,k}}.

Note that, when c = 1, CVaR+ is equivalent to the expected value, Q^π_{CVaR+}(s, a; c = 1) = Q^π(s, a). Accordingly, if the parameter c_t of CVaR+ at time-step t is scheduled to increase monotonically to 1 over time, the initial policy in learning tends to take risks and thus drives exploration, while the final policy tries to maximize the ordinary value function. This scheduling of c will be useful if a big opportunity for improvement is hidden in a distant state. However, this type of agent will probably often be trapped in states that have heavy noise in the reward. It would be effective to combine the RDPS-with-CVaR approach with other exploration RL algorithms such as E³ or R-MAX (Kearns & Singh, 2002; Brafman & Tennenholtz, 2003).

One possible approach is to use the CVaR+ of the return for the behavior policy for efficient exploration, while the target policy (also called the estimation policy) maximizes the expected return. In this scenario, off-policy learning, in which importance sampling is often used to adjust the learning rate (Precup et al., 2000), is suitable. Our algorithm uses importance sampling to condition the number of updated particles, i.e., the learning rate α. The resulting algorithm, which we call the distributional SARSA with CVaR (dSARSA with CVaR) algorithm, provides a practical risk-sensitive RL algorithm.
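A sketch of how the CVaR+ criterion can be read off from the particles (our own illustration; the particle values are made up, and averaging the largest ⌈cK⌉ particles is a simplification of the estimator above that differs only in how particles exactly on the (1 − c)-quantile boundary are counted):

import math

def cvar_plus(particles, c):
    """Upper-tail CVaR estimate from K particles: the average of the largest
    ceil(c*K) particle values. c = 0 falls back to the maximum particle value
    and c = 1 gives the ordinary mean, matching the remarks in Section 5."""
    if c <= 0.0:
        return max(particles)
    n_top = min(len(particles), max(1, math.ceil(c * len(particles))))
    top = sorted(particles)[-n_top:]
    return sum(top) / len(top)

# Made-up example: two actions at the same state with similar means but different tails.
particles_a0 = [float(2 * (k % 2)) for k in range(100)]   # alternates 0 and 2, mean 1.0
particles_a1 = [1.0 + 0.001 * k for k in range(100)]      # tightly around 1, mean ~1.05
for c in (0.1, 1.0):
    print("c =", c,
          "  action 0:", round(cvar_plus(particles_a0, c), 2),
          "  action 1:", round(cvar_plus(particles_a1, c), 2))

With a small c the risk-taking criterion prefers action 0 (its upper tail is larger), whereas with c = 1 the ordinary expected value prefers action 1, which is the behavior exploited by the c_t schedule described above.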
6. Numerical Experiments

6.1. Task setting of the 14-state MDP

In this section, we investigate the performance of the new algorithms on a simple but very illustrative MDP. Figure 1 shows the setting of this MDP, which is based on (Sutton & Barto, 1998) and (Sutton et al., 2009). There are fourteen (normal) states {1, 2, ..., 14} and two special states {A, B}. Each normal state has two actions that cause left- and right-transitions to the neighboring states. The rewards for the left- and right-transitions were zero and −1, respectively, except for a transition into the right special state B, for which a large reward of +30 was used. For the episodic task, the special states are absorbing (terminal) states: an episode begins in state 5 and continues until an absorbing state is reached, at which point a new episode restarts. In the nonepisodic case, both special states are identified with the start state, i.e., upon arrival at A or B the agent is teleported back to state 5, and the run continues until the simulation is terminated.

Figure 1. Task setting: there are fourteen (normal) states and two special states A and B. State 5 is the start state. The values near the arrows are rewards.

6.2. Return distribution approximation

Here, we used a fixed policy that chooses the left- or right-transition action at random with equal probability, i.e., π(a|s) = 0.5 for all a and s. This type of MDP can be regarded as a random-walk problem, which is a useful RL benchmark for assessing the practical utility of RL algorithms for value function regression and has often been used (Sutton & Barto, 1998; Sutton et al., 2009). We confirmed that the return distribution approximator of the RDPS with λ ≤ 0.95 worked well for both the episodic and nonepisodic tasks, as shown in Figures 2 and 3. The effect of the eligibility trace is very similar to that in the TD(λ) algorithm of (Sutton & Barto, 1998). Therefore, our proposed RDPS algorithm can be regarded as a natural extension of TD(λ) learning from the expected return to the return distribution.

Figure 2. Evaluation of the RDPS algorithm with 1,000 particles compared with the Monte Carlo estimation in the episodic task: (A) average Kolmogorov–Smirnov (KS) statistics for state s = 10 at 500 time-steps over 500 independent simulation runs, plotted against the learning rate α for various λ; (B) and (C) typical RDPS and Monte Carlo estimates at 250, 500, and 1,000 time-steps together with the true probability density, where the particles or the instances of the returns are converted to probability densities with a normal kernel density estimator.

Figure 3. Average Kolmogorov–Smirnov statistics for state s = 10 in the nonepisodic task at 500 time-steps over 500 independent simulation runs, plotted against the learning rate α for various λ, with the Monte Carlo estimate as a baseline.

6.3. Policy learning

Here, we assess the performance of the (off-policy) dSARSA-with-CVaR algorithm. We also used the (normal) dSARSA and LSTD-SARSA⁴ as baseline algorithms, whose Q-value functions are estimated by the RDPS and LSTDQ(λ) (Lagoudakis & Parr, 2003), respectively. The ε-greedy selection method was used for the policy (Sutton & Barto, 1998), where the action with the highest value is selected with probability ε + (1 − ε)/2 and the other action is selected with probability (1 − ε)/2. The ε and the discount rate γ were set to 0.1 and 0.98, respectively. The c parameter of the RDPS with CVaR depended on the time t as c_t = min{t/3000, 1}. Figure 4 shows the average performances in the episodic task over 300 independent simulation runs. This result indicates that the dSARSA-with-CVaR algorithm, as a risk-sensitive approach based on the RDPS algorithm, can handle risk as discussed in Section 5.

⁴ Since it is better for all of the tested algorithms to share the same setting (ε-greedy policy), we apply LSTD-SARSA instead of LSPI (Lagoudakis & Parr, 2003).

Figure 4. Average performances (expected return from the start state versus time step) of dSARSA with CVaR, dSARSA, and LSTD-SARSA in the episodic task over 300 independent simulation runs; error bars represent the standard deviation. (A) Deterministic reward setting. (B) Stochastic reward setting, where the reward of the transition into the absorbing state A is drawn from the normal distribution N(µ = 0, σ = 10) and all other rewards are the same as in the deterministic setting.
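For reference, a small Python sketch of the experimental chain (our own reading of Section 6.1 and Figure 1; the step and rollout functions, the exact reward conventions at the boundaries, and the discount rate, borrowed from Section 6.3, are assumptions) estimates the return distribution at state 10 by plain Monte Carlo rollouts, which is the baseline the RDPS is compared against in Figure 2:

import random

GAMMA = 0.98          # discount rate; 0.98 is the value used in Section 6.3
N_STATES = 14         # normal states 1..14; A (left of 1) and B (right of 14) absorb

def step(s, a):
    """One transition of the 14-state chain: a = -1 moves left (reward 0),
    a = +1 moves right (reward -1), except that entering B pays +30."""
    s_next = s + a
    if a == -1:
        r = 0.0
    else:
        r = 30.0 if s_next == N_STATES + 1 else -1.0
    return s_next, r

def rollout_return(s=5):
    """Monte Carlo sample of the discounted return from state s under the
    uniformly random policy pi(a|s) = 0.5, run until absorption."""
    g, discount = 0.0, 1.0
    while 1 <= s <= N_STATES:
        a = random.choice([-1, 1])
        s, r = step(s, a)
        g += discount * r
        discount *= GAMMA
    return g

# Empirical return distribution at state 10 (the state examined in Figure 2)
samples = sorted(rollout_return(10) for _ in range(5000))
print("mean return from state 10:", round(sum(samples) / len(samples), 2))
print("5th / 50th / 95th percentiles:",
      [round(samples[int(p * len(samples))], 2) for p in (0.05, 0.5, 0.95)])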
7. Conclusion

We proposed an approach for approximating the distribution of returns, which allows us to handle various criteria, including risk-sensitive ones, in a unified manner. Bellman-type recursive formulas for the conditional cumulative probability distributions of the returns were derived, and a nonparametric method for approximating the return distributions with particles was proposed. We also presented a risk-sensitive SARSA algorithm utilizing CVaR to explore effectively. The numerical experiments on simple MDPs indicated that the proposed algorithms are promising for return distribution approximation and also in the risk-sensitive RL scenario. Analyses of the proposed algorithms, especially of their convergence, and empirical studies on more challenging domains will be necessary to understand more deeply the properties and efficiency of our approach to nonparametrically approximating the return distributions.

Acknowledgments

This work was supported by the PREDICT program of the Ministry of Internal Affairs and Communications of Japan and also by the FIRST Program.

References

Bagnell, J. A. Learning Decisions: Robustness, Uncertainty, and Approximation. PhD thesis, Robotics Institute, Carnegie Mellon University, 2004.
Bertsekas, D. P. Dynamic Programming and Optimal Control, Volumes 1 and 2. Athena Scientific, 1995.
Brafman, R. I. and Tennenholtz, M. R-max -- a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2003.
Dearden, R., Friedman, N., and Russell, S. Bayesian Q-learning. In National Conference on Artificial Intelligence, pp. 761-768, 1998.
Defourny, B., Ernst, D., and Wehenkel, L. Risk-aware decision making and dynamic programming. In NIPS 2008 Workshop on Model Uncertainty and Risk in RL, 2008.
Doucet, A., de Freitas, N., and Gordon, N. Sequential Monte Carlo Methods in Practice. Springer, 2001.
Engel, Y., Mannor, S., and Meir, R. Reinforcement learning with Gaussian processes. In International Conference on Machine Learning, pp. 201-208, 2005.
Feller, W. On the Kolmogorov-Smirnov limit theorems for empirical distributions. The Annals of Mathematical Statistics, 19(2):177-189, 1948.
Geibel, P. and Wysotzki, F. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24:81-108, 2005.
Ghavamzadeh, M. and Engel, Y. Bayesian actor-critic algorithms. In International Conference on Machine Learning, pp. 297-304, 2007.
Kashima, H. Risk-sensitive learning via minimization of empirical conditional value-at-risk. IEICE Transactions on Information and Systems, E90-D(12):2043-2052, 2007.
Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209-232, 2002.
Kolmogoroff, A. Confidence limits for an unknown distribution function. The Annals of Mathematical Statistics, 12(4):461-463, 1941.
Lagoudakis, M. G. and Parr, R. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003.
Precup, D., Sutton, R. S., and Singh, S. Eligibility traces for off-policy policy evaluation. In International Conference on Machine Learning, 2000.
Rockafellar, R. T. and Uryasev, S. Optimization of conditional value-at-risk. Journal of Risk, 2(3):21-41, 2000.
Sato, M., Kimura, H., and Kobayashi, S. TD algorithm for the variance of return and mean-variance reinforcement learning. The IEICE Transactions on Information and Systems (Japanese Edition), 16(3):353-362, 2001.
Sugiyama, M., Hachiya, H., Kashima, H., and Morimura, T. Least absolute policy iteration for robust value function approximation. In IEEE International Conference on Robotics and Automation, pp. 699-704, 2009.
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 1998.
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning, pp. 993-1000, 2009.