Incremental Natural Actor-Critic Algorithms

Abstract

We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces, and the use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the policy gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and extend prior empirical studies of natural actor-critic methods by Peters et al. by providing the first convergence proofs and the first fully incremental algorithms.

1 Introduction

Actor-critic (AC) algorithms are based on the simultaneous online estimation of the parameters of two structures, called the actor and the critic. The actor corresponds to a conventional action-selection policy, mapping states to actions in a probabilistic manner. The critic corresponds to a conventional value function, mapping states to expected cumulative future reward. Thus, the critic addresses a problem of prediction, whereas the actor is concerned with control. These problems are separable, but are solved simultaneously to find an optimal policy, as in policy iteration. A variety of methods can be used to solve the prediction problem, but the ones that have proved most effective in large applications are those based on some form of temporal difference (TD) learning (Sutton, 1988), in which estimates are updated on the basis of other estimates, as they are in dynamic programming. Such bootstrapping methods can be viewed as a way of accelerating learning by trading bias for variance.

AC methods were among the earliest to be investigated in reinforcement learning (Barto et al., 1983; Sutton, 1984; Williams, 1992). They were largely supplanted in the 1990s by methods that estimated state-action value functions and used them directly to select actions, without an explicit policy structure. This approach was appealing because of its simplicity, but when combined with function approximation it was found to have many theoretical difficulties, including, in some cases, a failure to converge. These problems led to renewed interest in methods with an explicit representation of the policy, which came to be known as policy gradient methods (Marbach, 1998; Sutton et al., 2000; Konda & Tsitsiklis, 2000; Baxter & Bartlett, 2001). Policy gradient methods without bootstrapping can be easily proved convergent, but suffer from slow convergence caused by the high variance of their gradient estimates. Combining them with bootstrapping is a promising avenue toward a more effective method. Another approach to speeding up policy gradient algorithms was proposed by Kakade (2002) and then refined and extended by Bagnell and Schneider (2003) and by Peters et al. (2005).
Their idea was to replace the policy gradient estimate with an estimate of the so-called natural policy gradient. This is motivated by the requirement that a change in the way the policy is parameterized should not influence the result of the policy update. In terms of the policy update rule, the move to the natural gradient amounts to linearly transforming the gradient using the inverse Fisher information matrix of the policy.

In this paper, we introduce four new AC algorithms, three of which incorporate natural gradients. All the algorithms are for the average reward setting and use function approximation in the value function. For all four methods we prove convergence of the parameters of the policy and state value function to a local maximum of a performance function that corresponds to the average reward plus a measure of the TD-error inherent in the function approximation. Due to space limitations, we do not present the convergence analysis of our algorithms here; it can be found in a supporting document submitted along with this paper. Some empirical results using our algorithms can also be found in the supporting document. Our main observations from those experiments are that all our algorithms perform better than the algorithm of Konda and Tsitsiklis (2000), with our Algorithm 3 showing the best results overall. Our results extend prior AC methods, especially those of Konda and Tsitsiklis (2000) and of Peters et al. (2005). We discuss these relationships in detail in Section 6. Our analysis does not cover the use of eligibility traces, but we believe the extension to this case would be straightforward.

2 The Policy Gradient Framework

We consider the standard reinforcement learning framework (e.g., see Sutton & Barto, 1998), in which a learning agent interacts with a stochastic environment and this interaction is modeled as a discrete-time Markov decision process. The state, action, and reward at each time $t \in \{0, 1, 2, \ldots\}$ are denoted $s_t \in S$, $a_t \in A$, and $r_t \in \mathbb{R}$, respectively. We assume the reward is random, real-valued, and uniformly bounded. The environment's dynamics are characterized by state transition probabilities $p(s'|s,a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ and single-stage expected rewards $r(s,a) = E[r_{t+1} \mid s_t = s, a_t = a]$, for all $s, s' \in S$ and $a \in A$. The agent selects an action at each time $t$ using a randomized stationary policy $\pi(a|s) = \Pr(a_t = a \mid s_t = s)$. Given a fixed policy, the sequence of state-action pairs is generated by the Markov chain induced by that policy. We assume

(B1) The Markov chain induced by any policy is irreducible and aperiodic.

The long-term average reward per step under policy $\pi$ is defined as
$$J(\pi) = \lim_{T\to\infty} \frac{1}{T}\, E\Big[\sum_{t=0}^{T-1} r_{t+1} \,\Big|\, \pi\Big] = \sum_{s\in S} d^\pi(s) \sum_{a\in A} \pi(a|s)\, r(s,a),$$
where $d^\pi(s)$ is the stationary distribution of state $s$ under policy $\pi$. The limit here is well defined under (B1).

In policy gradient methods, we define a class of parameterized stochastic policies $\{\pi(\cdot|s;\theta) : s \in S\}$, parameterized by a vector $\theta$, estimate the gradient of the average reward with respect to the policy parameters from the observed states, actions, and rewards, and then improve the policy by adjusting its parameters in the direction of the gradient. Since in this setting a policy is represented by its parameters $\theta$, policy-dependent functions such as $J(\pi)$, $d^\pi(\cdot)$, $V^\pi(\cdot)$, and $Q^\pi(\cdot,\cdot)$ can be written as $J(\theta)$, $d(\cdot;\theta)$, $V(\cdot;\theta)$, and $Q(\cdot,\cdot;\theta)$, respectively. We assume

(B2) For any state-action pair $(s,a)$, the policy $\pi(a|s;\theta)$ is continuously differentiable in the parameters $\theta$.

Our aim is to find a policy that maximizes the average reward, i.e., $\theta^* = \arg\max_\theta J(\theta)$.
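To make the parameterized-policy setting concrete, here is a minimal sketch (not taken from the paper) of one common choice: a Gibbs (softmax) policy $\pi(a|s;\theta) \propto \exp(\theta^\top \phi(s,a))$ over a finite action set. The state-action feature map `phi` and the finite action set are assumptions made purely for illustration; any other differentiable parameterization satisfying (B2) would serve equally well.

```python
import numpy as np

# Illustrative sketch (not from the paper): a Gibbs (softmax) policy over
# state-action features phi(s, a), with pi(a|s; theta) proportional to
# exp(theta^T phi(s, a)). The feature map and the finite action set are
# assumptions made here only to ground the notation.

def gibbs_policy(theta, phi, state, n_actions):
    """Return the action distribution pi(.|state; theta)."""
    prefs = np.array([theta @ phi(state, a) for a in range(n_actions)])
    prefs -= prefs.max()                      # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def sample_action(rng, theta, phi, state, n_actions):
    """Draw a_t ~ pi(.|s_t; theta_t)."""
    return rng.choice(n_actions, p=gibbs_policy(theta, phi, state, n_actions))
```

A typical usage would be `a = sample_action(np.random.default_rng(0), theta, phi, s, n_actions)` inside a simulation loop.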
In the average reward formulation, a policy $\pi$ is assessed according to the expected differential reward associated with states $s$ or state-action pairs $(s,a)$. For all states $s \in S$ and actions $a \in A$, the differential state-action value function and the differential state value function under policy $\pi$ are defined as
$$Q^\pi(s,a) = \sum_{t=0}^{\infty} E\big[r_{t+1} - J(\pi) \mid s_0 = s, a_0 = a, \pi\big], \qquad V^\pi(s) = \sum_{a\in A} \pi(a|s)\, Q^\pi(s,a). \quad (1)$$
Previous works (Marbach, 1998; Sutton et al., 2000; Baxter & Bartlett, 2001) have shown that the gradient of the average reward for parameterized policies that satisfy (B1) and (B2) is given by
$$\nabla J(\theta) = \sum_{s\in S} d^\pi(s) \sum_{a\in A} \nabla\pi(a|s)\, Q^\pi(s,a). \quad (2)$$

Footnote 1: From now on in the paper, we use the terms state value function and state-action value function instead of differential state value function and differential state-action value function.
Footnote 2: Throughout the paper, we use the notation $\nabla$ to denote the gradient with respect to the policy parameters $\theta$.

Observe that if $b(s)$ is any given function of $s$ (also called a baseline), then
$$\sum_{s\in S} d^\pi(s) \sum_{a\in A} \nabla\pi(a|s)\, b(s) = \sum_{s\in S} d^\pi(s)\, b(s)\, \nabla\Big(\sum_{a\in A}\pi(a|s)\Big) = \sum_{s\in S} d^\pi(s)\, b(s)\, \nabla(1) = 0,$$
and thus, for any baseline $b(s)$, the gradient of the average reward can also be written as
$$\nabla J(\theta) = \sum_{s\in S} d^\pi(s) \sum_{a\in A} \nabla\pi(a|s)\big(Q^\pi(s,a) \pm b(s)\big). \quad (3)$$
The baseline can be chosen in a way that the variance of the gradient estimates is minimized (Greensmith et al., 2004).

The natural gradient, denoted $\tilde{\nabla} J(\theta)$, can be calculated by linearly transforming the regular gradient, $\nabla J(\theta)$, using the inverse Fisher information matrix of the policy: $\tilde{\nabla} J(\theta) = G^{-1}(\theta)\,\nabla J(\theta)$. The Fisher information matrix $G(\theta)$ is positive definite and symmetric, and is given by
$$G(\theta) = E_{s\sim d^\pi,\, a\sim\pi}\big[\nabla\log\pi(a|s)\, \nabla\log\pi(a|s)^\top\big]. \quad (4)$$

3 Policy Gradient with Approximation

Now consider the case in which the state-action value function for a fixed policy $\pi$, $Q^\pi$, is approximated by a learned function approximator. If the approximation is sufficiently good, we might hope to use it in place of $Q^\pi$ in Eqs. 2 and 3, and still point roughly in the direction of the true gradient. Sutton et al. (2000) showed that if the approximation $\hat{Q}_w$ with parameters $w$ is compatible, i.e., $\nabla_w \hat{Q}_w(s,a) = \nabla\log\pi(a|s)$, and minimizes the mean squared error
$$E^\pi(w) = \sum_{s\in S} d^\pi(s) \sum_{a\in A} \pi(a|s)\big[Q^\pi(s,a) - \hat{Q}_w(s,a)\big]^2 \quad (5)$$
for parameter value $w^*$, then we can replace $Q^\pi$ with $\hat{Q}_{w^*}$ in Eqs. 2 and 3. Thus, we work with the linear approximation $\hat{Q}_w(s,a) = w^\top\psi(s,a)$, in which the $\psi(s,a)$'s are compatible features defined according to $\psi(s,a) = \nabla\log\pi(a|s)$. Note that compatible features are well defined under (B2). The Fisher information matrix of Eq. 4 can be written using the compatible features as
$$G(\theta) = E_{s\sim d^\pi,\, a\sim\pi}\big[\psi(s,a)\,\psi(s,a)^\top\big]. \quad (6)$$

Suppose $E^\pi(w)$ denotes the mean squared error
$$E^\pi(w) = \sum_{s\in S} d^\pi(s) \sum_{a\in A} \pi(a|s)\big[Q^\pi(s,a) - w^\top\psi(s,a) - b(s)\big]^2 \quad (7)$$
of our compatible linear parameterized approximation $w^\top\psi(s,a)$ and an arbitrary baseline $b(s)$. Let $w^* = \arg\min_w E^\pi(w)$ denote the optimal parameter. We first show in Lemma 1 that the value of $w^*$ does not depend on the given baseline $b(s)$; as a result, the mean squared error problems of Eqs. 5 and 7 have the same solutions. Next, in Lemma 2, we show that if the parameter is set to be equal to $w^*$, the resulting mean squared error $E^\pi(w^*)$ (now treated as a function of the baseline $b(s)$) is further minimized when $b(s) = V^\pi(s)$. In other words, the variance in the state-action value function estimator is minimized if the baseline is chosen to be the value function itself.
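Continuing the illustrative softmax policy sketched in Section 2 (an assumption of these sketches, not of the paper), its compatible features have a closed form: $\psi(s,a) = \nabla_\theta\log\pi(a|s;\theta) = \phi(s,a) - \sum_b \pi(b|s;\theta)\,\phi(s,b)$. The sketch below shows this computation together with the compatible linear approximation $\hat{Q}_w(s,a) = w^\top\psi(s,a)$; it reuses the hypothetical `gibbs_policy` and `phi` from the earlier sketch.

```python
import numpy as np

# Sketch under the assumed Gibbs policy class: compatible features
# psi(s, a) = grad_theta log pi(a|s; theta) = phi(s, a) - E_{b~pi}[phi(s, b)],
# and the compatible linear approximation Q_hat_w(s, a) = w^T psi(s, a)
# of Eqs. 5-6. Not part of the paper itself.

def compatible_features(theta, phi, state, action, n_actions):
    """psi(s, a) for the illustrative softmax policy."""
    probs = gibbs_policy(theta, phi, state, n_actions)   # from the earlier sketch
    expected_phi = sum(probs[b] * phi(state, b) for b in range(n_actions))
    return phi(state, action) - expected_phi

def q_hat(w, theta, phi, state, action, n_actions):
    """Compatible linear approximation Q_hat_w(s, a) = w^T psi(s, a)."""
    return w @ compatible_features(theta, phi, state, action, n_actions)
```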
Lemma 1 For any given $\theta$ (policy $\pi$), the optimum weight parameter $w^*$ satisfies
$$w^* = G(\theta)^{-1}\, E_{s\sim d^\pi,\, a\sim\pi}\big[Q^\pi(s,a)\,\psi(s,a)\big].$$
(This lemma is similar to Kakade's (2002) Theorem 1.)

Proof: Note that
$$\nabla_w E^\pi(w) = -2\sum_{s\in S} d^\pi(s)\sum_{a\in A}\pi(a|s)\big[Q^\pi(s,a) - w^\top\psi(s,a) - b(s)\big]\psi(s,a). \quad (8)$$
Equating the above to zero, one obtains
$$\sum_{s\in S} d^\pi(s)\sum_{a\in A}\pi(a|s)\,\psi(s,a)\psi(s,a)^\top\, w^* = \sum_{s\in S} d^\pi(s)\sum_{a\in A}\pi(a|s)\,Q^\pi(s,a)\,\psi(s,a) - \sum_{s\in S} d^\pi(s)\sum_{a\in A}\pi(a|s)\,b(s)\,\psi(s,a).$$
The last term on the right-hand side equals zero since $\sum_{a\in A}\pi(a|s)\psi(s,a) = \sum_{a\in A}\nabla\pi(a|s) = 0$ for any state $s$. Now from Eq. 8, the Hessian $\nabla^2_w E^\pi(w)$ evaluated at $w^*$ can be seen to be $2G(\theta)$. The claim follows since $G(\theta)$ is positive definite for any $\theta$.

Next, given the optimum weight parameter $w^*$, we obtain the minimum variance baseline in the state-action value function estimator corresponding to policy $\pi$. Thus, we now consider $E^\pi(w^*)$ and obtain $b^*(s) = \arg\min_{b=(b(s),\, s\in S)} E^\pi(w^*)$.

Lemma 2 For any given policy $\pi$, the minimum variance baseline $b^*(s)$ in the state-action value function estimator corresponds to the value function $V^\pi(s)$.

Proof: For any $s \in S$, let $E^{\pi,s}(w^*) = \sum_{a\in A}\pi(a|s)\big[Q^\pi(s,a) - w^{*\top}\psi(s,a) - b(s)\big]^2$. Then $E^\pi(w^*) = \sum_{s\in S} d^\pi(s)\, E^{\pi,s}(w^*)$. Note that by (B1), the Markov chain corresponding to any policy is positive recurrent since the number of states is finite. Hence, $d^\pi(s) > 0$ for all $s\in S$. Thus, one needs to find the optimal $b(s)$ in $E^{\pi,s}(w^*)$ for each $s\in S$. Now,
$$\frac{\partial}{\partial b(s)}\, E^{\pi,s}(w^*) = -2\sum_{a\in A}\pi(a|s)\big[Q^\pi(s,a) - w^{*\top}\psi(s,a) - b(s)\big].$$
Equating the above to zero, we obtain
$$b^*(s) = \sum_{a\in A}\pi(a|s)\,Q^\pi(s,a) - \sum_{a\in A}\pi(a|s)\, w^{*\top}\psi(s,a).$$
The rightmost term equals zero since $\sum_{a\in A}\pi(a|s)\psi(s,a) = 0$. Hence $b^*(s) = \sum_{a\in A}\pi(a|s)\,Q^\pi(s,a) = V^\pi(s)$. The second derivative of $E^{\pi,s}(w^*)$ with respect to $b(s)$ equals 2. The claim follows.

From Lemmas 1 and 2, $w^{*\top}\psi(s,a)$ is a least-squared optimal parametric representation for the advantage function $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ as well as for the state-action value function $Q^\pi(s,a)$. However, since $E_{a\sim\pi}[w^{*\top}\psi(s,a)] = \sum_{a\in A}\pi(a|s)\, w^{*\top}\psi(s,a) = 0$ for all $s \in S$, it is better to think of $w^{*\top}\psi(s,a)$ as an approximation of the advantage function $A^\pi(s,a)$ rather than of the state-action value function $Q^\pi(s,a)$.

The TD-error $\delta_t$ is a random quantity that is defined according to $\delta_t = r_{t+1} - \hat{J}_{t+1} + \hat{V}(s_{t+1}) - \hat{V}(s_t)$, where $\hat{V}$ and $\hat{J}$ are consistent estimates of the state value function and the average reward, respectively. Thus, these estimates satisfy $E[\hat{V}(s_t)\mid s_t, \pi] = V^\pi(s_t)$ and $E[\hat{J}_{t+1}\mid s_t, \pi] = J(\pi)$, for any $t \geq 0$. The next lemma shows that $\delta_t$ is a consistent estimate of the advantage function $A^\pi$.

Lemma 3 Under a given policy $\pi$, we have $E[\delta_t \mid s_t, a_t, \pi] = A^\pi(s_t, a_t)$.

Proof: Note that
$$E[\delta_t \mid s_t, a_t, \pi] = E\big[r_{t+1} - \hat{J}_{t+1} + \hat{V}(s_{t+1}) - \hat{V}(s_t) \mid s_t, a_t, \pi\big] = r(s_t,a_t) - J(\pi) + E\big[\hat{V}(s_{t+1}) \mid s_t, a_t, \pi\big] - V^\pi(s_t).$$
Now,
$$E\big[\hat{V}(s_{t+1}) \mid s_t, a_t, \pi\big] = E\Big[E\big[\hat{V}(s_{t+1}) \mid s_{t+1}, \pi\big] \,\Big|\, s_t, a_t, \pi\Big] = E\big[V^\pi(s_{t+1}) \mid s_t, a_t\big] = \sum_{s_{t+1}\in S} p(s_{t+1}\mid s_t, a_t)\, V^\pi(s_{t+1}).$$
Also, $r(s_t,a_t) - J(\pi) + \sum_{s_{t+1}\in S} p(s_{t+1}\mid s_t,a_t)\, V^\pi(s_{t+1}) = Q^\pi(s_t,a_t)$. The claim follows.

By setting the baseline $b(s)$ equal to the value function $V^\pi(s)$, Eq. 3 can be written as $\nabla J(\theta) = \sum_{s\in S} d^\pi(s)\sum_{a\in A}\pi(a|s)\,\psi(s,a)\,A^\pi(s,a)$. From Lemma 3, $\delta_t$ is a consistent estimate of the advantage function $A^\pi(s,a)$; thus, $\widehat{\nabla J}(\theta) = \delta_t\,\psi(s_t,a_t)$ is a consistent estimate of $\nabla J(\theta)$. However, calculating $\delta_t$ requires having estimates of the average reward $\hat{J}$ and the value function $\hat{V}$. While an average reward estimate is simple enough to obtain given the single-stage reward function, the same is not necessarily true for the value function. We use function approximation for the value function as well. Suppose $f(s)$ is a feature vector for state $s$. One may then approximate $V^\pi(s)$ with $v^\top f(s)$, where $v$ is a parameter vector that can be tuned (for a fixed policy $\pi$) using a TD algorithm. In our algorithms, we use $\delta_t = r_{t+1} - \hat{J}_{t+1} + v_t^\top f(s_{t+1}) - v_t^\top f(s_t)$ as an estimate for the TD-error, where $v_t$ corresponds to the $t$-th update of the value function parameter.
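As a small illustration of the estimators just defined, the sketch below computes the TD-error $\delta_t$ under a linear value approximation $v^\top f(s)$ and forms the sample $\delta_t\,\psi(s_t,a_t)$. The state-feature map `f`, the averaging step size `xi`, and the function names are assumptions introduced here; the average-reward update mirrors the algorithm template of Table 1 below.

```python
import numpy as np

# A minimal sketch, assuming a linear value approximation V(s) ~ v^T f(s)
# with a user-supplied state-feature map f, and a running average-reward
# estimate J_hat maintained by exponential averaging (cf. Table 1).

def update_avg_reward(J_hat, r_next, xi):
    """J_hat_{t+1} = (1 - xi_t) J_hat_t + xi_t r_{t+1}."""
    return (1.0 - xi) * J_hat + xi * r_next

def td_error(r_next, J_hat, v, f, s, s_next):
    """delta_t = r_{t+1} - J_hat_{t+1} + v_t^T f(s_{t+1}) - v_t^T f(s_t)."""
    return r_next - J_hat + v @ f(s_next) - v @ f(s)

def policy_gradient_sample(delta, psi_sa):
    """delta_t * psi(s_t, a_t): a sample-based estimate of grad J(theta),
    up to the function-approximation bias discussed in Lemma 4 below."""
    return delta * psi_sa
```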
Let $\bar{V}^\pi(s) = \sum_{a\in A}\pi(a|s)\big[r(s,a) - J(\pi) + \sum_{s'\in S} p(s'|s,a)\, v^{\pi\top} f(s')\big]$, where $v^{\pi\top} f(s')$ is the estimate of the value function $V^\pi(s')$ that is obtained upon convergence, viz., $\lim_{t\to\infty} v_t = v^\pi$ with probability one. Let also $\delta_t^\pi = r_{t+1} - \hat{J}_{t+1} + v^{\pi\top} f(s_{t+1}) - v^{\pi\top} f(s_t)$, where $\delta_t^\pi$ corresponds to a stationary estimate of the TD-error with function approximation under policy $\pi$. We have the following analog of Sutton et al.'s (2000) Theorem 1. Note that $\theta$ is the policy parameter vector corresponding to policy $\pi$.

Lemma 4 $E\big[\delta_t^\pi\,\psi(s_t,a_t)\mid\theta\big] = \nabla J(\theta) + \sum_{s\in S} d^\pi(s)\big[\nabla\bar{V}^\pi(s) - \nabla\big(v^{\pi\top} f(s)\big)\big]$.

The proof of this lemma can be found in the supporting document submitted with the paper. Note that according to Theorem 1 of Sutton et al. (2000), $E[\delta_t\,\psi(s_t,a_t)\mid\theta] = \nabla J(\theta)$, provided $\delta_t$ is defined as $\delta_t = r_{t+1} - \hat{J}_{t+1} + \hat{V}(s_{t+1}) - \hat{V}(s_t)$ (as was considered in Lemma 3). For the case with function approximation that we study, Lemma 4 shows that the quantity $\sum_{s\in S} d^\pi(s)\big[\nabla\bar{V}^\pi(s) - \nabla\big(v^{\pi\top} f(s)\big)\big]$ may be viewed as the error or bias in the estimate of the gradient of average reward that results from the use of function approximation.

4 Actor-Critic Algorithms

We present four new AC algorithms in this section. These algorithms are in the general form shown in Table 1. They update the policy parameters along the direction of the average reward gradient. While estimates of the regular gradient are used for this purpose in Algorithm 1, natural gradient estimates are used in Algorithms 2-4. While the critic updates in our algorithms can be easily extended to the case of TD($\lambda$), $\lambda > 0$, we restrict our attention to the case $\lambda = 0$. In addition to assumptions (B1) and (B2), we make the following assumption:

(B3) The step-size schedules for the critic, $\{\alpha_t\}$, and the actor, $\{\beta_t\}$, satisfy
$$\sum_t \alpha_t = \sum_t \beta_t = \infty, \qquad \sum_t \alpha_t^2,\ \sum_t \beta_t^2 < \infty, \qquad \beta_t = o(\alpha_t). \quad (9)$$

As a consequence of Eq. 9, $\beta_t \to 0$ faster than $\alpha_t$. Hence the critic is a faster recursion than the actor: beyond some $t_0$, the critic has uniformly higher increments than the actor for $t \geq t_0$, and thus it converges faster than the actor.

Table 1: A Template for Incremental AC Algorithms.
1: Input:
   - Randomized parameterized policy $\pi(\cdot|\cdot;\theta)$,
   - Value function feature vector $f(s)$.
2: Initialization:
   - Policy parameters $\theta = \theta_0$,
   - Value function weight vector $v = v_0$,
   - Step sizes $\alpha = \alpha_0$, $\beta = \beta_0$, $\xi = c\alpha_0$.
3: for $t = 0, 1, 2, \ldots$ do
4:   Execution:
     - Draw action $a_t \sim \pi(a_t|s_t;\theta_t)$,
     - Observe next state $s_{t+1} \sim p(s_{t+1}|s_t,a_t)$,
     - Observe reward $r_{t+1}$.
5:   Average Reward Update: $\hat{J}_{t+1} = (1-\xi_t)\hat{J}_t + \xi_t r_{t+1}$
6:   TD-error: $\delta_t = r_{t+1} - \hat{J}_{t+1} + v_t^\top f(s_{t+1}) - v_t^\top f(s_t)$
7:   Critic Update: algorithm specific (see the text)
8:   Actor Update: algorithm specific (see the text)
9: endfor
10: return Policy and value function parameters $\theta$, $v$

We now present the critic and the actor updates of our four AC algorithms.

Algorithm 1 (Regular-Gradient AC):
Critic Update: $v_{t+1} = v_t + \alpha_t\,\delta_t\, f(s_t)$,
Actor Update: $\theta_{t+1} = \theta_t + \beta_t\,\delta_t\,\psi(s_t,a_t)$.

This is the only AC algorithm presented in the paper that is based on the regular gradient estimate. This algorithm stores two parameter vectors, $\theta$ and $v$, and its per-time-step computational cost is linear in the number of policy and value function parameters.
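The following sketch puts the template of Table 1 together with the Algorithm 1 updates for a single time step. It reuses the hypothetical `sample_action` and `compatible_features` helpers from the earlier sketches; the environment interface `env.step` and the constant step sizes are likewise assumptions for illustration only, not part of the paper.

```python
import numpy as np

# A sketch of one time step of Algorithm 1 (Regular-Gradient AC), filling in
# the template of Table 1. The `env` object, the feature maps `f` and `phi`,
# and the fixed step sizes are illustrative assumptions.

def alg1_step(env, state, theta, v, J_hat, f, phi, n_actions,
              alpha, beta, xi, rng):
    # Execution: draw a_t ~ pi(.|s_t; theta_t), observe s_{t+1} and r_{t+1}
    action = sample_action(rng, theta, phi, state, n_actions)
    next_state, reward = env.step(state, action)      # hypothetical interface

    # Average reward update: J_{t+1} = (1 - xi_t) J_t + xi_t r_{t+1}
    J_hat = (1.0 - xi) * J_hat + xi * reward

    # TD-error: delta_t = r_{t+1} - J_{t+1} + v_t^T f(s_{t+1}) - v_t^T f(s_t)
    delta = reward - J_hat + v @ f(next_state) - v @ f(state)

    # Critic update: v_{t+1} = v_t + alpha_t delta_t f(s_t)
    v = v + alpha * delta * f(state)

    # Actor update: theta_{t+1} = theta_t + beta_t delta_t psi(s_t, a_t)
    psi = compatible_features(theta, phi, state, action, n_actions)
    theta = theta + beta * delta * psi

    return next_state, theta, v, J_hat
```

In practice the step sizes would be decayed over time to satisfy (B3); constants are used here only to keep the sketch short.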
The next algorithm is based on the natural gradient estimate $\tilde{\nabla}J(\theta_t) = G^{-1}(\theta_t)\,\delta_t\,\psi(s_t,a_t)$ in place of the regular gradient estimate in Algorithm 1. We derive a procedure for recursively estimating $G^{-1}(\theta)$, and show in Lemma 5 that our estimate $G_t^{-1}$ converges to $G^{-1}(\theta)$ as $t\to\infty$ with probability one. This is required for proving convergence of this algorithm. The Fisher information matrix can be estimated in an online manner as $G_{t+1} = \frac{1}{t+1}\sum_{i=0}^{t}\psi(s_i,a_i)\,\psi(s_i,a_i)^\top$. One may obtain this recursively as $G_{t+1} = \big(1-\frac{1}{t+1}\big)G_t + \frac{1}{t+1}\psi(s_t,a_t)\,\psi(s_t,a_t)^\top$, or more generally,
$$G_{t+1} = (1-\alpha_t)\,G_t + \alpha_t\,\psi(s_t,a_t)\,\psi(s_t,a_t)^\top. \quad (10)$$
For our Algorithms 2 and 4, we require the following additional assumption for the convergence analysis.

(B4) The iterates $G_t$ and $G_t^{-1}$ satisfy $\sup_{t,\theta,s,a}\|G_t\| < \infty$ and $\sup_{t,\theta,s,a}\|G_t^{-1}\| < \infty$.

Using the Sherman-Morrison matrix inversion lemma, one obtains
$$G_{t+1}^{-1} = \frac{1}{1-\alpha_t}\bigg[G_t^{-1} - \alpha_t\,\frac{\big(G_t^{-1}\psi(s_t,a_t)\big)\big(G_t^{-1}\psi(s_t,a_t)\big)^\top}{1-\alpha_t + \alpha_t\,\psi(s_t,a_t)^\top G_t^{-1}\psi(s_t,a_t)}\bigg]. \quad (11)$$

Lemma 5 For any given parameter $\theta$, the iterates $G_t^{-1}$ in Eq. 11 satisfy $G_t^{-1} \to G(\theta)^{-1}$ as $t\to\infty$ with probability one.

Proof: It is easy to see from Eq. 10 that $G_t \to G(\theta)$ as $t\to\infty$ with probability one, for any given $\theta$ held fixed. Now, for a fixed $\theta$,
$$\|G_t^{-1} - G^{-1}(\theta)\| = \|G^{-1}(\theta)\big(G(\theta)G_t^{-1} - I\big)\| = \|G^{-1}(\theta)\big(G(\theta) - G_t\big)G_t^{-1}\| \leq \sup_{\theta}\|G^{-1}(\theta)\| \sup_{t,\theta,s,a}\|G_t^{-1}\| \cdot \|G(\theta) - G_t\| \to 0 \ \text{ as } t\to\infty,$$
by assumption (B4). The claim follows.

We thus have the following AC with Fisher information matrix algorithm. This algorithm stores a matrix $G^{-1}$ and two parameter vectors, $\theta$ and $v$, and its per-time-step computational cost is linear in the number of value function parameters and quadratic in the number of policy parameters.

Algorithm 2 (Natural-Gradient AC with Fisher Information Matrix):
Critic Update: $v_{t+1} = v_t + \alpha_t\,\delta_t\, f(s_t)$,
Actor Update: $\theta_{t+1} = \theta_t + \beta_t\, G_{t+1}^{-1}\,\delta_t\,\psi(s_t,a_t)$,
with the estimate of the inverse Fisher information matrix updated according to Eq. 11. We let $G_0^{-1} = kI$, where $k$ is a positive constant. Thus $G_0^{-1}$ and $G_0$ are positive definite and symmetric matrices. From Eq. 10, $G_t$, $t>0$, can be seen to be positive definite and symmetric as these are convex combinations of positive definite and symmetric matrices. Hence, $G_t^{-1}$, $t>0$, are positive definite and symmetric as well.

As we mentioned in Section 3, it is better to think of the compatible approximation $w^\top\psi(s,a)$ as an approximation of the advantage function rather than of the state-action value function. In our next algorithm, we tune the parameters $w$ so as to minimize an estimate of the least-squared error $E^\pi(w) = E_{s\sim d^\pi,\, a\sim\pi}\big[(w^\top\psi(s,a) - A^\pi(s,a))^2\big]$. The gradient of $E^\pi(w)$ is thus $\nabla_w E^\pi(w) = 2\,E_{s\sim d^\pi,\, a\sim\pi}\big[(w^\top\psi(s,a) - A^\pi(s,a))\,\psi(s,a)\big]$, which can be estimated as
$$\widehat{\nabla_w E^\pi}(w) = 2\big[\psi(s_t,a_t)\,\psi(s_t,a_t)^\top w - \delta_t\,\psi(s_t,a_t)\big].$$
Hence, we update the advantage parameters $w$ along with the value function parameters $v$ in the critic update of this algorithm. As with Peters et al. (2005), we use the natural gradient estimate $\tilde{\nabla}J(\theta_t) = w_{t+1}$ in the actor update of Algorithm 3. This algorithm stores three parameter vectors, $v$, $w$, and $\theta$, and its per-time-step computational cost is linear in the number of value function parameters and quadratic in the number of policy parameters.

Algorithm 3 (Natural-Gradient AC with Advantage Parameters):
Critic Update: $v_{t+1} = v_t + \alpha_t\,\delta_t\, f(s_t)$,
$w_{t+1} = \big[I - \alpha_t\,\psi(s_t,a_t)\,\psi(s_t,a_t)^\top\big]w_t + \alpha_t\,\delta_t\,\psi(s_t,a_t)$,
Actor Update: $\theta_{t+1} = \theta_t + \beta_t\, w_{t+1}$.
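As a computational illustration of the updates above, the sketch below implements the Sherman-Morrison recursion of Eq. 11 (the inverse-Fisher estimate used by Algorithms 2 and 4) and the Algorithm 3 critic/actor updates. Variable names and calling conventions are assumptions made here for illustration; the symmetry of `G_inv` is assumed, as established in the text.

```python
import numpy as np

# Sketches of Eq. 11 and of the Algorithm 3 updates. G_inv is assumed
# symmetric positive definite, matching the discussion of G_0 = (1/k) I
# and its convex-combination updates in the text.

def sherman_morrison_update(G_inv, psi, alpha):
    """Eq. 11: recursive update of G_{t+1}^{-1} for
    G_{t+1} = (1 - alpha_t) G_t + alpha_t psi psi^T."""
    G_psi = G_inv @ psi
    denom = 1.0 - alpha + alpha * (psi @ G_psi)
    return (G_inv - alpha * np.outer(G_psi, G_psi) / denom) / (1.0 - alpha)

def alg3_critic_actor(theta, v, w, delta, f_s, psi, alpha, beta):
    """Algorithm 3 (Natural-Gradient AC with Advantage Parameters)."""
    # Critic: v_{t+1} = v_t + alpha_t delta_t f(s_t)
    v = v + alpha * delta * f_s
    # Advantage parameters: w_{t+1} = [I - alpha_t psi psi^T] w_t + alpha_t delta_t psi
    w = w - alpha * psi * (psi @ w) + alpha * delta * psi
    # Actor: theta_{t+1} = theta_t + beta_t w_{t+1}
    theta = theta + beta * w
    return theta, v, w
```

Algorithm 2 would instead feed `sherman_morrison_update` into its actor step, and Algorithm 4 would additionally use the resulting inverse in its update for `w`, as described next.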
Although the estimates of $G^{-1}(\theta)$ are not explicitly computed and used in Algorithm 3, the convergence analysis of this algorithm shows that the overall scheme still moves in the direction of the natural gradient of average reward. In Algorithm 4, however, we explicitly estimate $G^{-1}(\theta)$ (as in Algorithm 2), and use it in the critic update for $w$. The overall scheme is again seen to follow the direction of the natural gradient of average reward. Here, we let
$$\widetilde{\nabla_w E^\pi}(w) = 2\,G_t^{-1}\big[\psi(s_t,a_t)\,\psi(s_t,a_t)^\top w - \delta_t\,\psi(s_t,a_t)\big]$$
be the estimate of the natural gradient of the least-squared error $E^\pi(w)$. This also simplifies the critic update for $w$. Algorithm 4 stores a matrix $G^{-1}$ and three parameter vectors, $v$, $w$, and $\theta$, and its per-time-step computational cost is linear in the number of value function parameters and quadratic in the number of policy parameters.

Algorithm 4 (Natural-Gradient AC with Advantage Parameters and Fisher Information Matrix):
Critic Update: $v_{t+1} = v_t + \alpha_t\,\delta_t\, f(s_t)$,
$w_{t+1} = (1-\alpha_t)\,w_t + \alpha_t\, G_{t+1}^{-1}\,\delta_t\,\psi(s_t,a_t)$,
Actor Update: $\theta_{t+1} = \theta_t + \beta_t\, w_{t+1}$,
where the estimate of the inverse Fisher information matrix is updated according to Eq. 11.

5 Convergence of Actor-Critic Algorithms

Since our algorithms are gradient-based, one cannot expect to prove convergence to a globally optimal policy. The best that one could hope for is convergence to a local maximum of $J(\theta)$. However, since the critic will generally converge to an approximation of the desired projection of the value function (defined by the value function features $f$) in the proposed algorithms, the corresponding convergence results are necessarily weaker, as indicated by the following theorem.

Theorem 1 For the parameter iterations in Algorithms 1-4, we have $(\hat{J}_t, v_t, \theta_t) \to \{(J(\theta), v^\theta, \theta) \mid \theta \in Z\}$ as $t\to\infty$ with probability one, where the set $Z$ corresponds to the set of local maxima of a performance function whose gradient is $E[\delta_t^\pi\,\psi(s_t,a_t)\mid\theta]$ (cf. Lemma 4).

(The proof of this theorem requires another assumption, viz., (A3) in the supporting document, in addition to (B1)-(B3) for Algorithms 1 and 3, resp. (B1)-(B4) for Algorithms 2 and 4; this assumption was not included here due to space limitations.) For the proof of this theorem, please refer to Section 6 (Convergence Analysis) in the supporting document. This theorem indicates that the policy and state value function parameters converge to a local maximum of a performance function that corresponds to the average reward plus a measure of the TD-error inherent in the function approximation.

6 Relation to Previous Algorithms

Actor-Critic Algorithm of Konda and Tsitsiklis (2000): Unlike our Algorithms 2-4, their algorithm does not use estimates of the natural gradient in its actor update. Their algorithm is similar to our Algorithm 1, but with some key differences. 1) Konda's algorithm uses the Markov process of state-action pairs, and thus its critic update is based on a state-action value function. Algorithm 1 uses the state process, and therefore its critic update is based on a state value function. 2) Whereas Algorithm 1 uses the TD-error in both the critic and actor recursions, Konda's algorithm uses the TD-error only in its critic update; the actor recursion in Konda's algorithm uses an action-value estimate instead. Because the TD-error is a consistent estimate of the advantage function (Lemma 3), the actor recursion in Algorithm 1 uses estimates of advantages instead of action values, which may result in lower variance. 3) The convergence analysis of Konda's algorithm is based on the martingale approach and aims at bounding error terms and directly showing convergence. Convergence to a local optimum is shown when a TD(1) critic is used.
For the case where $\lambda < 1$, they show that, given an $\epsilon > 0$, there exists $\lambda$ close enough to one such that when a TD($\lambda$) critic is used, one gets $\liminf_t \|\nabla J(\theta_t)\| < \epsilon$ with probability one. Unlike Konda and Tsitsiklis, we primarily use the ordinary differential equation (ODE) based approach for our convergence analysis. Even though we also use martingale arguments in our analysis, these are restricted to showing that the noise terms asymptotically diminish and that the resulting scheme can be viewed as an Euler discretization of the associated ODE.

Natural Actor-Critic Algorithm of Peters et al. (2005): Our Algorithms 2-4 extend their algorithm by being fully incremental and by providing convergence proofs. Peters's algorithm uses a least-squares TD method in its critic update, whereas all our algorithms are fully incremental. It is not clear how to satisfactorily incorporate least-squares TD methods in a context in which the policy is changing. Our proof techniques do not immediately extend to this case. As in Peters's algorithm, we use estimates of the advantage function in Algorithms 3 and 4.

7 Conclusions and Future Work

We have introduced and analyzed four AC algorithms utilizing both linear function approximation and bootstrapping, a combination which seems essential to large-scale applications of reinforcement learning. All of the algorithms are based on existing ideas such as TD learning, natural policy gradients, and two-timescale stochastic approximation, but combined in new ways. The main contribution of this paper is proving convergence of the four algorithms to a local maximum in the space of policy and value function parameters.

Our Algorithms 2-4 are explorations of the use of natural gradients within an AC architecture. The way we use natural gradients is distinctive in that it is totally incremental: the policy is changed on every time step, yet the gradient computation is never reset as it is in the algorithm of Peters et al. (2005). Algorithm 3 is perhaps the most interesting of the three natural-gradient algorithms. It never explicitly stores an estimate of the inverse Fisher information matrix and, as a result, it requires less computation. In empirical experiments (not reported here) we found it easier to find good parameter settings for Algorithm 3 than for the other natural-gradient algorithms, and perhaps because of this, it converged more rapidly than they did and than Konda's algorithm. We in fact found all our algorithms to perform better than Konda's algorithm.

There are a number of ways in which our results are limited and suggest future work. 1) It is important to characterize the quality of the converged solutions, either by bounding the performance loss due to bootstrapping and approximation error, or through a thorough empirical study. 2) The algorithms can be extended to incorporate eligibility traces and least-squares methods. As discussed earlier, the former seems straightforward whereas the latter requires more fundamental extensions. 3) Application of the algorithms to real-world problems is needed to assess their ultimate utility.

References

Bagnell, J., & Schneider, J. (2003). Covariant policy search. Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence.

Barto, A., Sutton, R., & Anderson, C. (1983).
Neuron-like elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 835-846.

Baxter, J., & Bartlett, P. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319-350.

Greensmith, E., Bartlett, P., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471-1530.

Kakade, S. (2002). A natural policy gradient. Proceedings of NIPS 14.

Konda, V., & Tsitsiklis, J. (2000). Actor-critic algorithms. Proceedings of NIPS 12 (pp. 1008-1014).

Marbach, P. (1998). Simulation-based methods for Markov decision processes. Doctoral dissertation, MIT.

Peters, J., Vijayakumar, S., & Schaal, S. (2005). Natural actor-critic. Proceedings of the Sixteenth European Conference on Machine Learning (pp. 280-291).

Sutton, R. (1984). Temporal credit assignment in reinforcement learning. Doctoral dissertation, University of Massachusetts Amherst.

Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.

Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. MIT Press.

Sutton, R., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. Proceedings of Advances in Neural Information Processing Systems 12.

Williams, R. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256.