Temporal Difference Bayesian Model Averaging: A Bayesian Perspective on Adapting Lambda

Carlton Downey (carlton.downey@ecs.vuw.ac.nz), Victoria University of Wellington, Wellington, New Zealand
Scott Sanner (ssanner@nicta.com.au), Statistical Machine Learning Group & AI Group, NICTA & ANU, Canberra, Australia

Abstract

Temporal difference (TD) algorithms are attractive for reinforcement learning due to their ease of implementation and their use of "bootstrapped" return estimates to make efficient use of sampled data. In particular, TD(λ) methods comprise a family of reinforcement learning algorithms that often yield fast convergence by averaging multiple estimators of the expected return. However, TD(λ) chooses a very specific way of averaging these estimators based on the fixed parameter λ, which may not lead to optimal convergence rates in all settings. In this paper, we derive an automated Bayesian approach to setting λ that we call temporal difference Bayesian model averaging (TD-BMA). Empirically, TD-BMA always performs as well as, and often much better than, the best fixed λ for TD(λ) (even when performance for different λ values varies across problems), without requiring that λ or any analogous parameter be manually tuned.

1. Introduction

In reinforcement learning (RL) it can be crucial to minimize sample complexity (Kakade, 2003) -- a bound on the number of samples required to achieve a given quality of policy. Fewer samples may lead to faster algorithms, but perhaps even more crucial to real-world applications, fewer samples require fewer expensive interactions with the environment. This is especially important when these samples involve the risk of mistakes that have real costs (monetary or otherwise).

Temporal difference (TD) learning methods have proved to be an empirically effective way to reduce the sample complexity of RL algorithms (Sutton & Barto, 1998). The primary insight of these methods is that rather than taking independent Monte Carlo (MC) sample estimates of the value being learned, one may achieve faster learning rates by reusing "bootstrapped" value estimates from immediate successor states. One popular and particularly effective TD method -- TD(λ) -- goes one step further in reducing sample complexity by averaging over different n-step lookahead estimators of the same value, weighting by a function of λ that decays exponentially in n.

In this work, we build on the TD(λ) approach of computing a weighted average of n-step return value estimates. However, our weighted averaging approach is intended to directly exploit one of the reasons why TD(λ) methods are often successful in practice at reducing sample complexity: weighted averages can reduce the variance of the value estimate, leading to less noise in the update process and ultimately faster convergence. Based on this view, we adopt a statistically principled method of weighting the n-step return value estimates in a TD(λ)-like approach we call Temporal Difference Bayesian Model Averaging (TD-BMA). Empirically, we show that TD-BMA generally performs much better than the best fixed λ for TD(λ) and does so automatically, without requiring λ or any analogous parameter to be manually tuned for a problem. Given the promise of TD-BMA methods, we conclude with a discussion of important future generalizations.

2. Preliminaries
2.1. Markov Decision Processes

We assume the decision-making environment is a Markov decision process (MDP) (Puterman, 1994) in which an agent interacts by repeatedly executing an action in the currently observed state, receiving a reward signal and then stochastically transitioning to a successor state. Formally, an MDP is a tuple ⟨S, A, T, R, γ⟩. S = {s_1, ..., s_n} is a finite set of fully observable states. A = {a_1, ..., a_m} is a finite set of actions. T : S × A × S → [0, 1] is a stationary, Markovian transition function, where we write T(s, a, s') = P(s'|s, a). We assume that a stochastic reward function R : S × A × ℝ → [0, 1] is associated with every state and action, where we write R(s, a, r) = P(r|s, a) for r ∈ ℝ. γ (0 ≤ γ ≤ 1) is a discount factor used to specify that a reward obtained t timesteps into the future is discounted by γ^t; γ = 1 is permitted if the total accumulated reward is finite.

A stochastic exploration policy π : S × A → [0, 1] specifies a probability distribution π(s, a) = P(a|s) over actions a to take in each state s. The value Q^π(s, a) of taking an action a in state s and then following the policy π thereafter can be defined using the infinite-horizon, expected discounted reward criterion:

    Q^π(s, a) = E_π[ Σ_{t=0}^∞ γ^t · r_{t+1} | s_0 = s, a_0 = a ]        (1)

where r_{t+1} is the reward obtained after taking action a_t in state s_t at time t (s_0 and a_0 respectively represent the state and action at t = 0). We can then define a value function V^π(s) = Q^π(s, π(s)) that represents the expected value obtained by starting in state s and acting according to π. The objective in an MDP is to find a policy π* such that ∀π, s. V^{π*}(s) ≥ V^π(s); at least one such optimal policy is guaranteed to exist (Puterman, 1994).

In the reinforcement learning (RL) setting, the transition probabilities P(s'|s, a) and reward probabilities P(r|s, a) may not be explicitly known to the agent, although both can be sampled from experience. In order to learn an optimal policy, we can adopt the generalized policy iteration (GPI) framework shown in Algorithm 1, which is known to capture many reinforcement learning approaches (Sutton & Barto, 1998). GPI interleaves policy evaluation and greedy policy update steps (steps 2 and 3, respectively). Since we require stochastic policies to ensure all states and actions are explored in RL, we use the simple ε-greedy exploration policy shown in step 3, where exploratory actions are chosen uniformly at random with probability ε.

Algorithm 1 Generalized Policy Iteration
begin
  1. Start with an arbitrary initial policy π_0 and i = 0.
  2. Estimate Q^{π_i}(s, a).
  3. π_{i+1}(s, a) = P(a|s) = 1 − ε + ε/|A| if a = argmax_{a'} Q^{π_i}(s, a'), and ε/|A| otherwise.
  4. If termination criteria are not met, set i = i + 1 and go to 2.
end

Because step 3 is given, the remainder of this paper focuses on efficient policy evaluation in step 2 of GPI via temporal difference reinforcement learning. For the policy evaluation task, we assume a standard episodic RL setting consisting of multiple trials. Each trial terminates at some (different) time T, after which all rewards r_t for t > T + 1 are assumed to be 0.
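To make the GPI loop concrete, the following Python sketch (an illustration only, not the authors' code) implements the ε-greedy improvement of step 3 around an arbitrary tabular policy evaluation routine; the names evaluate_Q, n_states and n_actions are placeholders introduced here.

    import numpy as np

    def epsilon_greedy_policy(Q, epsilon):
        """Step 3 of GPI: greedy action gets 1 - eps plus eps/|A| uniform mass."""
        n_states, n_actions = Q.shape
        pi = np.full((n_states, n_actions), epsilon / n_actions)
        greedy = Q.argmax(axis=1)
        pi[np.arange(n_states), greedy] += 1.0 - epsilon
        return pi                      # each row sums to 1

    def generalized_policy_iteration(evaluate_Q, n_states, n_actions,
                                     epsilon=0.05, n_iterations=100):
        """Alternate policy evaluation (step 2) and eps-greedy improvement (step 3)."""
        Q = np.zeros((n_states, n_actions))
        pi = np.full((n_states, n_actions), 1.0 / n_actions)   # arbitrary initial policy
        for _ in range(n_iterations):
            Q = evaluate_Q(pi, Q)                 # step 2, e.g. SARSA(lambda) or TD-BMA
            pi = epsilon_greedy_policy(Q, epsilon)  # step 3
        return pi, Q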
2.2. Temporal Difference Learning

Temporal difference (TD) RL methods (Sutton & Barto, 1998) were introduced in order to allow efficient reuse of all sampled rewards during policy evaluation. The basic idea begins with the observation that for time t, a 1-step sample return R_t^(1) of Q^π(s_t, a_t) given in (1) can be "bootstrapped" from the estimate for Q^π(s_{t+1}, a_{t+1}):

    R_t^(1) = r_{t+1} + γ Q^π(s_{t+1}, a_{t+1})        (2)

where s_{t+1} and a_{t+1} are sampled according to T and π, and r_{t+1} is sampled according to P(r_{t+1}|s_t, a_t). Then given R_t^(1), we arrive at a simple 1-step TD update rule that adjusts Q^π(s_t, a_t) according to the TD error ΔQ_t^(1) = R_t^(1) − Q^π(s_t, a_t):

    Q^π(s_t, a_t) ← Q^π(s_t, a_t) + α ( R_t^(1) − Q^π(s_t, a_t) )        (3)

Here, α > 0 is a learning rate. In the literature, this Q-value version of the TD update is known as SARSA, for state, action, reward, state, action (Sutton & Barto, 1998).

In general, rather than just a 1-step return, we can define arbitrary n-step (n > 0) bootstrapped returns:

    R_t^(n) = Σ_{i=1}^{n} γ^{i−1} r_{t+i} + γ^n Q^π(s_{t+n}, a_{t+n})        (4)

where all actions a_{t+1}, ..., a_{t+n} are sampled from π(s, a) = P(a|s), all states s_{t+1}, ..., s_{t+n} are sampled from T(s, a, s') = P(s'|s, a), and all rewards r_{t+1}, ..., r_{t+n} are sampled according to P(r_{t+i}|s_{t+i−1}, a_{t+i−1}). All R_t^(n) are estimators of Q^π(s_t, a_t). Since it is well known that averaging samples of multiple estimators yields reduced variance over the use of any single sample, it is clearly advantageous to use some form of averaged return in place of the single 1-step return R_t^(1) to reduce variance.
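As a concrete illustration of (2)-(4), here is a minimal tabular Python sketch (not from the paper) of the 1-step SARSA update and of an n-step bootstrapped return; the transition arguments are assumed to come from sampling the environment under π.

    def sarsa_1step_update(Q, s, a, r, s_next, a_next, alpha, gamma):
        """Eqs. (2)-(3): bootstrap a 1-step return and move Q toward it."""
        R1 = r + gamma * Q[s_next][a_next]        # R_t^(1)
        Q[s][a] += alpha * (R1 - Q[s][a])         # TD-error update
        return Q

    def n_step_return(rewards, Q, s_n, a_n, gamma):
        """Eq. (4): sum_i gamma^(i-1) r_{t+i} + gamma^n Q(s_{t+n}, a_{t+n}).
        rewards holds r_{t+1}, ..., r_{t+n} observed along the sampled trajectory."""
        n = len(rewards)
        ret = sum(gamma ** i * r for i, r in enumerate(rewards))
        return ret + gamma ** n * Q[s_n][a_n]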
One particularly elegant and computationally efficient weighted averaging approach is given by TD(λ) (0 ≤ λ ≤ 1), which defines the λ-return (Sutton & Barto, 1998):

    R_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^(n)        (5)

One can verify that the geometric series sum of weights (1 − λ) Σ_{n=1}^{∞} λ^{n−1} = 1. Then, instead of the 1-step TD update, we can substitute the 1-step TD error ΔQ_t^(1) with the n-step averaged TD(λ) error

    ΔQ_t^λ = R_t^λ − Q^π(s_t, a_t)        (6)

in (3), and thus obtain the TD(λ) update rule for Q-values known as SARSA(λ).

Looking ahead to a form that will be useful for our derivation of temporal difference Bayesian model averaging, we derive a simple offline¹ algorithm for TD(λ), following Section 7.4 of (Sutton & Barto, 1998), by reorganizing the summation and expressing the return R_t^λ recursively in terms of R_{t+1}^λ:

    R_t^λ = Σ_{k=t}^{T−1} (γλ)^{k−t} ( R_k^(1) − γλ Q^π(s_{k+1}, a_{k+1}) )
          = R_t^(1) − γλ Q^π(s_{t+1}, a_{t+1}) + γλ R_{t+1}^λ
          = r_{t+1} + γ Q^π(s_{t+1}, a_{t+1}) − γλ Q^π(s_{t+1}, a_{t+1}) + γλ R_{t+1}^λ
          = r_{t+1} + γ(1 − λ) Q^π(s_{t+1}, a_{t+1}) + γλ R_{t+1}^λ        (7)

Combining the update in (3) with the error ΔQ_t^λ in (6) and the recursive definition of R_t^λ in (7), we obtain the offline SARSA(λ) algorithm shown in Algorithm 2, which is called once at the end of every trial of policy evaluation of T time steps.

¹ Meaning that, in a learning task consisting of trials, the TD(λ) update is performed once for all encountered states at the end of a trial.

Algorithm 2 SARSA(λ)-Offline-Update(T)
begin
  R_T^λ = r_{T+1}
  for (t = T ... 0) do
    if (t < T) then
      // Compute λ-average of returns
      R_t^λ = r_{t+1} + γ(1 − λ) Q^π(s_{t+1}, a_{t+1}) + γλ R_{t+1}^λ
    Q^π(s_t, a_t) = Q^π(s_t, a_t) + α ( R_t^λ − Q^π(s_t, a_t) )
end
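A direct Python rendering of Algorithm 2 is sketched below (an illustration under our own conventions, not the authors' code): the trial is assumed to be recorded as lists states and actions holding s_0..s_T and a_0..a_T, and rewards, where rewards[t] holds r_{t+1}; the recursion (7) is then applied backwards from the end of the trial.

    def sarsa_lambda_offline_update(Q, states, actions, rewards, alpha, gamma, lam):
        """Offline SARSA(lambda), Algorithm 2: one backward pass at the end of a trial."""
        T = len(states) - 1
        R_lam = rewards[T]                          # R_T^lambda = r_{T+1}
        for t in range(T, -1, -1):
            if t < T:
                # Recursion (7): lambda-average of the 1-step bootstrap and R_{t+1}^lambda
                R_lam = rewards[t] + gamma * ((1.0 - lam) * Q[states[t + 1]][actions[t + 1]]
                                              + lam * R_lam)
            Q[states[t]][actions[t]] += alpha * (R_lam - Q[states[t]][actions[t]])
        return Q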
3. Temporal Difference Bayesian Model Averaging

If we revisit the TD(λ) return in (5), we note that this expression chooses a particular fixed function of λ to weight each bootstrapped n-step return estimate. While there are excellent computational reasons for choosing this particular weighting scheme, we set aside such issues for now and set out to answer a simple question: is there a potentially better way to set the weights of each n-step return, e.g., with an objective to reduce variance in the value estimator?

One commonly used technique for reducing estimator variance is Bayesian model averaging (BMA). In particular, BMA provides an intuitive, data-dependent way to adjust the weights of multiple estimators according to the likelihood that they might have generated the observed data. But how do we combine BMA with TD estimators? We begin with the general case and then proceed to develop a specific model that is effective in practice and computationally efficient -- having the same time and space complexity as SARSA(λ), but adapting λ in a Bayesian manner.

3.1. General Case

In the general Bayesian model averaging setting, we maintain the Q-values Q^π(s, a) not as constants, but as a multivariate probability distribution P(q|D), where D represents the set of all returns observed for a state and action and q = (q_{s_1,a_1}, ..., q_{s_1,a_|A|}, ..., q_{s_|S|,a_1}, ..., q_{s_|S|,a_|A|}) ∈ ℝ^{|S||A|}. In Section 3.2 we discuss concrete ways to maintain this Q-value distribution, but for now we assume it is given and focus on how to use it for TD learning.

Given P(q_{s,a}|D), we derive an expected return model R_t^BMA for state s = s_t and action a = a_t based on Bayesian model averaging:

    R_t^BMA = E_{P(q|D)}[q_{s,a}]
            = ∫_{q_{s,a} ∈ ℝ} q_{s,a} P(q_{s,a}|D) dq_{s,a}
            = ∫_{q_{s,a} ∈ ℝ} q_{s,a} Σ_{m ∈ {MC, TD}} P(q_{s,a}|m) P(m|D) dq_{s,a}
            = Σ_{m ∈ {MC, TD}} E_{P(q_{s,a}|m)}[q_{s,a}] · P(m|D)        (8)

where E_{P(q_{s,a}|m)}[q_{s,a}] is the prediction of model m and P(m|D) is the weight of model m. The first step expands the definition of expectation and marginalizes out irrelevant random variables. The second step introduces Bayesian model averaging by introducing and marginalizing over an additional model variable m, specifying either the MC model (the 1-step local Monte Carlo sample return R_{t+1}^(1)) or the TD model (the 1-step temporal difference bootstrapped return Q^π(s_{t+1}, a_{t+1})), assuming each model of q_{s,a} is independent of D given m. Finally, distributing q_{s,a}, swapping the ∫ and Σ, and exploiting the independence of P(m|D) from q_{s,a}, the integral is expressed in expectation form. This yields the intuitive Bayesian model averaging result that we should weight (i.e., trust) the expected predictions of the MC and TD models by the conditional probability of these models given the data D; this fundamental observation underlies the premise of the paper.

If we arbitrarily set P(MC|D) = λ and P(TD|D) = (1 − λ), then R_t^BMA = R_t^λ and we have exactly rederived the offline view of SARSA(λ). But given this Bayesian model averaging perspective, we ask whether P(MC|D) = λ and P(TD|D) = (1 − λ) with fixed λ are really sensible model weightings. In fact, these weights are not data-dependent, as the form of P(MC|D) and P(TD|D) suggests they should be. In the next section we derive a novel model weighting for P(MC|D) and P(TD|D) that is both data-dependent and as efficient to compute as the non-data-dependent SARSA(λ).

3.2. Gaussian Model

We assume the Q-values of each state-action pair are independently Gaussian distributed (Dearden et al., 1998); thus we write local models m over each q_{s,a} as follows:²

    P(q_{s,a}|m) = N(q_{s,a}; μ^m_{s,a}, (σ^m_{s,a})²)        (9)

² For each m ∈ {MC, TD}, there is actually a set of models {m_{s,a}} for all state-action pairs; we drop the subscript and write m = m_{s,a} when s and a are clear from context.

Each model m is summarized by sufficient statistics (μ^m_{s,a}, σ^m_{s,a}) representing the mean and standard deviation of the normal distribution N over q_{s,a}. For models m ∈ {MC, TD}, the model means are respectively μ^MC_{s,a} = q^MC_{s,a} and μ^TD_{s,a} = q^TD_{s,a}, where q^MC_{s,a} and q^TD_{s,a} are defined w.r.t. s = s_t and a = a_t:

    q^TD_{s,a} = r_{t+1} + γ Q^π(s_{t+1}, a_{t+1})        (10)
    q^MC_{s,a} = r_{t+1} + γ R^BMA_{t+1}                  (11)

We note some crucial details: (a) q^MC_{s,a} and q^TD_{s,a} are samples of the random variable q_{s,a} -- we construct Bayesian models of q_{s,a} using these MC and TD samples as the means μ^MC_{s,a} and μ^TD_{s,a}; (b) looking ahead, we provide the recursive definition of R^BMA_{t+1} in (15); (c) technically, the MC return is only locally Monte Carlo (not a true MC return to the end of the trial).

Given the means μ^MC_{s,a} and μ^TD_{s,a}, we could explicitly compute σ^MC_{s,a} and σ^TD_{s,a} from the data set D_{s,a} = {R_t^BMA | s = s_t, a = a_t} of observed returns. But this would require caching all data and recomputing σ^MC_{s,a} and σ^TD_{s,a} from the new means μ^MC_{s,a} and μ^TD_{s,a} on each state-action visit. As this is prohibitively expensive, we must avoid caching D. However, we note that an excellent surrogate standard deviation that captures the general deviation characteristics of the observed returns D_{s,a} is simply the standard deviation σ_{s,a} of D_{s,a}:

    μ_{s,a} = ( Σ_{d ∈ D_{s,a}} d ) / |D_{s,a}|        σ_{s,a} = ( ( Σ_{d ∈ D_{s,a}} (d − μ_{s,a})² ) / |D_{s,a}| )^{1/2}

Hence we set both σ^MC_{s,a} = σ^TD_{s,a} = σ_{s,a}. Later, in the context of TD-BMA in Algorithm 3, we discuss how μ_{s,a} and σ_{s,a} can be updated efficiently online.
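To make (9)-(11) concrete, the sketch below (our own illustration) computes the two model means q^TD and q^MC and reads off the shared surrogate standard deviation from running statistics; the per-state-action dictionary stats holding (N, μ, M²) triples is a bookkeeping convention introduced here, not something specified in the paper.

    def model_means(Q, s, a, s_next, a_next, r, R_bma_next, gamma):
        """Eqs. (10)-(11): the TD and (locally) MC sample means for q_{s,a}."""
        q_td = r + gamma * Q[s_next][a_next]   # bootstrap from the current Q estimate
        q_mc = r + gamma * R_bma_next          # bootstrap from the downstream BMA return
        return q_td, q_mc

    def surrogate_std(stats, s, a, default=float("inf")):
        """sigma_{s,a}: std of returns observed so far at (s, a); shared by both models."""
        N, mu, M2 = stats.get((s, a), (0, 0.0, 0.0))
        return (M2 / N) ** 0.5 if N > 1 else default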
Now that we have specified μ^m_{s,a} and σ^m_{s,a} to define P(q_{s,a}|m) for both m ∈ {MC, TD} in (9), we return to (8). For (8), we need to derive two quantities to compute R_t^BMA: (a) the model prediction, which follows trivially from the known properties of the normal distribution, E_{P(q_{s,a}|m)}[q_{s,a}] = μ^m_{s,a} = q^m_{s,a}; and (b) the model weight P(m|D), which we derive as follows, where m = m_{s,a} and C = (2πσ²_{s,a})^{−1/2}:

    P(m|D) = P(m|D_{s,a})                                                  [independence]
           ∝ P(D_{s,a}|m) P(m)                                             [Bayes rule]
           ∝ P(D_{s,a}|m)                                                  [uniform prior]
           = Π_{d ∈ D_{s,a}} P(d|m)                                        [i.i.d.]
           = Π_{d ∈ D_{s,a}} N(d; q^m_{s,a}, σ²_{s,a})
           = Π_{d ∈ D_{s,a}} C exp( −(d − q^m_{s,a})² / (2σ²_{s,a}) )
           = C^{|D_{s,a}|} exp( (1 / (2σ²_{s,a})) Σ_{d ∈ D_{s,a}} [ −d² − (q^m_{s,a})² + 2 q^m_{s,a} d ] )        (12)

In order, we exploited the independence of the model m (for s, a) from data other than D_{s,a}, Bayes rule, an assumption of a uniform prior P(m) over models m, the i.i.d. assumption, the definition of the normal distribution N(d; q^m_{s,a}, σ²_{s,a}), and from there, simple exponential and algebraic identities. Next we simplify the sum Σ:

    Σ_{d ∈ D_{s,a}} [ −d² − (q^m_{s,a})² + 2 q^m_{s,a} d ]
      = −Σ_{d ∈ D_{s,a}} d² − Σ_{d ∈ D_{s,a}} (q^m_{s,a})² + 2 q^m_{s,a} Σ_{d ∈ D_{s,a}} d
      = −|D_{s,a}| (σ²_{s,a} + μ²_{s,a}) − |D_{s,a}| (q^m_{s,a})² + 2 q^m_{s,a} |D_{s,a}| μ_{s,a}
      = −|D_{s,a}| ( σ²_{s,a} + μ²_{s,a} + (q^m_{s,a})² − 2 q^m_{s,a} μ_{s,a} )
      = −|D_{s,a}| σ²_{s,a} − |D_{s,a}| (q^m_{s,a} − μ_{s,a})²        (13)

Here we rewrote the first sum using the identity Σ_{d ∈ D_{s,a}} d² = |D_{s,a}|(σ²_{s,a} + μ²_{s,a}), which follows from σ²_{s,a} = Σ_{d ∈ D_{s,a}} d² / |D_{s,a}| − μ²_{s,a}. We simplified the second sum by noting that its summand is independent of the variable being summed over. And for the third sum, we have simply noted that it is (modulo |D_{s,a}|) the definition of the mean μ_{s,a} of D_{s,a}. Now, substituting (13) into the exponent of (12), we obtain:

    P(m|D) ∝ C^{|D_{s,a}|} exp( −|D_{s,a}| σ²_{s,a} / (2σ²_{s,a}) ) exp( −|D_{s,a}| (q^m_{s,a} − μ_{s,a})² / (2σ²_{s,a}) )
           = e^{−|D_{s,a}|/2} [ C exp( −(q^m_{s,a} − μ_{s,a})² / (2σ²_{s,a}) ) ]^{|D_{s,a}|}
           = e^{−|D_{s,a}|/2} [ N(q^m_{s,a}; μ_{s,a}, σ²_{s,a}) ]^{|D_{s,a}|}        (14)

This is both a pleasing and important result. We began with P(m|D), where the model m had a mean q^m_{s,a} likely to change on every update (it is itself a sample). This raised the possibility that we might have needed to recompute the entire model posterior P(m|D) every time the data set D was updated. However, by various manipulations and identities, we showed in the final result that in fact we need only know |D_{s,a}| and the first and second moments (μ_{s,a} and σ²_{s,a}) of D_{s,a} in order to compute P(m|D). This is important since we show next that all of these statistics can be updated online in constant time, without caching D.

Finally, we derive the exact individual probabilities P(m|D) for m ∈ {MC, TD}, where (a) N_{s,a} = |D_{s,a}| and (b) we note that the proportionality constants (call them C_MC and C_TD) for the respective models P(MC|D) and P(TD|D) are independent of the model and hence factor out and cancel from P(m|D), leaving just N(·):

    P(m|D) = P(m|D) / Σ_{m' ∈ {MC, TD}} P(m'|D)
           = [N(q^m_{s,a}; μ_{s,a}, σ²_{s,a})]^{N_{s,a}} / ( [N(q^MC_{s,a}; μ_{s,a}, σ²_{s,a})]^{N_{s,a}} + [N(q^TD_{s,a}; μ_{s,a}, σ²_{s,a})]^{N_{s,a}} )

With this result, we have derived all components required to compute R_t^BMA from (8) for the Gaussian BMA model.
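The normalized weight above depends only on the sufficient statistics (N_{s,a}, μ_{s,a}, σ_{s,a}). A minimal Python sketch follows; working in log space and falling back to equal weights when the statistics are uninformative (N = 0, σ = 0 or σ = ∞) are implementation choices made here for numerical robustness, not prescriptions from the paper.

    import math

    def bma_weight(q_mc, q_td, mu, sigma, N):
        """lambda = P(MC | D): normalized N(q^MC; mu, sigma^2)^N vs. N(q^TD; mu, sigma^2)^N.
        The Gaussian normalizing constants cancel, so only the exponents matter."""
        if N == 0 or sigma == 0 or math.isinf(sigma):
            return 0.5                 # uninformative statistics: equal weights (our choice)
        # log of N(q; mu, sigma^2)^N up to a model-independent constant
        log_mc = -N * (q_mc - mu) ** 2 / (2.0 * sigma ** 2)
        log_td = -N * (q_td - mu) ** 2 / (2.0 * sigma ** 2)
        m = max(log_mc, log_td)
        w_mc = math.exp(log_mc - m)
        w_td = math.exp(log_td - m)
        return w_mc / (w_mc + w_td)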
Recalling the definitions of q^TD_{s,a} in (10) and q^MC_{s,a} in (11), and defining λ = P(MC_{s,a}|D) (so that 1 − λ = P(TD_{s,a}|D)), we obtain:

    R_t^BMA = Σ_{m ∈ {MC_{s,a}, TD_{s,a}}} E_{P(q_{s,a}|m)}[q_{s,a}] P(m|D)
            = q^TD_{s,a} P(TD_{s,a}|D) + q^MC_{s,a} P(MC_{s,a}|D)
            = (r_{t+1} + γ Q^π(s_{t+1}, a_{t+1}))(1 − λ) + λ (r_{t+1} + γ R^BMA_{t+1})
            = (1 − λ) r_{t+1} + λ r_{t+1} + γ(1 − λ) Q^π(s_{t+1}, a_{t+1}) + γλ R^BMA_{t+1}
            = r_{t+1} + γ [ (1 − λ) Q^π(s_{t+1}, a_{t+1}) + λ R^BMA_{t+1} ]        (15)

This final result is exactly the same form as (7), except that here we have re-derived the SARSA(λ) TD return estimate in a Bayesian framework where λ is automatically adapted to the data using the Bayesian model averaging perspective. From this we note that we need only modify the SARSA(λ) update in Algorithm 2 in three ways to obtain TD-BMA in Algorithm 3: (a) in place of a fixed λ, we compute the BMA adaptive version of λ; (b) from this λ, we showed in (15) that we can compute R_t^BMA analogously to R_t^λ, and then in the Q-value update (3) we replace the error ΔQ_t^λ with ΔQ_t^BMA = R_t^BMA − Q^π(s_t, a_t); (c) using the new return R_t^BMA, we perform an online update of the sufficient statistics N_{s,a}, μ_{s,a}, σ_{s,a} of D_{s,a} required by TD-BMA.

Algorithm 3 TD-BMA-Offline-Update(T)
begin
  R_T^BMA = r_{T+1}
  for (t = T ... 0) do
    // For readability, let s = s_t, a = a_t
    if (t < T) then
      // Compute model expectations
      q^TD_{s,a} = r_{t+1} + γ Q^π(s_{t+1}, a_{t+1})
      q^MC_{s,a} = r_{t+1} + γ R^BMA_{t+1}
      // Compute BMA model weighting: λ = P(MC_{s,a}|D), 1 − λ = P(TD_{s,a}|D)
      λ = [N(q^MC_{s,a}; μ_{s,a}, σ²_{s,a})]^{N_{s,a}} / ( [N(q^MC_{s,a}; μ_{s,a}, σ²_{s,a})]^{N_{s,a}} + [N(q^TD_{s,a}; μ_{s,a}, σ²_{s,a})]^{N_{s,a}} )
      // Compute BMA return; note the equivalence R_t^BMA = (1 − λ) q^TD_{s,a} + λ q^MC_{s,a}
      R_t^BMA = r_{t+1} + γ [ (1 − λ) Q^π(s_{t+1}, a_{t+1}) + λ R^BMA_{t+1} ]
    Q^π(s, a) = Q^π(s, a) + α ( R_t^BMA − Q^π(s, a) )
    // Update μ_{s,a}, σ_{s,a} sufficient statistics (Knuth, 1998)
    N_{s,a} = N_{s,a} + 1
    δ = R_t^BMA − μ_{s,a}
    μ_{s,a} = μ_{s,a} + δ / N_{s,a}
    M²_{s,a} = M²_{s,a} + δ · (R_t^BMA − μ_{s,a})
    if N_{s,a} > 1 then σ_{s,a} = sqrt( M²_{s,a} / N_{s,a} )
end

Constant-time and constant-space online algorithms for maintaining the mean μ_{s,a} and standard deviation σ_{s,a} as new data is accumulated in D_{s,a} are given in (Knuth, 1998) and are shown in Algorithm 3; these updates also maintain N_{s,a}, which simply counts the number of updates for Q^π(s, a), and M²_{s,a}, which is required for the numerically stable, efficient online computation of σ²_{s,a}. For initialization, we assume μ_{s,a} = 0, σ_{s,a} = ∞, N_{s,a} = 0 and M²_{s,a} = 0.

Complexity Analysis: One can easily verify that both SARSA(λ) and TD-BMA require constant time per (s_t, a_t) update, and hence each trial of T steps takes O(T) time for both algorithms. While TD-BMA requires four more constants per state-action pair (N_{s,a}, μ_{s,a}, σ²_{s,a}, M²_{s,a}) than SARSA(λ), both algorithms clearly still have O(|S||A|) space complexity. Hence TD-BMA and SARSA(λ) have the same time and space complexity, even though TD-BMA manages to adapt λ in a Bayesian model averaging framework.
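Putting the pieces together, the following Python sketch of Algorithm 3 reuses the helper functions sketched above (model_means, surrogate_std, bma_weight); the trajectory lists and the stats dictionary are our own conventions, and initialization follows μ = 0, σ = ∞, N = 0, M² = 0 as stated in the text.

    def td_bma_offline_update(Q, stats, states, actions, rewards, alpha, gamma):
        """TD-BMA offline update (Algorithm 3): one backward pass at the end of a trial.
        states[t], actions[t] are s_t, a_t for t = 0..T; rewards[t] holds r_{t+1}."""
        T = len(states) - 1
        R_bma = rewards[T]                               # R_T^BMA = r_{T+1}
        for t in range(T, -1, -1):
            s, a = states[t], actions[t]
            if t < T:
                q_td, q_mc = model_means(Q, s, a, states[t + 1], actions[t + 1],
                                         rewards[t], R_bma, gamma)
                N, mu, M2 = stats.get((s, a), (0, 0.0, 0.0))
                lam = bma_weight(q_mc, q_td, mu, surrogate_std(stats, s, a), N)
                # Eq. (15): R_t^BMA = (1 - lam) q^TD + lam q^MC
                R_bma = (1.0 - lam) * q_td + lam * q_mc
            Q[s][a] += alpha * (R_bma - Q[s][a])
            # Online update of the sufficient statistics (Knuth, 1998)
            N, mu, M2 = stats.get((s, a), (0, 0.0, 0.0))
            N += 1
            delta = R_bma - mu
            mu += delta / N
            M2 += delta * (R_bma - mu)
            stats[(s, a)] = (N, mu, M2)
        return Q, stats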
4. Empirical Results

Having worked through the derivation of TD-BMA, we now reach the moment of truth: does TD-BMA really provide a better way of adapting λ than using any fixed λ across different problem domains? To answer this question, we examine the SARSA(λ) and TD-BMA policy evaluation algorithms respectively defined in Algorithms 2 and 3 within the same generalized policy iteration framework of Algorithm 1, where ε = 0.05 and we update the policy after every trial.

We evaluate SARSA(λ) and TD-BMA on two very different reinforcement learning domains: a variant of the Racetrack domain (Barto et al., 1993), where the objective is to reach the goal as fast as possible with the fewest penalties, and a large state-space variant of the Mail Robot domain with repeating multiple objectives: collect mail that randomly arrives at pickup points and deliver it to the dropoff points that the mail is addressed to.

Figure 1 (diagrams omitted). (a, b, c) The Racetrack domains evaluated in this paper: Cliff, Corner and Curves. Initial states are labeled 'S', terminal states are labeled 'F'. Black squares delineate walls and whitespace indicates legal car coordinates. Gray indicates obstacles that reset the car to the initial state and result in a penalty of -100. (d) The Mail Robot domain evaluated in this paper. The robot can pick up mail at blocks labeled 'P' that is intended for different dropoff points labeled 'D'. Gray squares in this domain incur a penalty of -100 but do not reset the robot state.

Racetrack: Formally, the state in Racetrack is a combination of a car's coordinate position ⟨x, y⟩ and velocity ⟨ẋ, ẏ⟩ in each coordinate direction. A car begins at one of the initial start states (chosen uniformly at random) with velocity ⟨ẋ, ẏ⟩ = ⟨0, 0⟩ and receives -1 for every action taken, except for 0 in the absorbing goal states. Actions ⟨ẍ, ÿ⟩ available to the car are integer accelerations in {-1, 0, 1} in each coordinate direction. If the car hits a wall, its velocity ⟨ẋ, ẏ⟩ is reset to ⟨0, 0⟩. We use discount γ = 1 and terminate trials when T > 200 steps. Every action costs -1 to execute, except for grayed areas, which incur a penalty of -100 and cause a reset to the initial state. The three Racetrack domains we use in this paper are shown in Figure 1(a, b, c).

Mail Robot: Formally, the state in Mail Robot is a combination of a robot's coordinate position ⟨x, y⟩, velocity ⟨ẋ, ẏ⟩, boolean status variables p1, p2, p3, p4 for each pickup point pi indicating whether mail has arrived, and holders on the robot for up to four pieces of mail m1, m2, m3, m4, where each piece of mail mi (if it exists) is assigned to one of the four dropoff points dj, i.e., mi ∈ {d1, d2, d3, d4}. Locomotion in this problem is identical to the Racetrack domain. Mail arrives randomly and continuously with a mean interval of 30 time steps at each pickup point. The reward is zero for all actions except encountering a gray square (-100) or successfully delivering a piece of mail (+4). We use discount factor γ = 0.9 and terminate trials after T = 300 time steps. Figure 1(d) shows the topology of the Mail Robot domain that we evaluate in this paper. We note that, taking into account all state variables, the robot could potentially explore millions of reachable states in this problem, making it very difficult for RL algorithms.
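For concreteness, a deliberately simplified Python sketch of the Racetrack dynamics described above is given below; the character-grid encoding ('S' start, 'F' finish, '#' wall, 'G' gray penalty cell), the function name and the omission of collision checks along the movement path are our own simplifications, not the authors' simulator.

    import random

    def racetrack_step(grid, state, accel):
        """One transition of a simplified Racetrack MDP.
        state = (x, y, vx, vy); accel = (ax, ay) with each component in {-1, 0, 1}.
        Returns (next_state, reward, done): walls zero the velocity, gray cells cost
        -100 and reset the car to a start cell, the finish cell is absorbing."""
        x, y, vx, vy = state
        vx, vy = vx + accel[0], vy + accel[1]
        nx, ny = x + vx, y + vy
        if not (0 <= ny < len(grid) and 0 <= nx < len(grid[0])) or grid[ny][nx] == '#':
            return (x, y, 0, 0), -1, False          # hit a wall: velocity reset
        cell = grid[ny][nx]
        if cell == 'G':                              # gray obstacle: penalty and restart
            sx, sy = random.choice([(i, j) for j, row in enumerate(grid)
                                    for i, c in enumerate(row) if c == 'S'])
            return (sx, sy, 0, 0), -100, False
        if cell == 'F':                              # absorbing goal
            return (nx, ny, vx, vy), 0, True
        return (nx, ny, vx, vy), -1, False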
Figure 2 (plots omitted; panels "Average Reward vs. Trials" for Racetrack Cliff, Racetrack Corner and Mail Robot, comparing TD-BMA with SARSA(λ) for λ ∈ {0.2, 0.4, 0.6, 0.8}). Average reward for the policy vs. the number of learning trials on two configurations of the Racetrack domain and the single configuration of the Mail Robot domain. 95% confidence intervals, obtained by averaging policy performance over 100 separate training runs, are shown for each algorithm.

Figure 3 (plots omitted; panels "Effect of Varying α" for Racetrack Curves (α ∈ {0.05, 0.15, 0.25}) and Mail Robot (α ∈ {0.15, 0.25, 0.35}), and "Lambda vs. Trials" for all four domains). (left, center) Average reward for the policy vs. the number of learning trials for various settings of α for SARSA(λ) and TD-BMA on one Racetrack and one Mail Robot domain, with 95% confidence intervals obtained by averaging policy performance over 100 separate training runs. (right) Average value of λ per trial vs. the percentage of trials completed for the various domains.

4.1. Results Summary

In Figure 2, we show the average policy reward vs. the number of learning trials, with a limit on T as defined above for each respective problem and α = 0.15 for all problems. We do not show λ = 0 (pure TD) or λ = 1.0 (pure Monte Carlo), as these were always outperformed by one of the other λ values for SARSA(λ). Most importantly, we note that the adaptive λ of TD-BMA always outperforms all other fixed values of λ in SARSA(λ) at all stages of learning.

In Figure 3 (left, center), we show the performance of TD-BMA and the best SARSA(λ) for different learning rates α. For Mail Robot, we see that TD-BMA for any α outperforms the best SARSA(λ) over all α. For Racetrack Curves, we see that TD-BMA outperforms SARSA(λ) for the same α, except α = 0.05. TD-BMA performs best with more aggressive learning rates since it reduces variance in the return estimates.

In Figure 3 (right), we see the average λ per trial vs. the percentage of trials completed (different domains had differing numbers of trials). We see that λ generally decreases over time as the value function stabilizes and TD-BMA learns that it can trust the TD return. The exception is Mail Robot, where the state is technically partially observed since mail arrival is time-dependent, yet time is not encoded in the state; because the MC return provides useful time-dependent information not given by the time-independent TD return, TD-BMA settles on λ ≈ 0.5 to maintain the best value estimate.

5. Related Work

While the mention of Bayesian reinforcement learning combined with a Gaussian value belief model may suggest a similarity to Gaussian process temporal difference learning (GPTD) (Engel et al., 2005), the two frameworks are very different in intent. GPTD makes use of a non-parametric Gaussian process in its Bayesian model, which is very different in spirit and computation from the more standard independent Gaussian distribution per state-action pair used here. Most importantly, we note that, to the best of our knowledge, the GPTD approach has not been used to explicitly adapt the λ parameter in TD methods, which is the purpose of our Bayesian model and derivation in TD-BMA.

In addition to GPTD, there are numerous other Bayesian RL approaches, e.g., (Dearden et al., 1998; 1999; Wang et al., 2005; Poupart et al., 2006). However, the focus of these other Bayesian RL approaches is typically to exploit the value distribution for the purpose of balancing the exploration vs. exploitation tradeoff. Again, our present work is concerned simply with automatically adapting the λ parameter in a Bayesian setting, although we note that many of the ideas proposed in these papers that exploit the Bayesian value model for the exploration-exploitation tradeoff may be adapted to use the value model and efficient update algorithm presented here.

(Sutton & Singh, 1994) propose three alternate methods for adaptive weighting schemes in TD algorithms. However, their first two proposals are intended only for the restricted subclass of acyclic state spaces (the domains evaluated here are cyclic), and the third proposal is a model-based RL algorithm that requires maintaining an estimate of the model (n.b., we did not incur the time and memory expense of maintaining a model estimate of T(s, a, s') for TD-BMA), which for a fixed policy could incur up to O(|S|²) time and space overhead not required by TD-BMA.

6. Concluding Remarks

This work proposed a Bayesian model averaging approach to adapting the λ parameter in temporal difference reinforcement learning methods. We began by deriving a novel Bayesian model averaging perspective of the temporal difference update intended to reduce variance in the average of the different n-step returns of TD-based SARSA(λ), and showed that, while the Bayesian perspective suggests adapting λ in a data-dependent way, standard SARSA(λ) approaches simply fix this parameter and hand-tune it to a particular problem. Using our novel Bayesian perspective, we contributed the efficient Gaussian-based TD-BMA algorithm to compute a temporal difference Bayesian model average of returns that has exactly the same time and space complexity as SARSA(λ) while automatically adapting λ in light of all previously seen data. Empirically, TD-BMA performed as well as, and generally much better than, SARSA(λ) for all fixed values of λ, without requiring manual tuning of the λ parameter.

An important step for future work is to determine whether TD-BMA can be extended to accommodate changing λ using an extension of the eligibility trace used in TD/SARSA(λ) (Sutton & Barto, 1998). This would enable an online update permitting within-trial TD-BMA learning. A side benefit of an eligibility trace-like approach to TD-BMA is that it provides one way to introduce function approximation techniques to TD-BMA, similarly to the way eligibility traces facilitate function approximation in TD/SARSA(λ). Together these ideas hold promise for a novel class of data-dependent, Bayesian λ-adaptive TD/SARSA(λ) algorithms with better performance than any fixed λ.

Acknowledgements

We thank the anonymous reviewers and David Silver for their comments.
NICTA is funded by the Australian Government's Backing Australia's Ability initiative and the Australian Research Council's ICT Centre of Excellence program.

References

Barto, Andrew G., Bradtke, Steven J., and Singh, Satinder P. Learning to act using real-time dynamic programming. Technical Report UM-CS-1993-002, University of Massachusetts Amherst, 1993.

Dearden, Richard, Friedman, Nir, and Russell, Stuart J. Bayesian Q-learning. In AAAI/IAAI, pp. 761-768, 1998.

Dearden, Richard, Friedman, Nir, and Andre, David. Model based Bayesian exploration. In UAI, 1999.

Engel, Yaakov, Mannor, Shie, and Meir, Ron. Reinforcement learning with Gaussian processes. In ICML-05, pp. 201-208, New York, NY, USA, 2005.

Kakade, Sham. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, London, UK, March 2003.

Knuth, Donald E. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, Boston, 1998.

Poupart, Pascal, Vlassis, Nikos, Hoey, Jesse, and Regan, Kevin. An analytic solution to discrete Bayesian reinforcement learning. In ICML-06, pp. 697-704, New York, NY, USA, 2006. ACM.

Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, 1994.

Sutton, R. S. and Singh, S. P. On bias and step size in temporal-difference learning. In Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pp. 91-96, New Haven, CT, 1994.

Sutton, Richard and Barto, Andrew. Reinforcement Learning. MIT Press, 1998.

Wang, Tao, Lizotte, Daniel, Bowling, Michael, and Schuurmans, Dale. Bayesian sparse sampling for online reward optimization. In ICML-05, 2005.