Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation

Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, Eric Wiewiora

Reinforcement Learning and Artificial Intelligence Laboratory, University of Alberta, Edmonton, Canada
School of Computer Science, McGill University, Montreal, Canada
Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India

Abstract

Sutton, Szepesvári and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD. This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.

1. Motivation

Temporal-difference methods based on gradient descent and linear function approximation form a core part of the modern field of reinforcement learning and are essential to many of its large-scale applications. However, the simplest and most popular methods, including TD($\lambda$), Q-learning, and Sarsa, are not true gradient-descent methods (Barnard 1993) and, as a result, the conditions under which they converge are narrower and less robust than can usually be guaranteed for gradient-descent methods. In particular, convergence cannot be guaranteed for these methods when they are used with off-policy training (Baird 1995). Off-policy training--training on data from one policy in order to learn the value of another--is useful in dealing with the exploration-exploitation tradeoff and for intra-option learning (Sutton, Precup & Singh 1999). Finding an off-policy temporal-difference algorithm that is effective on large applications with linear function approximation has been one of the most important open problems in reinforcement learning for more than a decade.

Several non-gradient-descent approaches to this problem have been developed, but none has been completely satisfactory. Second-order methods, such as LSTD (Bradtke & Barto 1996; Boyan 1999), can be guaranteed stable under general conditions, but their computational complexity is $O(n^2)$, where $n$ is the number of features used in the linear approximator, as compared to $O(n)$ for TD($\lambda$) and the other simple methods. In applications, $n$ is often too large (e.g., $n = 10^6$ in Computer Go, Silver et al. 2007) for second-order methods to be feasible. Incremental methods such as iLSTD (Geramifard et al. 2006) reduce per-time-step computation to $O(n)$, but still require $O(n^2)$ memory. Precup and colleagues (2001; 2006) have explored $O(n)$ solutions based on importance sampling, but these methods still have the potential for instability, converge slowly, or apply only to special cases.
In this paper we explore the development of true stochastic gradient-descent algorithms for temporal-difference learning with linear function approximation, building on the work of Baird (1995; 1999) and of Sutton, Szepesvári and Maei (2009).

2. Linear value-function approximation

We consider a prototypical case of temporal-difference learning, that of learning a linear approximation to the state-value function for a given policy and Markov decision process (MDP) from sample transitions. We take both the MDP and the policy to be stationary, so their combination determines the stochastic dynamics of a Markov chain. The state of the chain at each time $t$ is a random variable, denoted $s_t \in \{1, 2, \ldots, N\}$, and the state-transition probabilities are given by a matrix $P$. On each transition from $s_t$ to $s_{t+1}$, there is also a reward, $r_{t+1}$, whose distribution depends on both states. We seek to learn the parameter $\theta \in \mathbb{R}^n$ of an approximate value function $V_\theta : \mathcal{S} \to \mathbb{R}$ such that

$$V_\theta(s) = \theta^\top \phi_s \;\approx\; V^\pi(s) = \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\Big|\, s_0 = s\Big], \qquad (1)$$

where $\phi_s \in \mathbb{R}^n$ is a feature vector characterizing state $s$, and $\gamma \in [0, 1)$ is a constant called the discount rate. In this paper we consider one-step temporal-difference learning (corresponding to $\lambda = 0$ in TD($\lambda$)), in which there is one independent update to $\theta$ for each state transition and associated reward.

There are several settings corresponding to how the state transitions are generated. In the on-policy setting, for example, the state transitions come directly from the continuing evolution of the Markov chain. We assume that the Markov chain is ergodic and uni-chain, so there exists a limiting distribution $d$ such that $d_s = \lim_{t\to\infty} \mathbb{P}(s_t = s)$. (Our results apply also to the episodic case if $d_s$ is taken to be the proportion of time steps in state $s$. In this case, the sum in (1) is only over a finite number of time steps, the rows of $P$ may sum to less than 1, and $\gamma$ may be equal to 1, as long as $(\gamma P)^k \to 0$.) In the on-policy case, $d$ is linked to the transition probabilities (in particular, we know that $P^\top d = d$) and this linkage is critical to the convergence of algorithms such as conventional TD. In this paper, we consider a general setting (introduced in Sutton, Szepesvári & Maei 2009) in which the first state of each transition is chosen i.i.d. according to an arbitrary distribution $d$ that may be unrelated to $P$ (this corresponds to off-policy learning). This setting defines a probability over independent triples of state, next state, and reward random variables, denoted $(s_k, s'_k, r_k)$, with associated feature-vector random variables $\phi_k = \phi_{s_k}$ and $\phi'_k = \phi_{s'_k}$.

From these we can define, for example, the temporal-difference error,

$$\delta_k = r_k + \gamma\,\theta_k^\top \phi'_k - \theta_k^\top \phi_k,$$

used in the conventional linear TD algorithm (Sutton 1988):

$$\theta_{k+1} \leftarrow \theta_k + \alpha_k\,\delta_k\,\phi_k, \qquad (2)$$

where $\alpha_k$ is a sequence of positive step-size parameters.
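As a concrete reference point, the following minimal sketch (Python/NumPy; not from the paper) implements the conventional linear TD(0) update (2) on a stream of sampled transitions. The transition-stream interface and the default step-size and discount values are assumptions made for illustration.

```python
import numpy as np

def linear_td0(transitions, n, alpha=0.05, gamma=0.9):
    """Conventional linear TD(0), update (2): theta <- theta + alpha * delta * phi.

    `transitions` is an iterable of (phi, reward, phi_next) triples, where
    phi and phi_next are length-n feature vectors for s_k and s'_k.
    """
    theta = np.zeros(n)
    for phi, reward, phi_next in transitions:
        delta = reward + gamma * np.dot(theta, phi_next) - np.dot(theta, phi)  # TD error
        theta += alpha * delta * phi                                           # update (2)
    return theta
```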
3. Objective functions

An objective function is some function of the modifiable parameter $\theta$ that we seek to minimize by updating $\theta$. In gradient descent, the updates to $\theta$ are proportional to the gradient or sample gradient of the objective function with respect to $\theta$. The first question, then, is what to use for the objective function? For example, one natural choice might be the mean squared error (MSE) between the approximate value function $V_\theta$ and the true value function $V^\pi$, averaged over the state space according to how often each state occurs. The MSE objective function is

$$\mathrm{MSE}(\theta) \;\overset{\mathrm{def}}{=}\; \sum_s d_s \big(V_\theta(s) - V^\pi(s)\big)^2 \;=\; \|V_\theta - V^\pi\|^2_D.$$

In the second form, $V_\theta$ and $V^\pi$ are viewed as vectors with one element for each state, and the norm $\|v\|^2_D = v^\top D v$ is weighted by the matrix $D$ that has the $d_s$ on its diagonal.

In temporal-difference methods, the idea is instead to use an objective function representing how closely the approximate value function satisfies the Bellman equation. The true value function $V^\pi$ satisfies the Bellman equation exactly:

$$V^\pi = R + \gamma P V^\pi \;\overset{\mathrm{def}}{=}\; T V^\pi,$$

where $R$ is the vector with components $\mathbb{E}\{r_{t+1} \mid s_t = s\}$ and $T$ is known as the Bellman operator. A seemingly natural measure of how closely the approximation $V_\theta$ satisfies the Bellman equation is the mean-square Bellman error:

$$\mathrm{MSBE}(\theta) = \|V_\theta - T V_\theta\|^2_D. \qquad (3)$$

This is the objective function used by the most important prior effort to develop gradient-descent algorithms, that by Baird (1995, 1999). However, most temporal-difference algorithms, including TD, LSTD, and GTD, do not converge to the minimum of the MSBE. To understand this, note that the Bellman operator follows the underlying state dynamics of the Markov chain, irrespective of the structure of the function approximator. As a result, $T V_\theta$ will typically not be representable as $V_\theta$ for any $\theta$. Consider the projection operator $\Pi$ which takes any value function $v$ and projects it to the nearest value function representable by the function approximator:

$$\Pi v = V_\theta \quad \text{where} \quad \theta = \arg\min_\theta \|V_\theta - v\|^2_D.$$

In a linear architecture, in which $V_\theta = \Phi\theta$ (where $\Phi$ is the matrix whose rows are the $\phi_s^\top$), the projection operator is linear and independent of $\theta$:

$$\Pi = \Phi\,(\Phi^\top D \Phi)^{-1}\,\Phi^\top D.$$

Figure 1. Geometric relationships between the square roots of the two Bellman-error objective functions.

All the algorithms mentioned above find or converge to a fixpoint of the composed projection and Bellman operators, that is, to a value of $\theta$ such that

$$V_\theta = \Pi T V_\theta. \qquad (4)$$

We call this value of $\theta$ the TD fixpoint. In the current work, we take as our objective the deviation from this fixpoint. That is, we use as our objective function the mean-square projected Bellman error:

$$\mathrm{MSPBE}(\theta) = \|V_\theta - \Pi T V_\theta\|^2_D. \qquad (5)$$

Figure 1 shows the relationship between this and the MSBE objective function geometrically. Although many previous works have highlighted the goal of achieving the TD fixpoint (4), the present work seems to be the first to focus on the MSPBE as an objective function to be minimized (but see Antos, Szepesvári and Munos 2008, p. 100).

Further insight into the difference between the two Bellman-error objective functions can be gained by considering the episodic example in Figure 2.

Figure 2. Backward-bootstrapping example (panel titles, left to right: TD-fixpoint solution, Residual-gradient solution, With function approximation). In the left and middle panels, episodes begin in state A then transition either to B or to C with equal probability before proceeding to termination with a reward of 1 or 0 (all other transitions have zero reward). The vertical positions of the states represent their values according to the TD-fixpoint solution (left panel) and according to the residual-gradient (RG) solution (middle panel; Baird 1995, 1999). State A, for example, has height midway between 0 and 1 in both solutions, corresponding to its correct value of 1/2 (because episodes starting in A end half the time with a total reward of 1 and half the time with a total reward of 0, and $\gamma = 1$). In the TD solution, states B and C are given values of 1 and 0 respectively, whereas in the RG solution they are given the values 3/4 and 1/4. The 1, 0 values are correct in that these states are always followed by these rewards, but they result in large TD errors, of $\delta = \pm\tfrac{1}{2}$, on transitions out of A. The RG solution has smaller TD errors, of $\delta = \pm\tfrac{1}{4}$, on all of its transitions, resulting in a smaller mean-square TD error per episode of $2 \times (\tfrac{1}{4})^2 = \tfrac{1}{8}$, as compared to $(\tfrac{1}{2})^2 = \tfrac{1}{4}$ for the TD solution. That is, the RG solution splits the TD error over two transitions to minimize squared TD error overall. The RG solution is also sometimes described as backwards bootstrapping--making the value of a state look like the value of the state that preceded it as well as the state that followed it. It has long been recognized that backwards bootstrapping is to be avoided (Sutton 1988; Dayan 1992), but the RG algorithm has remained of interest because it is a gradient-descent method and thus guaranteed to converge (whereas TD($\lambda$) converges only on-policy) and because it has a "two-sample version" that minimizes the MSBE rather than the squared TD error. The key difference here is that, from A, the squared TD error tends to be large but the expected TD error (the Bellman error) tends to be zero (as long as the B and C values are distributed symmetrically around 1/2). The TD solution, 1, 0, is in fact the minimum-MSBE solution on this problem, and this has led to the widespread belief that the MSBE solves the problem of backwards bootstrapping. However, this is not the case in general; once function approximation is introduced, the MSBE and MSPBE solutions differ, and the 3/4, 1/4 solution may reappear. An example of this is shown in the right panel, where the previous state A is split into two states, A1 and A2, that share the same feature representation; they look the same and must be given the same approximate value. Trajectories start in one of the two A states each with 50% probability, then proceed deterministically either to B and 1, or to C and 0. From the observable data, this example looks just like the previous one, except that now taking multiple samples is no help because the system is deterministic, and they will all be the same. Now the 3/4, 1/4 solution minimizes not just the squared TD error, but the MSBE as well; only the MSPBE criterion puts the minimum at the 1, 0 solution. The MSBE objective causes function-approximation resources to be expended trying to reduce the Bellman error associated with A1 and A2, whereas the MSPBE objective takes into account that their approximated values will ultimately be projected onto the same value function.

Finally, we close this discussion of objective functions by giving the function used to derive the original GTD algorithm. This objective function does not seem to have a ready geometric interpretation. Here we call it the norm of the expected TD update:

$$\mathrm{NEU}(\theta) = \mathbb{E}[\delta\phi]^\top\,\mathbb{E}[\delta\phi]. \qquad (6)$$
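To make the relationships among these objectives concrete, here is a small matrix-form sketch (Python/NumPy; an illustration, not code from the paper) that computes the MSBE, MSPBE, and NEU for a small chain, given its transition matrix, expected rewards, state weighting, and feature matrix. The function name and argument layout are assumptions made for illustration.

```python
import numpy as np

def objectives(theta, Phi, P, R, d, gamma):
    """Matrix-form MSBE (3), MSPBE (5), and NEU (6) for a small MDP.

    Phi: N x n feature matrix (rows are phi_s); P: N x N transition matrix;
    R: length-N expected rewards; d: length-N state weighting (diagonal of D).
    """
    D = np.diag(d)
    V = Phi @ theta                                          # V_theta
    TV = R + gamma * P @ V                                   # Bellman operator T V_theta
    Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)   # projection Pi
    msbe = (V - TV) @ D @ (V - TV)                           # ||V_theta - T V_theta||^2_D
    diff = V - Pi @ TV
    mspbe = diff @ D @ diff                                  # ||V_theta - Pi T V_theta||^2_D
    e_delta_phi = Phi.T @ D @ (TV - V)                       # E[delta * phi]
    neu = e_delta_phi @ e_delta_phi                          # NEU
    return msbe, mspbe, neu
```

One can check numerically that `mspbe` equals `e_delta_phi @ np.linalg.solve(Phi.T @ D @ Phi, e_delta_phi)`, the expectation form derived in Section 4.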
4. Derivation of the new algorithms

In this section we derive two new algorithms as stochastic gradient descent in the projected Bellman error objective (5). We first establish some relationships between the relevant expectations and vector-matrix quantities:

$$\mathbb{E}[\phi\phi^\top] = \sum_s d_s\,\phi_s\phi_s^\top = \Phi^\top D\,\Phi,$$

$$\mathbb{E}[\delta\phi] = \sum_s d_s\,\phi_s\Big(R_s + \gamma\sum_{s'}P_{ss'}V_\theta(s') - V_\theta(s)\Big) = \Phi^\top D\,(T V_\theta - V_\theta),$$

and

$$\Pi^\top D\,\Pi = \big(\Phi(\Phi^\top D\Phi)^{-1}\Phi^\top D\big)^\top D\,\big(\Phi(\Phi^\top D\Phi)^{-1}\Phi^\top D\big) = D\Phi(\Phi^\top D\Phi)^{-1}\Phi^\top D\Phi(\Phi^\top D\Phi)^{-1}\Phi^\top D = D\Phi(\Phi^\top D\Phi)^{-1}\Phi^\top D.$$

Using these relationships, the projected objective can be written in terms of expectations as

$$\mathrm{MSPBE}(\theta) = \|V_\theta - \Pi T V_\theta\|^2_D = \|\Pi(V_\theta - T V_\theta)\|^2_D \;\;\text{(using $\Pi V_\theta = V_\theta$)}$$
$$= \big(\Pi(V_\theta - T V_\theta)\big)^\top D\,\big(\Pi(V_\theta - T V_\theta)\big) = (V_\theta - T V_\theta)^\top\,\Pi^\top D\,\Pi\,(V_\theta - T V_\theta)$$
$$= (V_\theta - T V_\theta)^\top D\Phi(\Phi^\top D\Phi)^{-1}\Phi^\top D\,(V_\theta - T V_\theta)$$
$$= \big(\Phi^\top D(T V_\theta - V_\theta)\big)^\top(\Phi^\top D\Phi)^{-1}\,\Phi^\top D(T V_\theta - V_\theta)$$
$$= \mathbb{E}[\delta\phi]^\top\,\mathbb{E}[\phi\phi^\top]^{-1}\,\mathbb{E}[\delta\phi].$$

From this form, it is clear that the MSPBE differs from the NEU (6) only by the inclusion of the inverse of the feature-covariance matrix. As in prior work (Sutton, Szepesvári & Maei 2009) we use a second modifiable parameter $w \in \mathbb{R}^n$ to form a quasi-stationary estimate of all but one of the expectations in the gradient of the objective function, thereby avoiding the need for two independent samples. Here we use a conventional linear predictor which causes $w$ to approximate

$$w \approx \mathbb{E}[\phi\phi^\top]^{-1}\,\mathbb{E}[\delta\phi]. \qquad (7)$$

Using this, we can write the negative gradient of the MSPBE objective function as

$$-\tfrac{1}{2}\nabla_\theta\,\mathrm{MSPBE}(\theta) = \mathbb{E}\big[(\phi - \gamma\phi')\phi^\top\big]\,\mathbb{E}[\phi\phi^\top]^{-1}\,\mathbb{E}[\delta\phi] \;\approx\; \mathbb{E}\big[(\phi - \gamma\phi')\phi^\top\big]\,w,$$

which can be directly sampled. The resultant $O(n)$ algorithm, which we call GTD2, is

$$\theta_{k+1} = \theta_k + \alpha_k\,(\phi_k - \gamma\phi'_k)(\phi_k^\top w_k), \qquad (8)$$

where $w_k$ is updated by

$$w_{k+1} = w_k + \beta_k\,(\delta_k - \phi_k^\top w_k)\,\phi_k. \qquad (9)$$
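A minimal sketch of the GTD2 updates (8)-(9) in Python/NumPy (illustrative only; the transition-stream interface and the default constant step sizes are assumptions, not part of the paper):

```python
import numpy as np

def gtd2(transitions, n, alpha=0.05, beta=0.1, gamma=0.9):
    """GTD2: theta update (8) plus the secondary linear predictor w, update (9)."""
    theta = np.zeros(n)
    w = np.zeros(n)
    for phi, reward, phi_next in transitions:
        delta = reward + gamma * np.dot(theta, phi_next) - np.dot(theta, phi)  # TD error
        theta += alpha * (phi - gamma * phi_next) * np.dot(phi, w)             # update (8)
        w += beta * (delta - np.dot(phi, w)) * phi                             # update (9)
    return theta, w
```

Both updates are O(n) per transition, matching the complexity claim in the text.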
The derivation of our second new algorithm starts from the same expression for the gradient and then takes a slightly different route:

$$-\tfrac{1}{2}\nabla_\theta\,\mathrm{MSPBE}(\theta) = \mathbb{E}\big[(\phi - \gamma\phi')\phi^\top\big]\,\mathbb{E}[\phi\phi^\top]^{-1}\,\mathbb{E}[\delta\phi]$$
$$= \big(\mathbb{E}[\phi\phi^\top] - \gamma\,\mathbb{E}[\phi'\phi^\top]\big)\,\mathbb{E}[\phi\phi^\top]^{-1}\,\mathbb{E}[\delta\phi]$$
$$= \mathbb{E}[\delta\phi] - \gamma\,\mathbb{E}[\phi'\phi^\top]\,\mathbb{E}[\phi\phi^\top]^{-1}\,\mathbb{E}[\delta\phi]$$
$$\approx \mathbb{E}[\delta\phi] - \gamma\,\mathbb{E}[\phi'\phi^\top]\,w,$$

which is then sampled, resulting in the following $O(n)$ algorithm, which we call TD with gradient correction, or TDC for short:

$$\theta_{k+1} = \theta_k + \alpha_k\,\delta_k\,\phi_k - \alpha_k\,\gamma\,\phi'_k(\phi_k^\top w_k), \qquad (10)$$

where $w_k$ is generated by (9) as in GTD2. Note that the update to $\theta_k$ is the sum of two terms, and that the first term is exactly the same as the update (2) of conventional linear TD. The second term is essentially an adjustment or correction of the TD update so that it follows the gradient of the MSPBE objective function. If the second parameter vector is initialized to $w_0 = 0$, and $\beta_k$ is small, then this algorithm will start out making almost the same updates as conventional linear TD. Note also that after the convergence of $\theta_k$, $w_k$ will converge to zero again.
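Correspondingly, a minimal TDC sketch (Python/NumPy; illustrative only, with an assumed transition-stream interface), showing that the first term of (10) is the conventional TD update and the second is the gradient correction:

```python
import numpy as np

def tdc(transitions, n, alpha=0.05, beta=0.1, gamma=0.9):
    """TDC: conventional TD update plus a correction term, update (10);
    w follows the same update (9) as in GTD2."""
    theta = np.zeros(n)
    w = np.zeros(n)   # with w_0 = 0, early updates are nearly those of plain TD
    for phi, reward, phi_next in transitions:
        delta = reward + gamma * np.dot(theta, phi_next) - np.dot(theta, phi)
        theta += alpha * (delta * phi - gamma * phi_next * np.dot(phi, w))  # update (10)
        w += beta * (delta - np.dot(phi, w)) * phi                          # update (9)
    return theta, w
```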
5. Proof of convergence of GTD2

The purpose of this section is to establish that the GTD2 algorithm converges with probability one to the TD fixpoint (4) under standard assumptions.

Theorem 1 (Convergence of GTD2). Consider the GTD2 iterations (8) and (9) with step-size sequences $\alpha_k$ and $\beta_k$ satisfying $\beta_k = \eta\alpha_k$, $\eta > 0$, $\alpha_k, \beta_k \in (0, 1]$, $\sum_{k=0}^{\infty}\alpha_k = \infty$, $\sum_{k=0}^{\infty}\alpha_k^2 < \infty$. Further assume that $(\phi_k, r_k, \phi'_k)$ is an i.i.d. sequence with uniformly bounded second moments. Let $A = \mathbb{E}\big[\phi_k(\phi_k - \gamma\phi'_k)^\top\big]$, $b = \mathbb{E}[r_k\phi_k]$, and $C = \mathbb{E}[\phi_k\phi_k^\top]$. Assume that $A$ and $C$ are non-singular. Then the parameter vector $\theta_k$ converges with probability one to the TD fixpoint (4).

Proof. The proof is very similar to that given by Sutton, Szepesvári and Maei (2009) for GTD, and we refer the reader to that reference for further details. It is shown there that the TD fixpoint can be written as the condition

$$-A\theta + b = 0. \qquad (11)$$

First, we rewrite the algorithm's two iterations as a single iteration in a combined parameter vector with $2n$ components, $\rho_k^\top = (d_k^\top, \theta_k^\top)$, where $d_k = w_k/\sqrt{\eta}$, and a new reward-related vector with $2n$ components, $g_{k+1}^\top = (r_k\phi_k^\top, 0^\top)$, as follows:

$$\rho_{k+1} = \rho_k + \alpha_k\sqrt{\eta}\,\big(G_{k+1}\rho_k + g_{k+1}\big), \quad \text{where} \quad G_{k+1} = \begin{pmatrix} -\sqrt{\eta}\,\phi_k\phi_k^\top & -\phi_k(\phi_k - \gamma\phi'_k)^\top \\ (\phi_k - \gamma\phi'_k)\phi_k^\top & 0 \end{pmatrix}.$$

Let $G = \mathbb{E}[G_k]$ and $g = \mathbb{E}[g_k]$. Note that $G$ and $g$ are well-defined because, by assumption, the process $\{\phi_k, r_k, \phi'_k\}_k$ is i.i.d. In particular,

$$G = \begin{pmatrix} -\sqrt{\eta}\,C & -A \\ A^\top & 0 \end{pmatrix}, \qquad g = \begin{pmatrix} b \\ 0 \end{pmatrix}.$$

Further, note that (11) follows from $G\rho + g = 0$, where $\rho^\top = (d^\top, \theta^\top)$.

Now we apply the ordinary differential equation (ODE) approach and Theorem 2.2 of Borkar & Meyn (2000). For this purpose we write

$$\rho_{k+1} = \rho_k + \alpha'_k\big(G\rho_k + g + (G_{k+1} - G)\rho_k + (g_{k+1} - g)\big) = \rho_k + \alpha'_k\big(h(\rho_k) + M_{k+1}\big),$$

where $\alpha'_k = \alpha_k\sqrt{\eta}$, $h(\rho) = g + G\rho$, and $M_{k+1} = (G_{k+1} - G)\rho_k + g_{k+1} - g$. Let $\mathcal{F}_k = \sigma(\rho_1, M_1, \ldots, \rho_{k-1}, M_k)$ be the sigma fields generated by the quantities $\rho_i, M_i$, $i \le k$, $k \ge 1$. Theorem 2.2 requires the verification of the following conditions: (i) the function $h$ is Lipschitz and $h_\infty(\rho) = \lim_{r\to\infty} h(r\rho)/r$ is well-defined for every $\rho \in \mathbb{R}^{2n}$; (ii-a) the sequence $(M_k, \mathcal{F}_k)$ is a martingale difference sequence, and (ii-b) for some $c_0 > 0$, $\mathbb{E}\big[\|M_{k+1}\|^2 \mid \mathcal{F}_k\big] \le c_0\big(1 + \|\rho_k\|^2\big)$ holds for any initial parameter vector $\rho_1$; (iii) the sequence $\alpha'_k$ satisfies $0 < \alpha'_k \le 1$, $\sum_{k=1}^{\infty}\alpha'_k = \infty$, $\sum_{k=1}^{\infty}(\alpha'_k)^2 < +\infty$; (iv) the ODE $\dot\rho = h_\infty(\rho)$ has the origin as a globally asymptotically stable equilibrium; and (v) the ODE $\dot\rho = h(\rho)$ has a unique globally asymptotically stable equilibrium.

Clearly, $h(\rho)$ is Lipschitz with coefficient $\|G\|$ and $h_\infty(\rho) = G\rho$. By construction, $(M_k, \mathcal{F}_k)$ satisfies $\mathbb{E}[M_{k+1} \mid \mathcal{F}_k] = 0$ and $M_k$ is $\mathcal{F}_k$-measurable, i.e., it is a martingale difference sequence. Condition (ii-b) can be shown to hold by a simple application of the triangle inequality and the boundedness of the second moments of $(\phi_k, r_k, \phi'_k)$. Condition (iii) is satisfied by our conditions on the step-size sequences $\alpha_k$, $\beta_k$.

For the last two conditions, we begin by showing that the real parts of all the eigenvalues of $G$ are negative. First, we show that $G$ is non-singular. Using the determinant rule for partitioned matrices, we get $\det(G) = \det(A^\top C^{-1}A) = (\det C)^{-1}(\det A)^2 \ne 0$. This indicates that all the eigenvalues of $G$ are non-zero. Now, let $\lambda \in \mathbb{C}$, $\lambda \ne 0$, be an eigenvalue of $G$ with corresponding normalized eigenvector $x \in \mathbb{C}^{2n}$; that is, $\|x\|^2 = x^*x = 1$, where $x^*$ is the complex conjugate (transpose) of $x$. Hence $x^*Gx = \lambda$. Let $x^\top = (x_1^\top, x_2^\top)$, where $x_1, x_2 \in \mathbb{C}^n$. Using the definition of $G$,

$$\lambda = x^*Gx = -\sqrt{\eta}\,\|x_1\|_C^2 - x_1^*Ax_2 + x_2^*A^\top x_1,$$

where $\|x_1\|_C^2 = x_1^*Cx_1$. Because $A$ is real, $A^* = A^\top$, and it follows that $(x_1^*Ax_2)^* = x_2^*A^\top x_1$. Thus, $\mathrm{Re}(\lambda) = \mathrm{Re}(x^*Gx) = -\sqrt{\eta}\,\|x_1\|_C^2 \le 0$. We are now done if we show that $x_1$ cannot be zero. If $x_1 = 0$, then from $\lambda = x^*Gx$ we get that $\lambda = 0$, which contradicts $\lambda \ne 0$. Thus, (iv) is satisfied. Finally, for the ODE $\dot\rho = h(\rho)$, note that $\rho_\infty = -G^{-1}g$ is the unique asymptotically stable equilibrium, with $\bar V(\rho) = (G\rho + g)^\top(G\rho + g)/2$ as its associated strict Liapunov function. The claim now follows.
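As a quick numerical illustration of the key step in this proof (not part of the paper), the sketch below builds $G$ for a randomly generated non-singular $A$ and positive-definite $C$ and checks that the real parts of its eigenvalues are non-positive; the particular construction of $A$, $C$, and $\eta$ is an arbitrary assumption for testing.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta = 5, 2.0
A = rng.standard_normal((n, n))      # non-singular with probability one
M = rng.standard_normal((n, n))
C = M @ M.T + n * np.eye(n)          # symmetric positive definite

# G as in the proof of Theorem 1
G = np.block([[-np.sqrt(eta) * C, -A],
              [A.T, np.zeros((n, n))]])

eigs = np.linalg.eigvals(G)
print(np.max(eigs.real))             # expected to be negative, as argued above
```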
6. Proof of Convergence of TDC

Theorem 2 (Convergence of TD with Gradient Correction). Consider the iterations (10) and (9) of the TD with gradient correction algorithm. Let the step-size sequences $\alpha_k$ and $\beta_k$, $k \ge 0$, satisfy in this case $\alpha_k, \beta_k > 0$ for all $k$, $\sum_{k=0}^{\infty}\alpha_k = \sum_{k=0}^{\infty}\beta_k = \infty$, $\sum_{k=0}^{\infty}\alpha_k^2,\ \sum_{k=0}^{\infty}\beta_k^2 < \infty$, and $\alpha_k/\beta_k \to 0$ as $k \to \infty$. Further assume that $(\phi_k, r_k, \phi'_k)$ is an i.i.d. sequence with uniformly bounded second moments. Let $A = \mathbb{E}\big[\phi_k(\phi_k - \gamma\phi'_k)^\top\big]$, $b = \mathbb{E}[r_k\phi_k]$, and $C = \mathbb{E}[\phi_k\phi_k^\top]$. Assume that $A$ and $C$ are non-singular matrices. Then the parameter vector $\theta_k$ converges with probability one to the TD fixpoint (4).

Proof. The proof of this theorem relies on a two-timescale difference in the step-size schedules $\{\alpha_k\}$ and $\{\beta_k\}$; see Borkar (1997) for a convergence analysis of general two-timescale stochastic approximation recursions. The recursions (9) and (10) correspond to the faster and slower recursions respectively. This is because beyond some integer $N_0 > 0$ (i.e., for $k \ge N_0$), the increments in (9) are uniformly larger than those in (10) and hence converge faster. Along the faster timescale, i.e., the one corresponding to $\{\beta_k\}$ (see Borkar (1997) for a detailed description of faster and slower timescales), the associated system of ODEs corresponds to

$$\dot\theta(t) = 0, \qquad (12)$$
$$\dot w(t) = \mathbb{E}\big[\delta_t\phi_t \mid \theta(t)\big] - C\,w(t). \qquad (13)$$

The ODE (12) suggests that $\theta(t) \equiv \theta$ (i.e., a time-invariant parameter) when viewed from the faster timescale. Indeed, recursion (10) can be rewritten as

$$\theta_{k+1} = \theta_k + \beta_k\,\xi(k), \quad \text{where, from (10),} \quad \xi(k) = \frac{\alpha_k}{\beta_k}\big(\delta_k\phi_k - \gamma\phi'_k\phi_k^\top w_k\big) \to 0$$

almost surely as $k \to \infty$, because $\alpha_k/\beta_k \to 0$ as $k \to \infty$. By the Hirsch lemma (Theorem 1, p. 339 of Hirsch 1989), it follows that $\|\theta_k - \theta\| \to 0$ almost surely as $k \to \infty$ for some $\theta$ that depends on the initial condition $\theta_0$ of recursion (10).

Consider now the recursion (9). Let

$$M_{k+1} = \big(\delta_k\phi_k - \phi_k\phi_k^\top w_k\big) - \mathbb{E}\big[\delta_k\phi_k - \phi_k\phi_k^\top w_k \mid \mathcal{F}(k)\big],$$

where $\mathcal{F}(k) = \sigma(w_l, \theta_l,\ l \le k;\ \phi_s, \phi'_s, r_s,\ s < k)$, $k \ge 1$, are the sigma fields generated by $w_0, \theta_0, w_{l+1}, \theta_{l+1}, \phi_l, \phi'_l$, $0 \le l < k$. It is easy to verify that $M_{k+1}$, $k \ge 0$, are integrable random variables that satisfy $\mathbb{E}[M_{k+1} \mid \mathcal{F}(k)] = 0$ for all $k \ge 0$. Also, because $r_k$, $\phi_k$ and $\phi'_k$ have uniformly bounded second moments, it can be seen that

$$\mathbb{E}\big[\|M_{k+1}\|^2 \mid \mathcal{F}(k)\big] \le c_1\big(1 + \|w_k\|^2 + \|\theta_k\|^2\big),$$

for all $k \ge 0$, for some constant $c_1 > 0$. Now consider the ODE pair (12)-(13). Because $\theta(t) \equiv \theta$ from (12), the ODE (13) can be written as

$$\dot w(t) = \mathbb{E}\big[\delta_t\phi_t \mid \theta\big] - C\,w(t). \qquad (14)$$
The assumptions (A1)-(A2) of Borkar & Meyn 2000 are once again verified and the claim follows. 7. Empirical Results To begin to assess the practical utility of the new algorithms, we compared their empirical learning rate to that of GTD and conventional TD on four small problems--three random-walk problems and a Boyan-chain problem--and a larger Computer Go problem. All of these problems were episodic, undiscounted, and involved only on-policy training with a fixed policy. The random-walk problems were all based on the standard Markov chain (Sutton 1988; Sutton & Barto 1998) with a linear arrangement of five states plus two absorbing terminal states at each end. Episodes began in the center state of the five, then transitioned randomly with equal probability to a neighboring state until a terminal state was reached. The rewards were zero everywhere except on transition into the right terminal state, upon which the reward was +1. We used three versions of this problem, differing only in their feature representations. The first representation, which we call tabular features, was the familiar table-lookup case in which, for example, the second state was represented by the vector 2 = (0, 1, 0, 0, 0) . The second representation, which we call inverted features, was chosen to cause extensive inappropriate generalization between states; it represented the second state by 2 = ( 1 , 0, 1 , 1 , 1 ) (the value 2 2 2 2 1 2 was chosen to give the feature vectors unit norm). The third representation, which we called dependent features, used only n = 3 features and was not sufficient to solve the problem exactly. The feature vectors for the five states, 1 1 left to right, were 1 = (1, 0, 0) , 2 = ( 2 , 2 , 0) , 1 1 1 1 1 3 = ( 3 , 3 , 3 ) , 4 = (0, 2 , 2 ) , and 5 = (0, 0, 1) . The Boyan-chain problem is a standard episodic task for comparing TD-style algorithms with linear function approximation (see Boyan 2002 for details). We used the version with 14 states and n = 4 features. We applied GTD, GTD2, TDC, and TD to these problems with a range of constant values for their step-size parameters. The parameter was varied over a wide range of values, in powers of 2. For the GTD, GTD2, and TDC algorithms, the ratio = / took values from the set { 1 , 1 , 1, 2} for the random-walk problems; one lower 4 2 power of two was added for the Boyan-chain problem. The initial parameter vectors, 0 and w0 , were set to 0 for all algorithms. Each algorithm and parameter setting was run for 100-500 | G(k) c2 (1+ k 2 ), k 0, for some constant c2 > 0, again because rk , k and k have uniformly bounded second moments. Consider now the following ODE associated with (10): (t) = (I - E T C -1 )E[ | (t)] . (16) ¯ Let h((t)) be the driving vector field of the ODE (16). Note that ¯ h((t)) = = = (I - E T C -1 )E[ | (t)] (C - E T )C -1 E[ | (t)] (E T - E T )C -1 E[ | (t)] = AT C -1 (-A(t) + b), because E[ | (t)] = -A(t) + b. Now = A-1 b can be seen to be the unique globally asymptotically stable equilibrium for (16). Let ¯ h(r) ¯ ¯ . Then h () = -AT C -1 A. Conh () = lim r r sider now the ODE (t) = -AT C -1 A(t). (17) Fast gradient-descent methods for temporal-difference learning with linear function approximation Random Walk - Tabular features .12 .20 GTD GTD2 TD TDC GTD GTD2 TD TDC Random Walk - Inverted features RMSPBE .15 .10 .05 .00 .03 .06 .12 .25 0.5 TD RMSPBE .09 .06 .03 .0 .03 .06 GTD GTD2 TDC TDC GTD GTD2 TD .12 .25 0.5 0 100 200 0 250 500 ! episodes ! 
We applied GTD, GTD2, TDC, and TD to these problems with a range of constant values for their step-size parameters. The parameter $\alpha$ was varied over a wide range of values, in powers of 2. For the GTD, GTD2, and TDC algorithms, the ratio $\eta = \beta/\alpha$ took values from the set $\{\tfrac{1}{4}, \tfrac{1}{2}, 1, 2\}$ for the random-walk problems; one lower power of two was added for the Boyan-chain problem. The initial parameter vectors, $\theta_0$ and $w_0$, were set to 0 for all algorithms. Each algorithm and parameter setting was run for 100-500 episodes depending on the problem, with the square root of the MSPBE, MSBE, NEU, and MSE (see Section 4) computed after each episode, then averaged over 100 independent runs.

Figure 3. Empirical results on the four small problems--three versions of the 5-state random walk plus the 14-state Boyan chain. (The four panels are: Random Walk - Tabular features, Random Walk - Inverted features, Random Walk - Dependent features, and Boyan Chain; the vertical axes show RMSPBE.) In each of the four panels, the right subpanel shows a learning curve at best parameter values (plotted against episodes), and the left subpanel shows a parameter study plotting the average height of the learning curve for each algorithm, for various $\eta = \beta/\alpha$, as a function of $\alpha$.

Figure 3 summarizes all the results on the small problems using the MSPBE as the performance measure. The results for the other objective functions were similar in all cases and produced the same rankings. The standard errors are all on the order of $10^{-4}$ or less, so are not shown. All the algorithms were similar in terms of their dependence on and sensitivity to the step sizes. Overall, GTD learned the slowest, followed after a significant margin by GTD2, followed by TDC and TD. TDC and TD performed similarly, with the extra flexibility provided by $\beta$ sometimes allowing TDC to perform slightly better.

To get a measure of how well the new algorithms perform on a larger problem, we applied them to learning an evaluation function for 9x9 Computer Go. We used a version of RLGO (Silver et al. 2007) modified to use a purely linear evaluation function. This system used 969,894 binary features corresponding to all possible shapes in every 3x3, 2x2, and 1x1 region of the board. Using weight sharing to take advantage of location-independent and location-dependent symmetries, the million features are reduced to a parameter vector of $n = 63{,}303$ components. With this large a parameter vector, $O(n^2)$ methods are not feasible. To make the problem as straightforward as possible, we sought to learn the value function for a fixed policy, in this case for the policy that chose randomly among the legal moves. Experience was generated by self-play, with all rewards zero except upon winning the game, when the reward was 1.

Figure 4. Residual error in learning an evaluation function for 9x9 Computer Go, as a function of algorithm and step size, in terms of the square root of the norm of the expected TD update (6) (RNEU, vertical axis). The $\eta$ values used were $\tfrac{1}{16}$, $\tfrac{1}{8}$, $\tfrac{1}{4}$, $\tfrac{1}{2}$, 1, and 2.

We applied all four algorithms to this problem with a range of step sizes. In each run, $\theta$ was initialized to random values uniformly distributed in $[-0.1, 0.1]$. The secondary parameter, $w$, was initialized to 0. Training then proceeded for 1000 complete games, after which $\theta$ was frozen and another 1000 games were run to compute an estimate of an objective function. The objective functions cannot be exactly computed here because of the size of the problem. The NEU objective is the most straightforward to estimate, simply by averaging the value of $\delta\phi$ over the 1000 test games and then taking the norm of the resultant vector.
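A sketch of this estimator (Python/NumPy; illustrative only, with an assumed iterable of test-game transitions and undiscounted returns as in these experiments):

```python
import numpy as np

def estimate_rneu(test_transitions, theta, n, gamma=1.0):
    """Estimate the square root of NEU (6): average delta*phi over test data,
    then take the norm of the resulting vector."""
    total = np.zeros(n)
    count = 0
    for phi, reward, phi_next in test_transitions:
        delta = reward + gamma * np.dot(theta, phi_next) - np.dot(theta, phi)
        total += delta * phi
        count += 1
    return np.linalg.norm(total / count)
```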
It is this performance measure that we recorded and averaged over 40 runs to produce the data shown in Figure 4. The results are remarkably consistent with what we saw in the small problems. The GTD algorithm was the slowest, followed by GTD2, TDC, and TD, though the differences between the last three are probably not significant given the coarseness of the parameter sampling and the standard errors, which were about 0.05 in this experiment (they are omitted from the graph to reduce clutter).

Finally, Figure 5 shows results for an off-policy learning problem, demonstrating that the gradient methods converge on a well-known counterexample (Baird 1995) for which TD diverges.

Figure 5. Learning curves (RMSPBE versus sweeps) on Baird's off-policy counterexample: TD diverges, whereas the gradient methods converge. This is the 7-state version of the "star" counterexample (Baird 1995), for which divergence is monotonic. Updating was done synchronously in dynamic-programming-like sweeps through the state space. For TD, $\alpha = 0.1$. For the gradient algorithms, $\alpha = 0.05$ and $\eta = 10$. The initial parameter value was $\theta_0 = (1, 1, 1, 1, 1, 1, 10, 1)^\top$, and $\gamma = 0.99$.

8. Conclusion

We have introduced two new gradient-based temporal-difference learning algorithms that minimize a natural performance measure, the projected Bellman error, and proved them convergent with linear function approximation in a general setting that includes both on-policy and off-policy learning. Both algorithms have time and memory complexity that is linear in the number of features used in the function approximation, and both are significantly faster than GTD, the only previously proposed algorithm with these properties. Moreover, the TD with gradient correction algorithm appears to be comparable in speed to conventional linear TD on on-policy problems. This is the first time that all these desirable features--linear complexity, speed, and convergence with off-policy learning and function approximation--have been achieved in one algorithm.

References

Antos, A., Szepesvári, Cs., Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning 71:89-129.

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning, pp. 30-37.

Baird, L. C. (1999). Reinforcement Learning Through Gradient Descent. PhD thesis, Carnegie Mellon University.

Barnard, E. (1993). Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man, and Cybernetics 23(2):357-365.

Borkar, V. S. (1997). Stochastic approximation with two timescales. Systems and Control Letters 29:291-294.

Borkar, V. S., Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38(2):447-469.

Boyan, J. (2002). Technical update: Least-squares temporal difference learning. Machine Learning 49:233-246.

Bradtke, S., Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning 22:33-57.

Dayan, P. (1992). The convergence of TD(lambda) for general lambda. Machine Learning 8:341-362.

Geramifard, A., Bowling, M., Sutton, R. S. (2006). Incremental least-squares temporal difference learning. Proceedings of AAAI, pp. 356-361.

Hirsch, M. W. (1989). Convergent activation dynamics in continuous time networks. Neural Networks 2:331-349.

Precup, D., Sutton, R. S., Dasgupta, S. (2001).
Off-policy temporal-difference learning with function approximation. Proceedings of the 18th International Conference on Machine Learning, pp. 417-424.

Precup, D., Sutton, R. S., Paduraru, C., Koop, A., Singh, S. (2006). Off-policy learning with recognizers. Advances in Neural Information Processing Systems 18.

Silver, D., Sutton, R. S., Müller, M. (2007). Reinforcement learning of local shape in the game of Go. Proceedings of the 20th IJCAI, pp. 1053-1058.

Sturtevant, N. R., White, A. M. (2006). Feature construction for reinforcement learning in hearts. In Proceedings of the 5th International Conference on Computers and Games.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3:9-44.

Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112:181-211.

Sutton, R. S., Szepesvári, Cs., Maei, H. R. (2009). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in Neural Information Processing Systems 21.