Stable Dual Dynamic Programming

Tao Wang    Daniel Lizotte    Michael Bowling    Dale Schuurmans
Department of Computing Science, University of Alberta
{trysi,dlizotte,bowling,dale}@cs.ualberta.ca

Abstract

Recently, we have introduced a novel approach to dynamic programming and reinforcement learning that is based on maintaining explicit representations of stationary distributions instead of value functions. In this paper, we investigate the convergence properties of these dual algorithms both theoretically and empirically, and show how they can be scaled up by incorporating function approximation.

1 Introduction

Value function representations are dominant in algorithms for dynamic programming (DP) and reinforcement learning (RL). However, linear programming (LP) methods clearly demonstrate that the value function is not a necessary concept for solving sequential decision making problems. In LP methods, value functions only correspond to the primal formulation of the problem, while in the dual they are replaced by the notion of state (or state-action) visit distributions [1, 2, 3]. Despite the well known LP duality, dual representations have not been widely explored in DP and RL. Recently, we showed that it is entirely possible to solve DP and RL problems in the dual representation [4]. Unfortunately, [4] did not analyze the convergence properties nor implement the proposed ideas. In this paper, we investigate the convergence properties of these newly proposed dual solution techniques, and show how they can be scaled up by incorporating function approximation.

The proof techniques we use to analyze convergence are simple, but lead to striking conclusions. In particular, we find that the standard convergence results for value function based approaches also apply to the dual case, even in the presence of function approximation and off-policy updating. The dual approach appears to hold a significant advantage over the standard primal view of DP/RL in one major sense: since the fundamental objects being represented are normalized probability distributions (i.e., they belong to a bounded simplex), dual updates cannot diverge. In particular, we find that dual updates in fact converge (i.e., avoid oscillation) in the very circumstance where primal updates can and often do diverge: gradient-based off-policy updates with linear function approximation [5, 6].

2 Preliminaries

We consider the problem of computing an optimal behavior strategy in a Markov decision process (MDP), defined by a set of actions $A$, a set of states $S$, an $|S||A| \times |S|$ transition matrix $P$, a reward vector $r$, and a discount factor $\gamma$, where we assume the goal is to maximize the infinite horizon discounted reward $r^{(0)} + \gamma r^{(1)} + \gamma^2 r^{(2)} + \cdots = \sum_{t=0}^{\infty} \gamma^t r^{(t)}$. It is known that an optimal behavior strategy can always be expressed by a stationary policy $\pi$, whose entries $\pi_{(sa)}$ specify the probability of taking action $a$ in state $s$. Below, we represent a policy by an equivalent representation as an $|S| \times |S||A|$ matrix $\Pi$, where $\Pi_{(s,s'a)} = \pi_{(sa)}$ if $s' = s$, and $0$ otherwise. One can quickly verify that the matrix product $\Pi P$ gives the state-to-state transition probabilities induced by the policy $\pi$ in the environment $P$, and that $P\Pi$ gives the state-action to state-action transition probabilities induced by policy $\pi$ in $P$.
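As a concrete illustration of these conventions, here is a minimal NumPy sketch (not from the paper; the row ordering $(sa) \mapsto s\,|A| + a$ is an assumption made only for the example) that builds $\Pi$ from a policy $\pi$ and checks that $\Pi P$ and $P\Pi$ are the induced row-stochastic transition matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 3, 2                                    # tiny example sizes

# Transition matrix P: |S||A| x |S|, rows indexed by (s, a), row-stochastic.
P = rng.random((S * A, S))
P /= P.sum(axis=1, keepdims=True)

# A stationary policy pi: pi[s, a] = probability of taking action a in state s.
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)

# Pi: |S| x |S||A| with Pi[s, (s', a)] = pi[s, a] if s' == s, else 0.
Pi = np.zeros((S, S * A))
for s in range(S):
    Pi[s, s * A:(s + 1) * A] = pi[s]

# Pi @ P: state-to-state transitions under pi; P @ Pi: state-action to
# state-action transitions under pi.  Both should be row-stochastic.
assert np.allclose((Pi @ P).sum(axis=1), 1.0)
assert np.allclose((P @ Pi).sum(axis=1), 1.0)
```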
The problem is to compute an optimal policy given either (a) a complete specification of the environmental variables $P$ and $r$ (the "planning problem"), or (b) limited access to the environment through observed states and rewards and the ability to select actions to cause further state transitions (the "learning problem"). The first problem is normally tackled by LP or DP methods, and the second by RL methods. In this paper, we restrict our attention to scenario (a).

3 Dual Representations

Traditionally, DP methods for solving the MDP planning problem are expressed in terms of the primal value function. However, [4] demonstrated that all the classical algorithms have natural duals expressed in terms of state and state-action probability distributions.

In the primal representation, the state-action value function of a policy $\pi$ can be specified by an $|S||A| \times 1$ vector $q = \sum_{i=0}^{\infty} \gamma^i (P\Pi)^i r$, which satisfies $q = r + \gamma P\Pi q$. To develop a dual form of state-action policy evaluation, one considers the linear system $d^\top = (1-\gamma)\mu^\top + \gamma\, d^\top P\Pi$, where $\mu$ is the initial distribution over state-action pairs. Not only is $d$ a proper probability distribution over state-action pairs, it also allows one to easily compute the expected discounted return of the policy $\pi$. However, recovering the state-action distribution $d$ is inadequate for policy improvement. Therefore, one considers the $|S||A| \times |S||A|$ matrix $H$ that satisfies the linear relation $H = (1-\gamma)I + \gamma P\Pi H$. This matrix is similar to $d$, in that each row is a probability distribution and the entries $H_{(sa,s'a')}$ correspond to the probability of discounted state-action visits to $(s'a')$ for a policy $\pi$ starting in state-action pair $(sa)$. Unlike $d$, however, $H$ drops the dependence on $\mu$, giving $(1-\gamma)q = Hr$. That is, given $H$ we can easily recover the state-action values of $\pi$.

For policy improvement, in the primal representation one can derive an improved policy $\pi'$ via the update $a^*(s) = \arg\max_a q_{(sa)}$ and $\pi'_{(sa)} = 1$ if $a = a^*(s)$, and $0$ otherwise. The dual form of the policy update, expressed in terms of the state-action matrix $H$, is $a^*(s) = \arg\max_a H_{(sa,:)}\, r$. In fact, since $(1-\gamma)q = Hr$, the two policy updates, given in the primal and dual respectively, must lead to the same resulting policy $\pi'$. Further details are given in [4].

4 DP algorithms and convergence

We first investigate whether dynamic programming operators with the dual representations exhibit the same (or better) convergence properties as their primal counterparts. These questions will be answered in the affirmative.

In the tabular case, dynamic programming algorithms can be expressed by operators that are successively applied to current approximations (vectors in the primal case, matrices in the dual), to bring them closer to a target solution; namely, the fixed point of a desired Bellman equation. Consider two standard operators, the on-policy update and the max-policy update. For a given policy $\pi$, the on-policy operator $\mathcal{O}$ is defined as
$$\mathcal{O}q = r + \gamma P\Pi q \quad\text{and}\quad \mathcal{O}H = (1-\gamma)I + \gamma P\Pi H,$$
for the primal and dual cases respectively. The goal of the on-policy update is to bring current representations closer to satisfying the policy-specific Bellman equations,
$$q = r + \gamma P\Pi q \quad\text{and}\quad H = (1-\gamma)I + \gamma P\Pi H.$$
The max-policy operator $\mathcal{M}$ is different in that it is neither linear nor defined by any reference policy, but instead applies a greedy max update to the current approximations,
$$\mathcal{M}q = r + \gamma P [q]^* \quad\text{and}\quad \mathcal{M}H = (1-\gamma)I + \gamma P [H]^{*r},$$
where $[q]^*_{(s)} = \max_a q_{(sa)}$ and $[H]^{*r}_{(s,:)} = H_{(sa^*(s),:)}$ with $a^*(s) = \arg\max_a [Hr]_{(sa)} = \arg\max_a \sum_{(s'a')} H_{(sa,s'a')} r_{(s'a')}$. The goal of this greedy update is to bring the representations closer to satisfying the optimal-policy Bellman equations
$$q = r + \gamma P [q]^* \quad\text{and}\quad H = (1-\gamma)I + \gamma P [H]^{*r}.$$
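The four tabular operators just defined can be written out directly. The following NumPy sketch (an illustration under the conventions of the earlier snippet, not the paper's code; in particular the greedy row selection for $[H]^{*r}$ follows the reading given above) spells them out.

```python
import numpy as np

def on_policy_primal(q, r, P, Pi, gamma):
    """O q = r + gamma * P Pi q."""
    return r + gamma * P @ (Pi @ q)

def on_policy_dual(H, P, Pi, gamma):
    """O H = (1 - gamma) I + gamma * P Pi H."""
    return (1 - gamma) * np.eye(H.shape[0]) + gamma * P @ (Pi @ H)

def max_policy_primal(q, r, P, gamma, S, A):
    """M q = r + gamma * P [q]^*, where [q]^*(s) = max_a q(sa)."""
    q_star = q.reshape(S, A).max(axis=1)            # |S| vector
    return r + gamma * P @ q_star

def max_policy_dual(H, r, P, gamma, S, A):
    """M H = (1 - gamma) I + gamma * P [H]^{*r}: greedy row of H per state."""
    a_star = (H @ r).reshape(S, A).argmax(axis=1)   # greedy action per state
    H_star = H[np.arange(S) * A + a_star, :]        # |S| x |S||A|
    return (1 - gamma) * np.eye(H.shape[0]) + gamma * P @ H_star
```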
4.1 On-policy convergence

For the on-policy operator $\mathcal{O}$, convergence to the Bellman fixed point is easily proved in the primal case, by establishing a contraction property of $\mathcal{O}$ with respect to a specific norm on $q$ vectors. In particular, one defines a weighted 2-norm with weights given by the stationary distribution determined by the policy $\pi$ with respect to the transition model $P$. Let $z \ge 0$ be a vector such that $z^\top P\Pi = z^\top$; that is, $z$ is the stationary state-action visit distribution for $P\Pi$. Then the norm is defined as $\|q\|_z^2 = q^\top Z q = \sum_{(sa)} z_{(sa)} q_{(sa)}^2$, where $Z = \mathrm{diag}(z)$. It can be shown that $\|P\Pi q\|_z \le \|q\|_z$ and $\|\mathcal{O}q_1 - \mathcal{O}q_2\|_z \le \gamma \|q_1 - q_2\|_z$ (see [7]). Crucially, for this norm, a state-action transition is not an expansion [7]. By the contraction map fixed point theorem [2] there exists a unique fixed point of $\mathcal{O}$ in the space of vectors $q$. Therefore, repeated applications of the on-policy operator converge to a vector $q^\pi$ such that $q^\pi = \mathcal{O}q^\pi$; that is, $q^\pi$ satisfies the policy based Bellman equation.

Analogously, for the dual representation $H$, one can establish convergence of the on-policy operator by first defining an appropriate weighted norm over matrices and then verifying that $\mathcal{O}$ is a contraction with respect to this norm. Define
$$\|H\|_{z,r}^2 = \|Hr\|_z^2 = \sum_{(sa)} z_{(sa)} \Big( \sum_{(s'a')} H_{(sa,s'a')} r_{(s'a')} \Big)^2. \qquad (1)$$
It is easily verified that this definition satisfies the properties of a pseudo-norm, and in particular satisfies the triangle inequality. This weighted 2-norm is defined with respect to the stationary distribution $z$, but also the reward vector $r$. Thus, the magnitude of a row normalized matrix is determined by the magnitude of the weighted reward expectations it induces. Interestingly, this definition allows us to establish the same non-expansion and contraction results as in the primal case. We have $\|P\Pi H\|_{z,r} \le \|H\|_{z,r}$ by arguments similar to the primal case. Moreover, the on-policy operator is a contraction with respect to $\|\cdot\|_{z,r}$.

Lemma 1  $\|\mathcal{O}H_1 - \mathcal{O}H_2\|_{z,r} \le \gamma \|H_1 - H_2\|_{z,r}$.

Proof:  $\|\mathcal{O}H_1 - \mathcal{O}H_2\|_{z,r} = \gamma \|P\Pi (H_1 - H_2)\|_{z,r} \le \gamma \|H_1 - H_2\|_{z,r}$, since $\|P\Pi H\|_{z,r} \le \|H\|_{z,r}$.

Thus, once again by the contraction map fixed point theorem there exists a fixed point of $\mathcal{O}$ among row normalized matrices $H$, and repeated applications of $\mathcal{O}$ will converge to a matrix $H^\pi$ such that $\mathcal{O}H^\pi = H^\pi$; that is, $H^\pi$ satisfies the policy based Bellman equation for dual representations. This argument shows that on-policy dynamic programming converges in the dual representation, without making direct reference to the primal case. We will use these results below.

4.2 Max-policy convergence

The strategy for establishing convergence for the nonlinear max operator is similar to the on-policy case, but involves working with a different norm. Instead of considering a 2-norm weighted by the visit probabilities induced by a fixed policy, one simply uses the max-norm in this case: $\|q\|_\infty = \max_{(sa)} |q_{(sa)}|$. The contraction property of the $\mathcal{M}$ operator with respect to this norm can then be easily established in the primal case: $\|\mathcal{M}q_1 - \mathcal{M}q_2\|_\infty \le \gamma \|q_1 - q_2\|_\infty$ (see [2]). As in the on-policy case, contraction suffices to establish the existence of a unique fixed point of $\mathcal{M}$ among vectors $q$, and that repeated application of $\mathcal{M}$ converges to this fixed point $q^*$ such that $\mathcal{M}q^* = q^*$.
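The contraction properties of Sections 4.1 and 4.2 can be illustrated numerically. The sketch below (NumPy; a self-contained toy check, not the paper's code) computes the stationary distribution $z$ by power iteration, evaluates the pseudo-norm of Equation (1), and verifies Lemma 1 together with the primal max-norm contraction on a random example.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 2, 0.9
n = S * A
P = rng.random((n, S)); P /= P.sum(axis=1, keepdims=True)
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)
Pi = np.zeros((S, n))
for s in range(S):
    Pi[s, s * A:(s + 1) * A] = pi[s]
r = rng.standard_normal(n)
PPi = P @ Pi

# Stationary state-action distribution: z^T (P Pi) = z^T, via power iteration.
z = np.ones(n) / n
for _ in range(5000):
    z = z @ PPi

def norm_zr(H):
    """Pseudo-norm of Equation (1): ||H||_{z,r} = ||H r||_z."""
    v = H @ r
    return np.sqrt(np.sum(z * v ** 2))

def O(H):
    return (1 - gamma) * np.eye(n) + gamma * PPi @ H

def M(q):
    return r + gamma * P @ q.reshape(S, A).max(axis=1)

H1 = rng.random((n, n)); H1 /= H1.sum(axis=1, keepdims=True)
H2 = rng.random((n, n)); H2 /= H2.sum(axis=1, keepdims=True)
q1, q2 = rng.standard_normal(n), rng.standard_normal(n)

# Lemma 1 (dual on-policy contraction) and the primal max-norm contraction.
assert norm_zr(O(H1) - O(H2)) <= gamma * norm_zr(H1 - H2) + 1e-10
assert np.abs(M(q1) - M(q2)).max() <= gamma * np.abs(q1 - q2).max() + 1e-10
```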
To establish convergence of the off-policy update in the dual representation, first define a max-norm for state-action visit distributions,
$$\|H\|_{\infty,r} = \max_{(sa)} \Big| \sum_{(s'a')} H_{(sa,s'a')} r_{(s'a')} \Big|. \qquad (2)$$
Then one can simply reduce the dual to the primal case by appealing to the relationship $(1-\gamma)\mathcal{M}q = (\mathcal{M}H)\, r$ to prove convergence of $\mathcal{M}H$.

Lemma 2  If $(1-\gamma)q = Hr$, then $(1-\gamma)\mathcal{M}q = (\mathcal{M}H)\, r$.

Proof:  $(1-\gamma)\mathcal{M}q = (1-\gamma)(r + \gamma P[q]^*) = (1-\gamma)r + \gamma P [Hr]^* = (\mathcal{M}H)\, r$, where the second equality holds since $(1-\gamma)q_{(sa)} = [Hr]_{(sa)}$ for all $(sa)$ by assumption.

Thus, given convergence of $\mathcal{M}q$ to a fixed point $\mathcal{M}q^* = q^*$, the same must also hold for $\mathcal{M}H$. However, one subtlety here is that the dual fixed point is not unique. This is not a contradiction, because the norm on dual representations $\|\cdot\|_{z,r}$ is in fact just a pseudo-norm, not a proper norm. That is, the relationship between $H$ and $q$ is many to one, and several matrices can correspond to the same $q$. These matrices form a convex subspace (in fact, a simplex), since if $H_1 r = (1-\gamma)q$ and $H_2 r = (1-\gamma)q$ then $(\alpha H_1 + (1-\alpha)H_2)\, r = (1-\gamma)q$ for any $\alpha$, where furthermore $\alpha$ must be restricted to $0 \le \alpha \le 1$ to maintain nonnegativity. The simplex of fixed points $\{H^* : \mathcal{M}H^* = H^*\}$ is given by the matrices $H^*$ that satisfy $H^* r = (1-\gamma)q^*$.

5 DP with function approximation

Primal and dual updates exhibit a strong equivalence in the tabular case, as they should. However, when we begin to consider approximation, differences emerge. We next consider the convergence properties of the dynamic programming operators in the context of linear basis approximation. We focus on the on-policy case here because, famously, the max operator does not always have a fixed point when combined with approximation in the primal case [8], and consequently suffers the risk of divergence [5, 6]. Note that the max operator cannot diverge in the dual case, even with basis approximation, by boundedness alone, although the question of whether max updates always converge in this case remains open. Here we establish that a bound on approximation error similar to the primal case can be proved for the dual approach with respect to the on-policy operator.

In the primal case, linear approximation proceeds by fixing a small set of basis functions, forming an $|S||A| \times k$ matrix $\Phi$, where $k$ is the number of bases. The approximation of $q$ can then be expressed by a linear combination of bases, $\hat{q} = \Phi w$, where $w$ is a $k \times 1$ vector of adjustable weights. This is equivalent to maintaining the constraint $\hat{q} \in \mathrm{col\,span}(\Phi)$. In the dual, a linear approximation to $H$ can be expressed as $\mathrm{vec}(\hat{H}) = \Psi w$, where the vec operator creates a column vector from a matrix by stacking the columns of the matrix below one another, $w$ is a $k \times 1$ vector of adjustable weights as in the primal case, and $\Psi$ is a $(|S||A|)^2 \times k$ matrix of basis functions. To ensure that $\hat{H}$ remains a nonnegative, row normalized approximation to $H$, we require $\hat{H} \in \mathrm{simplex}(\Psi)$, where
$$\mathrm{simplex}(\Psi) = \{\hat{H} : \mathrm{vec}(\hat{H}) = \Psi w,\ \Psi \ge 0,\ (\mathbf{1}^\top \otimes I)\Psi = \mathbf{1}\mathbf{1}^\top,\ w \ge 0,\ \mathbf{1}^\top w = 1\},$$
and $\otimes$ denotes the Kronecker product. In this section, we first introduce operators (a projection operator and a gradient operator) that ensure the approximations stay representable in the given bases. Then we consider their composition with the on-policy and off-policy updates, and analyze their convergence properties. For the composition of the on-policy update and the projection operator, we establish a bound on approximation error in the dual case similar to that in the primal case.
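To make the two parameterizations concrete, here is a small NumPy sketch (an illustration, not the paper's construction) in which each column of $\Psi$ is the column-stacked vec of a nonnegative, row-normalized basis matrix, so the $\mathrm{simplex}(\Psi)$ constraints hold and any convex combination of the bases yields a valid $\hat{H}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 3                      # n = |S||A|, k = number of bases

# Primal: q_hat = Phi w, with Phi an n x k matrix of arbitrary basis functions.
Phi = rng.standard_normal((n, k))
q_hat = Phi @ rng.standard_normal(k)

# Dual: vec(H_hat) = Psi w, each basis itself a row-stochastic n x n matrix.
bases = rng.random((k, n, n))
bases /= bases.sum(axis=2, keepdims=True)
Psi = np.stack([B.reshape(-1, order='F') for B in bases], axis=1)  # column-stacked vec

w = rng.random(k)
w /= w.sum()                                         # w >= 0, 1^T w = 1
H_hat = (Psi @ w).reshape(n, n, order='F')

assert np.allclose(H_hat.sum(axis=1), 1.0)           # rows sum to one
assert (H_hat >= 0).all()                            # nonnegative
```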
5.1 Projection Operator

Recall that in the primal, the action value function $q$ is approximated by a linear combination of the bases in $\Phi$. Unfortunately, there is no reason to expect $\mathcal{O}q$ or $\mathcal{M}q$ to stay in the column span of $\Phi$, so a best approximation is required. The subtlety resolved by Tsitsiklis and Van Roy [7] is to identify a particular form of best approximation--weighted least squares--that ensures convergence is still achieved when combined with the on-policy operator $\mathcal{O}$. Unfortunately, the fixed point of this combined update operator is not guaranteed to be the best representable approximation of $\mathcal{O}$'s fixed point, $q^\pi$. Nevertheless, a bound can be proved on how close this altered fixed point is to the best representable approximation. We summarize a few details that will be useful below. First, the best least squares approximation is computed with respect to the distribution $z$. The map from a general $q$ vector onto its best approximation in $\mathrm{col\,span}(\Phi)$ is defined by another operator, $\mathcal{P}$, which projects $q$ into the column span of $\Phi$:
$$\mathcal{P}q = \arg\min_{\hat{q} \in \mathrm{col\,span}(\Phi)} \|\hat{q} - q\|_z^2 = \Phi(\Phi^\top Z \Phi)^{-1}\Phi^\top Z q,$$
where $\hat{q}$ is an approximation of the value function $q$. The important property of this weighted projection is that it is a non-expansion operator in $\|\cdot\|_z$, i.e., $\|\mathcal{P}q\|_z \le \|q\|_z$, which can be easily obtained from the generalized Pythagorean theorem. Approximate dynamic programming then proceeds by composing the two operators--the on-policy update $\mathcal{O}$ with the subspace projection $\mathcal{P}$--essentially computing the best representable approximation of the one step update. This combined operator is guaranteed to converge, since composing a non-expansion with a contraction is still a contraction, and its fixed point $q^+$ satisfies $\|q^+ - q^\pi\|_z \le \frac{1}{1-\gamma}\|q^\pi - \mathcal{P}q^\pi\|_z$ [7].

Linear function approximation in the dual case is a bit more complicated, because matrices are being represented rather than vectors, and moreover the matrices need to satisfy row normalization and nonnegativity constraints. Nevertheless, a very similar approach to the primal case can be successfully applied. Recall that in the dual, the state-action visit distribution $H$ is approximated by a linear combination of the bases in $\Psi$. As in the primal case, there is no reason to expect that an update like $\mathcal{O}H$ should keep the matrix in the simplex. Therefore, a projection operator must be constructed that determines the best representable approximation to $\mathcal{O}H$. One needs to be careful to define this projection with respect to the right norm to ensure convergence. Here, the pseudo-norm $\|\cdot\|_{z,r}$ defined in Equation (1) suits this purpose. Define the weighted projection operator $\mathcal{P}$ over matrices
$$\mathcal{P}H = \arg\min_{\hat{H} \in \mathrm{simplex}(\Psi)} \|\hat{H} - H\|_{z,r}^2. \qquad (3)$$
The projection can be obtained by solving the above quadratic program. A key result is that this projection operator is a non-expansion with respect to the pseudo-norm $\|\cdot\|_{z,r}$.

Theorem 1  $\|\mathcal{P}H\|_{z,r} \le \|H\|_{z,r}$.

Proof:  The easiest way to prove the theorem is to observe that the projection operator $\mathcal{P}$ is really a composition of three orthogonal projections: first, onto the linear subspace $\mathrm{span}(\Psi)$; then onto the subspace of row normalized matrices $\mathrm{span}(\Psi) \cap \{H : H\mathbf{1} = \mathbf{1}\}$; and finally onto the space of nonnegative matrices $\mathrm{span}(\Psi) \cap \{H : H\mathbf{1} = \mathbf{1}\} \cap \{H : H \ge 0\}$. Note that the last projection onto the nonnegative halfspace is equivalent to a projection onto a linear subspace for some hyperplane tangent to the simplex. Each one of these projections is a non-expansion in $\|\cdot\|_{z,r}$ in the same way: a generalized Pythagorean theorem holds. Consider just one of these linear projections $\mathcal{P}_1$:
$$\|H\|_{z,r}^2 = \|\mathcal{P}_1 H + H - \mathcal{P}_1 H\|_{z,r}^2 = \|\mathcal{P}_1 H r + Hr - \mathcal{P}_1 H r\|_z^2 = \|\mathcal{P}_1 H r\|_z^2 + \|Hr - \mathcal{P}_1 H r\|_z^2 = \|\mathcal{P}_1 H\|_{z,r}^2 + \|H - \mathcal{P}_1 H\|_{z,r}^2 \ \ge\ \|\mathcal{P}_1 H\|_{z,r}^2.$$
Since the overall projection is just a composition of non-expansions, it must be a non-expansion.
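For illustration, here is a NumPy sketch of the primal weighted projection defined above and a numerical check of its non-expansion property (a toy example, assuming $\Phi^\top Z \Phi$ is invertible; not the paper's code). The dual projection of Equation (3) has no such closed form and would instead be computed with an off-the-shelf quadratic-programming solver.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 8, 3
Phi = rng.standard_normal((n, k))
z = rng.random(n); z /= z.sum()
Z = np.diag(z)

def project_primal(q):
    """P q = Phi (Phi^T Z Phi)^{-1} Phi^T Z q: z-weighted least-squares projection."""
    return Phi @ np.linalg.solve(Phi.T @ Z @ Phi, Phi.T @ Z @ q)

def norm_z(q):
    return np.sqrt(q @ Z @ q)

q = rng.standard_normal(n)
assert norm_z(project_primal(q)) <= norm_z(q) + 1e-12           # non-expansion in ||.||_z
assert np.allclose(project_primal(project_primal(q)), project_primal(q))  # idempotent
```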
As in the primal, approximate dynamic programming can be implemented by composing the on-policy update $\mathcal{O}$ with the projection operator $\mathcal{P}$. Since $\mathcal{O}$ is a contraction and $\mathcal{P}$ a non-expansion, $\mathcal{P}\mathcal{O}$ must also be a contraction, and it then follows that it has a fixed point. Note that, as in the tabular case, this fixed point is only unique up to $Hr$-equivalence, since the pseudo-norm $\|\cdot\|_{z,r}$ does not distinguish $H_1$ and $H_2$ such that $H_1 r = H_2 r$. Here too, the fixed point is actually a simplex of equivalent solutions. For simplicity, we denote the simplex of fixed points of $\mathcal{P}\mathcal{O}$ by some representative $H^+ = \mathcal{P}\mathcal{O}H^+$. Finally, we can recover an approximation bound that is analogous to the primal bound, bounding the error between $H^+$ and the best representable approximation to the on-policy fixed point $H^\pi = \mathcal{O}H^\pi$.

Theorem 2  $\|H^+ - H^\pi\|_{z,r} \le \frac{1}{1-\gamma}\,\|\mathcal{P}H^\pi - H^\pi\|_{z,r}$.

Proof:  First note that $\|H^+ - H^\pi\|_{z,r} = \|H^+ - \mathcal{P}H^\pi + \mathcal{P}H^\pi - H^\pi\|_{z,r} \le \|H^+ - \mathcal{P}H^\pi\|_{z,r} + \|\mathcal{P}H^\pi - H^\pi\|_{z,r}$ by the triangle inequality. Since $H^+ = \mathcal{P}\mathcal{O}H^+$ and $\mathcal{P}$ is a non-expansion operator, we have $\|H^+ - \mathcal{P}H^\pi\|_{z,r} = \|\mathcal{P}\mathcal{O}H^+ - \mathcal{P}H^\pi\|_{z,r} \le \|\mathcal{O}H^+ - H^\pi\|_{z,r}$. Since $H^\pi = \mathcal{O}H^\pi$ and by Lemma 1, $\|\mathcal{O}H^+ - H^\pi\|_{z,r} = \|\mathcal{O}H^+ - \mathcal{O}H^\pi\|_{z,r} \le \gamma \|H^+ - H^\pi\|_{z,r}$. Thus $(1-\gamma)\|H^+ - H^\pi\|_{z,r} \le \|\mathcal{P}H^\pi - H^\pi\|_{z,r}$.

To compare the primal and dual results, note that despite the similarity of the bounds, the projection operators do not preserve the tight relationship between primal and dual updates. That is, even if $(1-\gamma)q = Hr$ and $(1-\gamma)(\mathcal{O}q) = (\mathcal{O}H)r$, it is not true in general that $(1-\gamma)(\mathcal{P}\mathcal{O}q) = (\mathcal{P}\mathcal{O}H)r$. The most obvious difference comes from the fact that in the dual, the space of $H$ matrices has bounded diameter, whereas in the primal, the space of $q$ vectors has unbounded diameter in the natural norms. Automatically, the dual updates cannot diverge with compositions like $\mathcal{P}\mathcal{O}$ and $\mathcal{P}\mathcal{M}$; yet, in the primal case, the update $\mathcal{P}\mathcal{M}$ is known not to have a fixed point in some circumstances [8].

5.2 Gradient Operator

In large scale problems one does not normally have the luxury of computing full dynamic programming updates that evaluate complete expectations over the entire domain, since this requires knowing the stationary visit distribution $z$ for $P\Pi$ (essentially requiring one to know the model of the MDP). Moreover, full least squares projections are usually not practical to compute. A key intermediate step toward practical DP and RL algorithms is to formulate gradient step operators that only approximate full projections. Conveniently, the gradient and projection operators are independent of the on-policy and off-policy updates and can be applied in either case. However, as we will see below, the gradient update operator causes significant instability in the off-policy update, to the degree that divergence is a common phenomenon (much more so than with full projections). Composing approximation with an off-policy update (max operator) in the primal case is very dangerous! All other operator combinations are better behaved in practice, and even those that are not known to converge usually behave reasonably. Unfortunately, composing the gradient step with an off-policy update is a common algorithm attempted in reinforcement learning (Q-learning with function approximation), despite being the most unstable.
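To make the last point concrete, here is a minimal sketch of that combination in its familiar RL form--semi-gradient Q-learning with linear function approximation. This is the standard formulation of the scheme, shown only for illustration; it is not necessarily the exact primal gradient operator used in the experiments below.

```python
import numpy as np

def q_learning_step(w, phi_sa, r_sa, phi_next_all, gamma, alpha):
    """One semi-gradient off-policy update of q_hat(sa) = phi(sa)^T w.

    phi_sa:        feature vector of the current state-action pair
    phi_next_all:  rows are feature vectors of every action at the next state
    """
    td_target = r_sa + gamma * np.max(phi_next_all @ w)   # greedy (max) backup
    td_error = td_target - phi_sa @ w
    return w + alpha * td_error * phi_sa                  # can diverge, cf. [5]
```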
In the dual representation, one can derive a gradient update operator in a similar way to the primal, except that it is important to maintain the constraints on the parameters $w$, since the basis functions are probability distributions. We start by considering the projection objective
$$J_{\hat{H}} = \tfrac{1}{2}\|\hat{H} - H\|_{z,r}^2 \quad\text{subject to}\quad \mathrm{vec}(\hat{H}) = \Psi w,\ \ w \ge 0,\ \ w^\top \mathbf{1} = 1.$$
The unconstrained gradient of this objective with respect to $w$ is
$$\nabla_w J_{\hat{H}} = \Psi^\top (r^\top \otimes I)^\top Z (r^\top \otimes I)(\Psi w - h) = \Lambda^\top Z (r^\top \otimes I)(\hat{h} - h),$$
where $\Lambda = (r^\top \otimes I)\Psi$, $\hat{h} = \mathrm{vec}(\hat{H})$, and $h = \mathrm{vec}(H)$. However, this gradient step cannot be followed directly because we need to maintain the constraints. The constraint $w^\top \mathbf{1} = 1$ can be maintained by first projecting the gradient onto it, obtaining $\Delta w = (I - \frac{1}{k}\mathbf{1}\mathbf{1}^\top)\nabla_w J_{\hat{H}}$. Thus, the weight vector can be updated by
$$w_{t+1} = w_t - \alpha\, \Delta w = w_t - \alpha \big(I - \tfrac{1}{k}\mathbf{1}\mathbf{1}^\top\big) \Lambda^\top Z (r^\top \otimes I)(\hat{h} - h),$$
where $\alpha$ is a step-size parameter. The gradient operator can then be defined by
$$\mathcal{G}_{\hat{h}}\, h = \hat{h} - \alpha\, \Psi\, \Delta w = \hat{h} - \alpha\, \Psi \big(I - \tfrac{1}{k}\mathbf{1}\mathbf{1}^\top\big) \Lambda^\top Z (r^\top \otimes I)(\hat{h} - h).$$
(Note that to further respect the box constraints $0 \le \hat{h} \le 1$, the step-size might need to be reduced and additional equality constraints might have to be imposed on some of the components of $\hat{h}$ that are at the boundary values.) As in the primal, since the target vector $H$ (i.e., $h$) is determined by the underlying dynamic programming update, this gives the composed updates
$$\mathcal{G}_{\hat{h}} \mathcal{O}\hat{h} = \hat{h} - \alpha\, \Psi \big(I - \tfrac{1}{k}\mathbf{1}\mathbf{1}^\top\big) \Lambda^\top Z (r^\top \otimes I)(\hat{h} - \mathcal{O}\hat{h}) \quad\text{and}\quad \mathcal{G}_{\hat{h}} \mathcal{M}\hat{h} = \hat{h} - \alpha\, \Psi \big(I - \tfrac{1}{k}\mathbf{1}\mathbf{1}^\top\big) \Lambda^\top (r^\top \otimes I)(\hat{h} - \mathcal{M}\hat{h}),$$
respectively, for the on-policy and off-policy cases (ignoring the additional equality constraints). Iterations that attempt to optimize $\hat{h}$ in the on-policy and off-policy cases are then given by
$$\hat{h}_{t+1} = \mathcal{G}_{\hat{h}_t} \mathcal{O}\hat{h}_t \quad\text{and}\quad \hat{h}_{t+1} = \mathcal{G}_{\hat{h}_t} \mathcal{M}\hat{h}_t.$$
Thus far, the dual approach appears to hold an advantage over the standard primal approach, since convergence holds in every circumstance where the primal updates converge, and yet the dual updates are guaranteed never to diverge, because the fundamental objects being represented are normalized probability distributions (i.e., they belong to a bounded simplex). We now investigate the convergence properties of the various updates empirically.

6 Experimental Results

To investigate the effectiveness of the dual representations, we conducted experiments on several domains: randomly synthesized MDPs, the star problem [5], and the mountain car problem. The randomly synthesized MDP domains allow us to test the general properties of the algorithms. The star problem is perhaps the most-cited example of a problem where Q-learning with linear function approximation diverges [5], and the mountain car domain has been prone to divergence with some primal representations [9], although successful results have been reported when bases are selected by sparse tile coding [10].

For each problem domain, twelve algorithms were run over 100 repeats with a horizon of 1000 steps. The algorithms were: tabular on-policy ($\mathcal{O}$), projection on-policy ($\mathcal{P}\mathcal{O}$), gradient on-policy ($\mathcal{G}\mathcal{O}$), tabular off-policy ($\mathcal{M}$), projection off-policy ($\mathcal{P}\mathcal{M}$), and gradient off-policy ($\mathcal{G}\mathcal{M}$), for both the primal and the dual. The discount factor was set to $\gamma = 0.9$. For on-policy algorithms, we measure the difference between the values generated by the algorithms and those generated by the analytically determined fixed point. For off-policy algorithms, we measure the difference between the values generated by the resulting policy and the values of the optimal policy. The step size for the gradient updates was 0.1 for primal representations and 100 for dual representations.
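For the on-policy case, such reference points are available in closed form from the policy-specific Bellman equations of Section 4: $q^\pi$ solves $q = r + \gamma P\Pi q$ and $H^\pi$ solves $H = (1-\gamma)I + \gamma P\Pi H$. The sketch below (NumPy; an illustration of how the reference points could be computed, assumed consistent with the paper's protocol) makes this explicit.

```python
import numpy as np

def on_policy_fixed_points(P, Pi, r, gamma):
    """Closed-form solutions of the policy-specific Bellman equations."""
    n = P.shape[0]
    A_mat = np.eye(n) - gamma * P @ Pi
    q_pi = np.linalg.solve(A_mat, r)              # q^pi = (I - gamma P Pi)^{-1} r
    H_pi = (1 - gamma) * np.linalg.inv(A_mat)     # H^pi = (1 - gamma)(I - gamma P Pi)^{-1}
    return q_pi, H_pi

# Sanity check of the identity (1 - gamma) q^pi = H^pi r, reusing P, Pi, r from
# the earlier snippets:
# q_pi, H_pi = on_policy_fixed_points(P, Pi, r, 0.9)
# assert np.allclose((1 - 0.9) * q_pi, H_pi @ r)
```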
The initial values of the state-action value functions $q$ are set according to the standard normal distribution, and the state-action visit distributions $H$ are chosen uniformly at random with row normalization. Since the goal is to investigate the convergence of the algorithms without carefully crafting features, we also choose random basis functions according to a standard normal distribution for the primal representations, and random basis distributions according to a uniform distribution for the dual representations.

Randomly Synthesized MDPs. For the synthesized MDPs, we generated the transition and reward functions of the MDPs randomly--the transition function is uniformly distributed between 0 and 1 and the reward function is drawn from a standard normal. Here we report only the results for random MDPs with 100 states, 5 actions, and 10 bases; we observed consistent convergence of the dual representations on a variety of MDPs with different numbers of states, actions, and bases. In Figure 1 (right), the curve for the gradient off-policy update ($\mathcal{G}\mathcal{M}$) in the primal case (dotted line with the circle marker) blows up (diverges), while all the other algorithms in Figure 1 converge. Interestingly, the approximation error of the dual algorithm $\mathcal{P}\mathcal{O}H$ ($4.60 \times 10^{-3}$) is much smaller than the approximation error of the corresponding primal algorithm $\mathcal{P}\mathcal{O}q$ ($4.23 \times 10^{-2}$), even though their theoretical bounds are the same (see Figure 1 (left)).

Figure 1: Updates of state-action value q and visit distribution H on randomly synthesized MDPs (left: on-policy updates; right: off-policy updates; each curve plots the difference from the reference point, on a log scale, against the number of steps).

The Star Problem. The star problem has 7 states and 2 actions. The reward function is zero for every transition. In these experiments, we used the same fixed policy and linear value function approximation as in [5]. In the dual, the number of bases is also set to 14, and the initial values of the state-action visit distribution matrix $H$ are uniformly distributed random numbers between 0 and 1 with row normalization. The gradient off-policy update in the primal case diverges (see the dotted line with the circle marker in Figure 2 (right)). However, all the updates with the dual representation converge.

Figure 2: Updates of state-action value q and visit distribution H on the star problem (left: on-policy updates; right: off-policy updates; each curve plots the difference from the reference point, on a log scale, against the number of steps).

The Mountain Car Problem. The mountain car domain has continuous state and action spaces, which we discretized with a simple grid, resulting in an MDP with 222 states and 3 actions. The number of bases was chosen to be 5 for both the primal and dual algorithms. For the same reason as before, we chose the bases for the algorithms randomly. In the primal representations with linear function approximation, we randomly generated basis functions according to the standard normal distribution.
In the dual representations, we randomly picked the basis distributions according to the uniform distribution. In Figure 3 (right), we again observe divergence of the gradient off-policy update on state-action values in the primal, and convergence of all the dual algorithms. Again, the approximation error of the projected on-policy update $\mathcal{P}\mathcal{O}H$ in the dual ($1.90 \times 10^{1}$) is considerably smaller than that of $\mathcal{P}\mathcal{O}q$ ($3.26 \times 10^{2}$) in the primal.

Figure 3: Updates of state-action value q and visit distribution H on the mountain car problem (left: on-policy updates; right: off-policy updates; each curve plots the difference from the reference point, on a log scale, against the number of steps).

7 Conclusion

Dual representations maintain an explicit representation of visit distributions as opposed to value functions [4]. We extended the dual dynamic programming algorithms with linear function approximation, and studied the convergence properties of the dual algorithms for planning in MDPs. We demonstrated that dual algorithms, because they are based on estimating normalized probability distributions rather than unbounded value functions, avoid divergence even in the presence of approximation and off-policy updates. Moreover, dual algorithms remain stable in situations where standard value function estimation diverges.

References

[1] M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.
[2] D. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 1995.
[3] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[4] T. Wang, M. Bowling, and D. Schuurmans. Dual representations for dynamic programming and reinforcement learning. In Proceedings of the IEEE International Symposium on ADPRL, pages 44-51, 2007.
[5] L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, pages 30-37, 1995.
[6] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[7] J. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.
[8] D. de Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3):589-608, 2000.
[9] J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7, pages 369-376, 1995.
[10] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, pages 1038-1044, 1996.