An Analysis of Reinforcement Learning with Function Approximation

Francisco S. Melo, Carnegie Mellon University, Pittsburgh, PA 15213, USA (fmelo@cs.cmu.edu)
Sean P. Meyn, Coordinated Science Lab, Urbana, IL 61801, USA (meyn@control.csl.uiuc.edu)
M. Isabel Ribeiro, Institute for Systems and Robotics, 1049-001 Lisboa, Portugal (mir@isr.ist.utl.pt)

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

We address the problem of computing the optimal Q-function in Markov decision problems with infinite state-space. We analyze the convergence properties of several variations of Q-learning when combined with function approximation, extending the analysis of TD-learning in (Tsitsiklis & Van Roy, 1996a) to stochastic control settings. We identify conditions under which such approximate methods converge with probability 1. We conclude with a brief discussion on the general applicability of our results and compare them with several related works.

1. Introduction

Convergence of Q-learning with function approximation has been a long-standing open question in reinforcement learning (Sutton, 1999). In general, value-based reinforcement learning (RL) methods for optimal control behave poorly when combined with function approximation (Baird, 1995; Tsitsiklis & Van Roy, 1996a). In this paper, we address this problem by analyzing the convergence of Q-learning when combined with linear function approximation. We identify a set of conditions that imply the convergence of this approximation method with probability 1 (w.p.1), when a fixed learning policy is used, and provide an interpretation of the resulting approximation as the fixed point of a Bellman-like operator. This motivates the analysis of several variations of Q-learning when combined with linear function approximation. In particular, we study a variation of Q-learning using importance sampling and an on-policy variant of Q-learning (SARSA).

The paper is organized as follows. We start in Section 2 by describing Markov decision problems. We proceed with our analysis of the Q-learning algorithm and its variants, and produce our main results in Section 3. We also compare our results with other related works in the RL literature. We conclude with some further discussion in Section 4.

2. Markov Decision Problems

Let $(\mathcal{X}, \mathcal{A}, P, r, \gamma)$ be a Markov decision problem (MDP) with a compact state-space $\mathcal{X} \subset \mathbb{R}^p$ and a finite action set $\mathcal{A}$. The action-dependent kernel $P_a$ defines the transition probabilities for the underlying controlled Markov chain $\{X_t\}$ as

$$\mathbb{P}\left[X_{t+1} \in U \mid X_t = x, A_t = a\right] = P_a(x, U),$$

where $U$ is any measurable subset of $\mathcal{X}$. The $\mathcal{A}$-valued process $\{A_t\}$ represents the control process: $A_t$ is the control action at time instant $t$. (We take the control process $\{A_t\}$ to be adapted to the $\sigma$-algebra induced by $\{X_t\}$.) Solving the MDP consists in determining the control process $\{A_t\}$ maximizing the expected total discounted reward

$$V\big(\{A_t\}, x\big) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(X_t, A_t) \,\Big|\, X_0 = x\right],$$

where $0 < \gamma < 1$ is a discount factor and $R(x, a)$ represents a random "reward" received for taking action $a \in \mathcal{A}$ in state $x \in \mathcal{X}$. For simplicity of notation, we consider a bounded deterministic function $r : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to \mathbb{R}$ assigning a reward $r(x, a, y)$ every time a transition from $x$ to $y$ occurs after taking action $a$. This means that

$$\mathbb{E}\left[R(x, a)\right] = \int_{\mathcal{X}} r(x, a, y)\, P_a(x, dy).$$
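To make the quantity being maximized concrete, the discounted value of a fixed policy can be estimated by simulating the chain and truncating the sum; the sketch below is only an illustration, and the environment interface (env.reset, env.step) and the policy function are assumptions, not constructs from the paper.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite sample trajectory.

    Truncating the infinite-horizon sum is justified because 0 < gamma < 1
    and the reward function r is bounded, so the tail decays as gamma^T.
    """
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_value(env, policy, x0, gamma=0.95, horizon=200, episodes=100):
    """Monte Carlo estimate of V({A_t}, x0) = E[sum_t gamma^t R(X_t, A_t) | X_0 = x0].

    env.reset(x0), env.step(a) and policy(x) are illustrative interfaces
    (not defined in the paper): step returns the next state sampled from
    P_a(x, .) together with the reward r(x, a, y).
    """
    returns = []
    for _ in range(episodes):
        x = env.reset(x0)
        rewards = []
        for _ in range(horizon):
            a = policy(x)          # a ~ pi(x, .)
            x, r = env.step(a)     # y ~ P_a(x, .), r = r(x, a, y)
            rewards.append(r)
        returns.append(discounted_return(rewards, gamma))
    return float(np.mean(returns))
```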
The optimal value function $V^*$ is defined for each state $x \in \mathcal{X}$ as

$$V^*(x) = \max_{\{A_t\}} V\big(\{A_t\}, x\big) = \max_{\{A_t\}} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(X_t, A_t) \,\Big|\, X_0 = x\right]$$

and verifies the Bellman optimality equation

$$V^*(x) = \max_{a \in \mathcal{A}} \int_{\mathcal{X}} \big[r(x, a, y) + \gamma V^*(y)\big]\, P_a(x, dy). \qquad (1)$$

$V^*(x)$ represents the expected total discounted reward received along an optimal trajectory starting at state $x$. We can also define the optimal Q-function $Q^*$ as

$$Q^*(x, a) = \int_{\mathcal{X}} \big[r(x, a, y) + \gamma V^*(y)\big]\, P_a(x, dy), \qquad (2)$$

representing the expected total discounted reward along a trajectory starting at state $x$, obtained by choosing $a$ as the first action and following the optimal policy thereafter. The control process $\{A_t^*\}$ defined as

$$A_t^* = \arg\max_{a \in \mathcal{A}} Q^*(X_t, a), \qquad \forall t,$$

is optimal in the sense that $V(\{A_t^*\}, x) = V^*(x)$, and defines a mapping $\pi^* : \mathcal{X} \to \mathcal{A}$ known as the optimal policy. The optimal policy determines the optimal decision rule for a given MDP.

More generally, a (Markov) policy is any mapping $\pi_t$ defined over $\mathcal{X} \times \mathcal{A}$ generating a control process $\{A_t\}$ verifying, for all $t$,

$$\mathbb{P}\left[A_t = a \mid X_t = x\right] = \pi_t(x, a).$$

We write $V^{\pi_t}(x)$ instead of $V(\{A_t\}, x)$ if the control process $\{A_t\}$ is generated by policy $\pi_t$. A policy $\pi_t$ is stationary if it does not depend on $t$, and deterministic if it assigns probability 1 to a single action in each state, being thus represented as a map $\pi_t : \mathcal{X} \to \mathcal{A}$ for every $t$. Notice that the optimal control process can be obtained from the optimal (stationary, deterministic) policy $\pi^*$, which can in turn be obtained from $Q^*$. Therefore, the optimal control problem is solved once the function $Q^*$ is known for all pairs $(x, a)$.

Now given any real function $q$ defined over $\mathcal{X} \times \mathcal{A}$, we define the Bellman operator

$$(\mathbf{H}q)(x, a) = \int_{\mathcal{X}} \Big[r(x, a, y) + \gamma \max_{u \in \mathcal{A}} q(y, u)\Big]\, P_a(x, dy). \qquad (3)$$

The function $Q^*$ in (2) is the fixed point of $\mathbf{H}$ and, since this operator is a contraction in the sup-norm, a fixed-point iteration can be used to determine $Q^*$ (at least theoretically).

2.1. The Q-Learning Algorithm

We previously suggested that a fixed-point iteration could be used to determine the function $Q^*$. In practice, this requires two important conditions:

· The kernel $P$ and the reward function $r$ are known;
· The successive estimates for $Q^*$ can be represented compactly and stored in a computer with finite memory.

If $P$ and/or $r$ are not known, a fixed-point iteration using $\mathbf{H}$ is not possible. To solve this problem, Watkins proposed in 1989 the Q-learning algorithm (Watkins, 1989). Q-learning proceeds as follows: consider an MDP $\mathcal{M} = (\mathcal{X}, \mathcal{A}, P, r, \gamma)$ and suppose that $\{x_t\}$ is an infinite sample trajectory of the underlying Markov chain obtained with some policy $\pi_t$. The corresponding sample control process is denoted as $\{a_t\}$ and the sequence of obtained rewards as $\{r_t\}$. Given any initial estimate $Q_0$, Q-learning successively updates this estimate using the rule

$$Q_{t+1}(x, a) = Q_t(x, a) + \alpha_t(x, a)\,\delta_t, \qquad (4)$$

where $\{\alpha_t\}$ is a step-size sequence and $\delta_t$ is the temporal difference at time $t$,

$$\delta_t = r_t + \gamma \max_{b \in \mathcal{A}} Q_t(x_{t+1}, b) - Q_t(x_t, a_t). \qquad (5)$$

If both $\mathcal{X}$ and $\mathcal{A}$ are finite sets, each estimate $Q_t$ is simply a $|\mathcal{X}| \times |\mathcal{A}|$ matrix and can be represented explicitly in a computer. In that case, the convergence of Q-learning and several other related algorithms (such as TD($\lambda$) or SARSA) has been thoroughly studied (see, for example, (Bertsekas & Tsitsiklis, 1996) and references therein). However, if either $\mathcal{X}$ or $\mathcal{A}$ is infinite or very large, explicitly representing each $Q_t$ becomes infeasible and some form of compact representation is needed (e.g., using function approximation).
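In the finite (tabular) case just described, one iteration of the update (4)-(5) amounts to a few array operations; the sketch below is a minimal illustration, with integer state/action indices as an assumed encoding.

```python
import numpy as np

def q_learning_step(Q, x, a, r, x_next, alpha, gamma):
    """One tabular Q-learning update, following (4)-(5).

    Q is an |X| x |A| array of current estimates; x, a and x_next are integer
    indices of states and actions (an illustrative encoding, not the paper's).
    """
    delta = r + gamma * np.max(Q[x_next]) - Q[x, a]   # temporal difference (5)
    Q[x, a] += alpha * delta                          # update rule (4)
    return Q
```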
In this paper, we address how several RL methods such as Q-learning and SARSA can be combined with function approximation and still retain their main convergence properties.

3. Reinforcement Learning with Linear Function Approximation

In this section, we address the problem of determining the optimal Q-function for MDPs with infinite state-space $\mathcal{X}$. Let $\mathcal{Q} = \{Q_\theta\}$ be a family of real-valued functions defined in $\mathcal{X} \times \mathcal{A}$. It is assumed that the function class is linearly parameterized, so that $\mathcal{Q}$ can be expressed as the linear span of a fixed set of $M$ linearly independent functions $\varphi_i : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$. For each $M$-dimensional parameter vector $\theta \in \mathbb{R}^M$, the function $Q_\theta \in \mathcal{Q}$ is defined as

$$Q_\theta(x, a) = \sum_{i=1}^{M} \varphi_i(x, a)\,\theta(i) = \varphi^\top(x, a)\,\theta,$$

where $\top$ represents the transpose operator. We will also denote the above function by $Q(\theta)$ to emphasize the dependence on $\theta$ over the dependence on $(x, a)$.

Let $\pi$ be a fixed stochastic, stationary policy and suppose that $\{x_t\}$, $\{a_t\}$ and $\{r_t\}$ are sampled trajectories of states, actions and rewards obtained from the MDP using policy $\pi$. In the original Q-learning algorithm, the Q-values are updated according to (4). The temporal difference $\delta_t$ can be interpreted as a 1-step estimation error with respect to the optimal function $Q^*$. The update rule in Q-learning "moves" the estimates $Q_t$ closer to the desired function $Q^*$, minimizing the expected value of $\delta_t$. In our approximate setting, we apply the same underlying idea to obtain the update rule for approximate Q-learning:

$$\theta_{t+1} = \theta_t + \alpha_t\,\nabla_\theta Q_\theta(x_t, a_t)\,\delta_t = \theta_t + \alpha_t\,\varphi(x_t, a_t)\,\delta_t, \qquad (6)$$

where, as above, $\delta_t$ is the temporal difference at time $t$ defined in (5). Notice that (6) updates $\theta_t$ using the temporal difference $\delta_t$ as the error. The gradient $\nabla_\theta Q_\theta(x_t, a_t)$ provides the "direction" in which this update is performed.
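For illustration, a single step of the update (6), with the temporal difference computed from the current approximation as in (5), might look as follows; the feature map features(x, a) and the explicit action set are assumed interfaces, not part of the paper.

```python
import numpy as np

def approx_q_update(theta, x, a, r, x_next, alpha, gamma, features, actions):
    """One step of approximate Q-learning, following (6) with delta_t from (5).

    features(x, a) returns the M-dimensional vector phi(x, a) as a NumPy array
    and actions is the finite action set A; both are assumed interfaces.
    """
    q_next = max(features(x_next, b) @ theta for b in actions)  # max_b Q_theta(x_{t+1}, b)
    delta = r + gamma * q_next - features(x, a) @ theta         # temporal difference (5)
    return theta + alpha * features(x, a) * delta               # update rule (6)
```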
To establish convergence of the algorithm (6) we adopt an ODE argument, establishing that the trajectories of the algorithm closely follow those of an associated ODE with a globally asymptotically stable equilibrium point. This will require several regularity properties on the policy $\pi$ and on its induced Markov chain that will, in turn, motivate the study of the on-policy version of the algorithm. This on-policy algorithm can be seen as an extension of SARSA to infinite settings.

3.1. Convergence of Q-Learning

We now proceed by identifying conditions that ensure the convergence of Q-learning with linear function approximation as described by (6). Due to space limitations, we overlook some of the technical details in the proofs that can easily be filled in. We start by introducing some notation that will greatly simplify the presentation.

Given an MDP $\mathcal{M} = (\mathcal{X}, \mathcal{A}, P, r, \gamma)$ with compact state space $\mathcal{X} \subset \mathbb{R}^p$, let $(\mathcal{X}, P_\pi)$ be the Markov chain induced by a fixed policy $\pi$. We assume the chain $(\mathcal{X}, P_\pi)$ to be uniformly ergodic with invariant probability measure $\mu_X$ and the policy $\pi$ to verify $\pi(x, a) > 0$ for all $a \in \mathcal{A}$ and $\mu_X$-almost all $x \in \mathcal{X}$. We denote by $\mu_\pi$ the probability measure defined for each measurable set $U \subset \mathcal{X}$ and each action $a \in \mathcal{A}$ as

$$\mu_\pi\big(U \times \{a\}\big) = \int_U \pi(x, a)\,\mu_X(dx).$$

Let now $\{\varphi_i,\ i = 1, \ldots, M\}$ be a set of bounded, linearly independent basis functions to be used in our approximate Q-learning algorithm. We denote by $\Sigma_\pi$ the matrix defined as

$$\Sigma_\pi = \int_{\mathcal{X} \times \mathcal{A}} \varphi(x, a)\,\varphi^\top(x, a)\; d\mu_\pi = \mathbb{E}_\pi\big[\varphi(x, a)\,\varphi^\top(x, a)\big].$$

Notice that the above expression is well-defined and independent of the initial distribution for the chain, due to our assumption of uniform ergodicity. For fixed $\theta \in \mathbb{R}^M$ and $x \in \mathcal{X}$, define the set of maximizing actions at $x$ as

$$\mathcal{A}_x^\theta = \Big\{a^* \in \mathcal{A} \;\Big|\; \varphi^\top(x, a^*)\,\theta = \max_{a \in \mathcal{A}} \varphi^\top(x, a)\,\theta\Big\}$$

and the greedy policy with respect to $\theta$ as any policy that, at each state $x$, assigns positive probability only to actions in $\mathcal{A}_x^\theta$. Finally, let $\varphi_x^\pi$ denote the vector $\varphi(x, a)$, where $a$ is a random action generated according to the policy $\pi$ at state $x$; likewise, let $\varphi_x^\theta$ denote the vector $\varphi(x, a_x^\theta)$, where $a_x^\theta$ is now any action in $\mathcal{A}_x^\theta$. We now introduce the $\theta$-dependent matrix

$$\Sigma_\pi^*(\theta) = \mathbb{E}\big[\varphi_x^\theta\,(\varphi_x^\theta)^\top\big].$$

By construction, both $\Sigma_\pi$ and $\Sigma_\pi^*(\theta)$ are positive definite, since the functions $\varphi_i$ are assumed linearly independent. Notice also the difference between $\Sigma_\pi$ and each $\Sigma_\pi^*(\theta)$: the actions in the definition of the former are taken according to $\pi$, while in the latter they are taken greedily with respect to a particular $\theta$. We are now in position to introduce our first result.

Theorem 1. Let $\mathcal{M}$, $\pi$ and $\{\varphi_i,\ i = 1, \ldots, M\}$ be as defined above. If, for all $\theta$,

$$\Sigma_\pi > \gamma^2\,\Sigma_\pi^*(\theta) \qquad (7)$$

and the step-size sequence verifies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, then the algorithm in (6) converges w.p.1 and the limit point $\theta^*$ verifies the recursive relation

$$Q(\theta^*) = \mathrm{P}_{\mathcal{Q}}\,\mathbf{H}\,Q(\theta^*),$$

where $\mathrm{P}_{\mathcal{Q}}$ is the orthogonal projection onto $\mathcal{Q}$. (The orthogonal projection is naturally defined in the (infinite-dimensional) Hilbert space containing $\mathcal{Q}$, with inner product given by $\langle f, g\rangle = \int_{\mathcal{X} \times \mathcal{A}} f\,g\; d\mu_\pi$.)

Proof. We establish the main statement of the theorem using a standard ODE argument. The assumptions on the chain $(\mathcal{X}, P_\pi)$ and basis functions $\{\varphi_i,\ i = 1, \ldots, M\}$ and the fact that $\pi(x, a) > 0$ for all $a \in \mathcal{A}$ and $\mu_X$-almost every $x \in \mathcal{X}$ ensure the applicability of Theorem 17 on page 239 of (Benveniste et al., 1990). Therefore, the convergence of the algorithm can be analyzed in terms of the stability of the equilibrium points of the associated ODE

$$\dot\theta = \mathbb{E}_\pi\Big[\varphi_x^\pi\,\big(r(x, a, y) + \gamma\,(\varphi_y^\theta)^\top\theta - (\varphi_x^\pi)^\top\theta\big)\Big], \qquad (8)$$

where we omitted the explicit dependence of $\theta$ on $t$ to avoid excessively cluttering the expression. If the ODE (8) has a globally asymptotically stable equilibrium point, this implies that the algorithm (6) converges w.p.1 (Benveniste et al., 1990). Let then $\theta_1(t)$ and $\theta_2(t)$ be two trajectories of the ODE starting at different initial conditions, and let $\tilde\theta(t) = \theta_1(t) - \theta_2(t)$. From (8), we get

$$\frac{d}{dt}\|\tilde\theta\|_2^2 = -2\,\tilde\theta^\top\Sigma_\pi\tilde\theta + 2\gamma\,\mathbb{E}\Big[\tilde\theta^\top\varphi_x^\pi\,\big((\varphi_y^{\theta_1})^\top\theta_1 - (\varphi_y^{\theta_2})^\top\theta_2\big)\Big].$$

Notice now that, from the definition of $\varphi_y^{\theta_1}$ and $\varphi_y^{\theta_2}$,

$$(\varphi_y^{\theta_1})^\top\theta_2 \le (\varphi_y^{\theta_2})^\top\theta_2 \qquad\text{and}\qquad (\varphi_y^{\theta_2})^\top\theta_1 \le (\varphi_y^{\theta_1})^\top\theta_1.$$

Taking this into account and defining the sets $S^+ = \{(x, a) \mid \tilde\theta^\top\varphi(x, a) > 0\}$ and $S^- = \mathcal{X} \times \mathcal{A} - S^+$, the previous expression becomes

$$\frac{d}{dt}\|\tilde\theta\|_2^2 \le -2\,\tilde\theta^\top\Sigma_\pi\tilde\theta + 2\gamma\,\mathbb{E}\Big[\tilde\theta^\top\varphi_x^\pi\,(\varphi_y^{\theta_1})^\top\tilde\theta\;\mathbb{I}_{S^+}\Big] + 2\gamma\,\mathbb{E}\Big[\tilde\theta^\top\varphi_x^\pi\,(\varphi_y^{\theta_2})^\top\tilde\theta\;\mathbb{I}_{S^-}\Big],$$

where $\mathbb{I}_S$ represents the indicator function for the set $S$. Applying Hölder's inequality to each of the expectations above, we get

$$\frac{d}{dt}\|\tilde\theta\|_2^2 \le -2\,\tilde\theta^\top\Sigma_\pi\tilde\theta + 2\gamma\sqrt{\mathbb{E}\big[(\tilde\theta^\top\varphi_x^\pi)^2\,\mathbb{I}_{S^+}\big]\;\mathbb{E}\big[((\varphi_y^{\theta_1})^\top\tilde\theta)^2\,\mathbb{I}_{S^+}\big]} + 2\gamma\sqrt{\mathbb{E}\big[(\tilde\theta^\top\varphi_x^\pi)^2\,\mathbb{I}_{S^-}\big]\;\mathbb{E}\big[((\varphi_y^{\theta_2})^\top\tilde\theta)^2\,\mathbb{I}_{S^-}\big]}$$

and a few simple computations finally yield

$$\frac{d}{dt}\|\tilde\theta\|_2^2 \le -2\,\tilde\theta^\top\Sigma_\pi\tilde\theta + 2\gamma\,\max\Big(\sqrt{\tilde\theta^\top\Sigma_\pi\tilde\theta\;\;\tilde\theta^\top\Sigma_\pi^*(\theta_1)\tilde\theta},\;\sqrt{\tilde\theta^\top\Sigma_\pi\tilde\theta\;\;\tilde\theta^\top\Sigma_\pi^*(\theta_2)\tilde\theta}\Big).$$

Since, by assumption, $\Sigma_\pi > \gamma^2\Sigma_\pi^*(\theta)$, we can conclude from the expression above that

$$\frac{d}{dt}\|\tilde\theta\|_2^2 < 0.$$

This means, in particular, that $\tilde\theta(t)$ converges asymptotically to the origin, i.e., the ODE (8) is globally asymptotically stable. Since the ODE is autonomous (i.e., time-invariant), there exists one globally asymptotically stable equilibrium point $\theta^*$ for the ODE, which verifies the recursive relation

$$\theta^* = \Sigma_\pi^{-1}\,\mathbb{E}\Big[\varphi_x^\pi\,\big(r(x, a, y) + \gamma\,(\varphi_y^{\theta^*})^\top\theta^*\big)\Big]. \qquad (9)$$

Since $\Sigma_\pi$ is, by construction, positive definite, the inverse in (9) is well-defined. Multiplying (9) by $\varphi^\top(x, a)$ on both sides yields the desired result.
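Although the paper treats (7) purely as an analytical assumption, it can be probed empirically for a given basis, learning policy and parameter vector by estimating both matrices from samples; the sketch below is such an illustration only, with the feature map and the sampling of (x, a) pairs from (approximately) the stationary measure assumed as interfaces.

```python
import numpy as np

def check_condition_7(samples, actions, features, theta, gamma):
    """Empirical check of condition (7): Sigma_pi - gamma^2 Sigma_pi^*(theta) > 0.

    samples is a list of (x, a) pairs drawn (approximately) from mu_pi, the
    stationary distribution of the chain under the learning policy pi;
    features(x, a) returns the M-dimensional vector phi(x, a).  This is an
    illustrative Monte Carlo check for one value of theta, not an algorithm
    from the paper (the condition must hold for every theta).
    """
    M = len(features(*samples[0]))
    Sigma_pi = np.zeros((M, M))    # estimate of E_pi[ phi(x,a) phi(x,a)^T ]
    Sigma_star = np.zeros((M, M))  # estimate of E[ phi_x^theta (phi_x^theta)^T ]
    for x, a in samples:
        phi_xa = features(x, a)
        Sigma_pi += np.outer(phi_xa, phi_xa)
        # greedy action at x with respect to phi^T(x, .) theta
        a_greedy = max(actions, key=lambda b: features(x, b) @ theta)
        phi_greedy = features(x, a_greedy)
        Sigma_star += np.outer(phi_greedy, phi_greedy)
    Sigma_pi /= len(samples)
    Sigma_star /= len(samples)
    # Positive definiteness of the difference, via its smallest eigenvalue.
    return np.linalg.eigvalsh(Sigma_pi - gamma ** 2 * Sigma_star).min() > 0.0
```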
It is now important to observe that condition (7) is quite restrictive: since $\gamma$ is usually taken close to 1, condition (7) essentially requires that, for every $\theta$,

$$\max_{a \in \mathcal{A}} \varphi^\top(x, a)\,\theta \approx \sum_{a \in \mathcal{A}} \pi(x, a)\,\varphi^\top(x, a)\,\theta.$$

Therefore, such a condition will seldom be met in practice, since it implies that the learning policy is already close to the policy that the algorithm is meant to compute. In other words, the maximization above yields a policy close to the policy used during learning; and, when this is the case, the algorithm essentially behaves like an on-policy algorithm. On the other hand, the above condition can be ensured by considering only a local maximization around the learning policy $\pi$. This is the most interesting aspect of the above result: it explicitly relates how much information the learning policy provides about greedy policies, as a function of $\gamma$. To better understand this, notice that each policy is associated with a particular invariant measure on the induced chain $(\mathcal{X}, P_\pi)$. In particular, the measure associated with the learning policy may be very different from the one induced by the greedy/optimal policy. Taking into account the fact that $\gamma$ measures, in a sense, the "importance of the future", Theorem 1 basically states the following:

· In problems where the performance of the agent greatly depends on future rewards ($\gamma \approx 1$), the information provided by the learning policy can only be "safely generalized" to nearby greedy policies, in the sense of (7).
· In problems where the performance of the agent is less dependent on future rewards ($\gamma \ll 1$), the information provided by the learning policy can be safely generalized to more general greedy policies.

Suppose then that the maximization in the update equation (6) is to be replaced by a local maximization. In other words, instead of maximizing over all actions in $\mathcal{A}$, the algorithm should maximize over a small neighborhood of the learning policy (in policy space). The difficulty with this approach is that such a maximization can be hard to implement. The use of importance sampling can readily overcome this difficulty, by making the maximization in policy space implicit. The algorithm thus obtained, which resembles in many aspects the one proposed in (Precup et al., 2001), is described by the update rule

$$\theta_{t+1} = \theta_t + \alpha_t\,\varphi(x_t, a_t)\,\hat\delta_t, \qquad (10)$$

where the modified temporal difference $\hat\delta_t$ is given by

$$\hat\delta_t = r_t + \gamma\,\frac{\hat\pi_{\theta_t}(x_{t+1}, b)}{\pi(x_{t+1}, b)}\,Q_{\theta_t}(x_{t+1}, b) - Q_{\theta_t}(x_t, a_t), \qquad (11)$$

with $b$ the action sampled at $x_{t+1}$ and where $\hat\pi_\theta$ is, for example, a $\theta$-dependent $\epsilon$-greedy policy close to the learning policy $\pi$. (Given $\epsilon > 0$, a policy is $\epsilon$-greedy with respect to a function $Q_\theta \in \mathcal{Q}$ if, at each state $x \in \mathcal{X}$, it chooses a random action with probability $\epsilon$ and a greedy action $a_x^\theta \in \mathcal{A}_x^\theta$ with probability $1 - \epsilon$.) A possible implementation of such an algorithm is sketched in Algorithm 1. We denote by $\pi^N$ the behavior policy at iteration $N$ of the algorithm; in the stopping condition, any adequate policy norm can be used.

Algorithm 1 Modified Q-learning.
Require: Initial policy $\pi^0$;
1: Initialize $X_0 = x_0$ and set $N = 0$;
2: for $t = 0$ until $T$ do
3:   Sample $A_t \sim \pi^N(x_t, \cdot)$;
4:   Sample next-state $X_{t+1} \sim P_{a_t}(x_t, \cdot)$;
5:   $r_t = r(x_t, a_t, x_{t+1})$;
6:   Update $\theta_t$ according to (10);
7: end for
8: Set $\pi^{N+1}(x, a) = \hat\pi_\theta(x, a)$;
9: $N = N + 1$;
10: if $\|\pi^N - \pi^{N-1}\| \le \epsilon$ then
11:   return $\pi^N$;
12: else
13:   Goto 2;
14: end if
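Under the reading of (11) given above, in which the importance ratio reweights the value of the action actually sampled at $x_{t+1}$ so that, in expectation, the target is evaluated under the near-greedy policy $\hat\pi_\theta$, one step of (10)-(11) can be sketched as follows; the probability functions pi and pi_hat and the feature map are assumed interfaces, and this is an illustration rather than the authors' implementation.

```python
import numpy as np

def modified_q_update(theta, x, a, r, x_next, b_next, alpha, gamma,
                      features, pi, pi_hat):
    """One step of the modified (importance-sampling) update (10)-(11).

    b_next is the action actually sampled at x_next from the learning policy
    pi; the ratio pi_hat/pi reweights its value so that, on average, the
    target is the value under the near-greedy policy pi_hat.  The interfaces
    pi(x, a) and pi_hat(theta, x, a), both returning probabilities, and
    features(x, a) are assumptions made for this sketch.
    """
    q_next = features(x_next, b_next) @ theta
    rho = pi_hat(theta, x_next, b_next) / pi(x_next, b_next)       # importance weight
    delta_hat = r + gamma * rho * q_next - features(x, a) @ theta  # modified TD (11)
    return theta + alpha * features(x, a) * delta_hat              # update (10)
```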
3.2. SARSA with Linear Function Approximation

The analysis in the previous subsection suggests that on-policy algorithms may potentially yield more reliable convergence properties. Such a fact has already been observed in (Tsitsiklis & Van Roy, 1996a; Perkins & Pendrith, 2002). In this subsection we thus focus on on-policy algorithms, analyzing the convergence of SARSA when combined with linear function approximation. In our main result, we recover the essence of the result in (Perkins & Precup, 2003), although in a somewhat different setting. The main differences between our work and that in (Perkins & Precup, 2003) are discussed further ahead.

Once again, we consider a family $\mathcal{Q}$ of real-valued functions, the linear span of a fixed set of $M$ linearly independent functions $\varphi_i : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, and derive an on-policy algorithm to compute a parameter vector $\theta^*$ such that $\varphi^\top(x, a)\,\theta^*$ approximates the optimal Q-function. To this purpose, and unlike what has been done so far, we now consider a $\theta$-dependent learning policy $\pi_\theta$ verifying $\pi_\theta(x, a) > 0$ for all $\theta$. In particular, we consider at each time step a learning policy $\pi_{\theta_t}$ that is $\epsilon$-greedy with respect to $\varphi^\top(x, a)\,\theta_t$ and Lipschitz continuous with respect to $\theta$, with Lipschitz constant $C$ (with respect to some preferred metric). We further assume that, for every fixed $\theta$, the Markov chain $(\mathcal{X}, P_{\pi_\theta})$ induced by such a policy is uniformly ergodic.

Let then $\{x_t\}$, $\{a_t\}$ and $\{r_t\}$ be sampled trajectories of states, actions and rewards obtained from the MDP $\mathcal{M} = (\mathcal{X}, \mathcal{A}, P, r, \gamma)$ using, at each time step, the $\theta$-dependent policy $\pi_{\theta_t}$. The update rule for our approximate SARSA algorithm is

$$\theta_{t+1} = \theta_t + \alpha_t\,\varphi(x_t, a_t)\,\delta_t, \qquad (12)$$

where $\delta_t$ is the temporal difference at time $t$,

$$\delta_t = r_t + \gamma\,\varphi^\top(x_{t+1}, a_{t+1})\,\theta_t - \varphi^\top(x_t, a_t)\,\theta_t.$$

In order to use the SARSA algorithm above to approximate the optimal Q-function, it is necessary to slowly decay the exploration rate $\epsilon$ to zero, while guaranteeing that the learning policy verifies the necessary regularity conditions (namely, Lipschitz continuity w.r.t. $\theta$). However, as will soon become apparent, decreasing the exploration rate to zero will render our convergence result (and other related results) not applicable.
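A single step of (12) is again a rank-one update of the parameter vector; the following minimal sketch assumes a feature-map interface and that a_next was drawn at $x_{t+1}$ from the $\theta$-dependent learning policy.

```python
import numpy as np

def sarsa_update(theta, x, a, r, x_next, a_next, alpha, gamma, features):
    """One approximate SARSA update, following (12).

    features(x, a) returns the vector phi(x, a) as a NumPy array (an assumed
    interface); a_next is the action drawn at x_next from pi_{theta_t}.
    """
    delta = (r + gamma * features(x_next, a_next) @ theta
             - features(x, a) @ theta)             # temporal difference in (12)
    return theta + alpha * features(x, a) * delta  # on-policy parameter update
```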
We are now in position to introduce our main result.

Theorem 2. Let $\mathcal{M}$, $\pi_{\theta}$ and $\{\varphi_i,\ i = 1, \ldots, M\}$ be as defined above. Let $C$ be the Lipschitz constant of the learning policy with respect to $\theta$. Assume that the step-size sequence verifies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$. Then, there is $C_0 > 0$ such that, if $C < C_0$, the algorithm in (12) converges w.p.1.

Proof. We again use an ODE argument to establish the statement of the theorem. As before, the assumptions on $(\mathcal{X}, P_{\pi_\theta})$ and the basis functions $\{\varphi_i,\ i = 1, \ldots, M\}$ and the fact that the learning policy is Lipschitz continuous with respect to $\theta$ and verifies $\pi_\theta(x, a) > 0$ ensure the applicability of Theorem 17 on page 239 of (Benveniste et al., 1990). Therefore, the convergence of the algorithm can be analyzed in terms of the stability of the associated ODE

$$\dot\theta = \mathbb{E}_{\pi_\theta}\Big[\varphi(x, a)\,\big(r(x, a, y) + \gamma\,\varphi^\top(y, a')\,\theta - \varphi^\top(x, a)\,\theta\big)\Big]. \qquad (13)$$

Notice that the expectation is taken with respect to the invariant measure of the chain and the learning policy, both $\theta$-dependent. To establish global asymptotic stability, we re-write (13) as

$$\dot\theta(t) = \mathbf{A}_{\pi_\theta}\,\theta(t) + b_{\pi_\theta},$$

where

$$\mathbf{A}_{\pi_\theta} = \mathbb{E}_{\pi_\theta}\big[\varphi(x, a)\,\big(\gamma\,\varphi(y, a') - \varphi(x, a)\big)^\top\big], \qquad b_{\pi_\theta} = \mathbb{E}_{\pi_\theta}\big[\varphi(x, a)\,r(x, a, y)\big].$$

An equilibrium point of (13) must verify $\theta^* = -\mathbf{A}_{\pi_{\theta^*}}^{-1} b_{\pi_{\theta^*}}$, and the existence of such an equilibrium point has been established in (de Farias & Van Roy, 2000) (Theorem 5.1). Let $\tilde\theta(t) = \theta(t) - \theta^*$ and let

$$\lambda_A = \sup_{\theta \ne \theta^*} \frac{\|\mathbf{A}_{\pi_\theta} - \mathbf{A}_{\pi_{\theta^*}}\|}{\|\theta - \theta^*\|_2}, \qquad \lambda_b = \sup_{\theta \ne \theta^*} \frac{\|b_{\pi_\theta} - b_{\pi_{\theta^*}}\|_2}{\|\theta - \theta^*\|_2},$$

where the norm in the definition of $\lambda_A$ is the induced operator norm and the one in the definition of $\lambda_b$ is the regular Euclidean norm. Then

$$\frac{d}{dt}\|\tilde\theta\|_2^2 = 2\,\tilde\theta^\top\mathbf{A}_{\pi_{\theta^*}}\tilde\theta + 2\,\tilde\theta^\top\big(\mathbf{A}_{\pi_\theta} - \mathbf{A}_{\pi_{\theta^*}}\big)\,\theta + 2\,\tilde\theta^\top\big(b_{\pi_\theta} - b_{\pi_{\theta^*}}\big) \le 2\,\tilde\theta^\top\mathbf{A}_{\pi_{\theta^*}}\tilde\theta + 2\,\big(\lambda_A\|\theta\|_2 + \lambda_b\big)\,\|\tilde\theta\|_2^2.$$

Letting $\lambda = \lambda_A\|\theta\|_2 + \lambda_b$, the above expression can be written as

$$\frac{d}{dt}\|\tilde\theta\|_2^2 \le 2\,\tilde\theta^\top\big(\mathbf{A}_{\pi_{\theta^*}} + \lambda I\big)\,\tilde\theta.$$

The fact that the learning policy is assumed Lipschitz w.r.t. $\theta$ and the uniform ergodicity of the corresponding induced chain imply that $\mathbf{A}_{\pi_\theta}$ and $b_{\pi_\theta}$ are also Lipschitz w.r.t. $\theta$ (with a different constant). This means that $\lambda$ goes to zero with $C$ and, therefore, for $C$ sufficiently small, $(\mathbf{A}_{\pi_{\theta^*}} + \lambda I)$ is a negative definite matrix. (The fact that $\mathbf{A}_{\pi_{\theta^*}}$ is negative definite has been established in several works; see, for example, Lemma 3 in (Perkins & Precup, 2003) or, in a slightly different setting easily extendable to ours, the proof of Theorem 1 in (Tsitsiklis & Van Roy, 1996a).) Therefore, the ODE (13) is globally asymptotically stable and the conclusion of the theorem follows.

Several remarks are now in order. First of all, Theorem 2 basically states that, for fixed $\epsilon$, if the dependence of the learning policy on $\theta$ can be made sufficiently "smooth", then SARSA converges w.p.1. This result is similar to the result in (Perkins & Precup, 2003), although the algorithms are not exactly alike: we consider a continuing task, while the algorithm featured in (Perkins & Precup, 2003) is implemented in an episodic fashion. Furthermore, in our case, convergence was established using an ODE argument, instead of the contraction argument in (Perkins & Precup, 2003). Nevertheless, both methods of proof are, in essence, equivalent and the results in both papers concordant.

A second remark concerns the implementation of SARSA with a decaying exploration policy. The analysis of one such algorithm could be conducted using, once again, an ODE argument. In particular, SARSA could be described as a two-time-scale algorithm: the iterations of the main algorithm (corresponding to (12)) would develop on a faster time-scale and the decaying exploration rate would develop at a slower time-scale. The analysis in (Borkar, 1997) could then be replicated. However, it is well-known that, as $\epsilon$ approaches zero, the learning policy will approach the greedy policy w.r.t. $\theta$, which is, in general, discontinuous. Therefore, there is little hope that the smoothness condition in Theorem 2 (or its equivalent in (Perkins & Precup, 2003)) can be met as $\epsilon$ approaches zero.
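To illustrate the kind of smoothness Theorem 2 asks for, one common construction (an assumption of this sketch, not the paper's $\epsilon$-greedy construction) mixes uniform exploration with a Boltzmann distribution over the approximate Q-values, which keeps $\pi_\theta(x, a) > 0$ for all actions and varies Lipschitz-continuously with $\theta$ when the basis functions are bounded.

```python
import numpy as np

def smooth_policy(theta, x, actions, features, epsilon=0.1, temperature=1.0):
    """Action probabilities of a learning policy that explores everywhere
    and varies smoothly with theta.

    Mixes uniform exploration (weight epsilon) with a Boltzmann/softmax
    distribution over the approximate Q-values phi^T(x, .) theta.  This is
    one possible way to obtain a policy with pi_theta(x, a) > 0 that is
    Lipschitz in theta; it is an illustration, not the epsilon-greedy
    construction analyzed in the paper.
    """
    q = np.array([features(x, a) @ theta for a in actions]) / temperature
    q -= q.max()                                  # for numerical stability
    soft = np.exp(q) / np.exp(q).sum()
    return epsilon / len(actions) + (1.0 - epsilon) * soft
```

In this construction, raising the temperature or $\epsilon$ makes the action probabilities less sensitive to $\theta$, shrinking the Lipschitz constant $C$ that Theorem 2 requires to be below $C_0$; letting both decay to zero recovers the discontinuous greedy policy discussed above.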
4. Discussion

We now briefly discuss some of the assumptions in the above theorems. We start by emphasizing that all stated conditions are only sufficient, meaning that convergence may occur even if some (or all) of them fail to hold. We also discuss the relation between our results and other related works from the literature.

Secondly, uniform ergodicity of a Markov chain essentially means that the chain quickly converges to a stationary behavior uniformly over the state-space, so that we can study any properties of the stationary chain by direct sampling. (Explicit bounds on the rate of convergence to stationarity are available in the literature (Meyn & Tweedie, 1994; Diaconis & Saloff-Coste, 1996; Rosenthal, 2002); however, for general chains, such bounds tend to be loose.) This property and the requirement that $\pi(x, a) > 0$ for all $a \in \mathcal{A}$ and $\mu_X$-almost all $x \in \mathcal{X}$ can be interpreted as a continuous counterpart to the usual condition that all state-action pairs are visited infinitely often. In fact, uniform ergodicity implies that all the regions of the state-space with positive $\mu_X$ measure are "sufficiently" visited (Meyn & Tweedie, 1993), and the condition $\pi(x, a) > 0$ ensures that, at each state, every action is "sufficiently" tried. It appears to be a standard requirement in this continuous scenario, as it has also been used in other works (Tsitsiklis & Van Roy, 1996a; Singh et al., 1994; Perkins & Precup, 2003). (Most of these works make use of geometric ergodicity which, since we admit a compact state-space, is a consequence of uniform ergodicity.) The requirement that $\pi(x, a) > 0$ for all $a \in \mathcal{A}$ and $\mu_X$-almost all $x \in \mathcal{X}$ also corresponds to the concept of fair control as introduced in (Borkar, 2000). We also remark that the divergence example in (Gordon, 1996) is due to the fact that the learning policy fails to verify the Lipschitz continuity condition stated in Theorem 2 (as discussed in (Perkins & Pendrith, 2002)).

4.1. Related Work

In this paper, we analyzed how RL algorithms can be combined with linear function approximation to approximate the optimal Q-function in MDPs with infinite state-spaces. In the last decade or so, several authors have addressed this same problem from different perspectives. We now briefly discuss several such approaches and their relation to the results in this paper.

One possible approach is to rely on soft-state aggregation (Singh et al., 1994; Gordon, 1995; Tsitsiklis & Van Roy, 1996b), partitioning the state-space into "soft" regions. Treating the soft regions as "hyper-states", these methods then use standard learning methods (such as Q-learning or SARSA) to approximate the optimal Q-function. The main differences between such methods and those using linear function approximation (such as the ones portrayed here) are that, in the former, only one component of the parameter vector is updated at each iteration and the basis functions are, by construction, constrained to be positive and to add to one at each point of the state-space. Sample-based methods (Ormoneit & Sen, 2002; Szepesvári & Smart, 2004) further generalize the applicability of soft-state aggregation methods by using spreading functions/kernels (Ribeiro & Szepesvári, 1996). Sample-based methods thus exhibit superior convergence rates when compared with simple soft-state aggregation methods, although under somewhat more restrictive conditions.

Finally, RL with general linear function approximation was thoroughly studied in (Tsitsiklis & Van Roy, 1996a; Tadić, 2001). Posterior works extended the applicability of such results. In (Precup et al., 2001), an off-policy convergent algorithm was proposed that uses an importance-sampling principle similar to the one described in Section 3. In (Perkins & Precup, 2003), the authors establish the convergence of SARSA with linear function approximation.

4.2. Concluding Remarks

We conclude by observing that all methods analyzed here, as well as those surveyed above, experience a degradation in performance as the distance between the target function and the chosen linear space increases. If the functions in the chosen linear space provide only a poor approximation of the desired function, there are no practical guarantees on the usefulness of such an approximation. The error bounds derived in (Tsitsiklis & Van Roy, 1996a) are reassuring in that they state that the performance of approximate TD degrades "gracefully" as the distance between the target function and the chosen linear space increases. Although we have not addressed this topic in our analysis, we expect the error bounds in (Tsitsiklis & Van Roy, 1996a) to carry over with little change to our setting.

Finally, we make no use of eligibility traces in our algorithms. However, it is to be expected that the methods described herein can easily be adapted to accommodate eligibility traces, eventually yielding better approximations (with tighter error bounds) (Tsitsiklis & Van Roy, 1996a).
Acknowledgements

The authors would like to acknowledge the many useful comments from Doina Precup and the anonymous reviewers. Francisco S. Melo acknowledges the support from the ICTI and the Portuguese FCT, under the Carnegie Mellon-Portugal Program. Sean Meyn gratefully acknowledges the financial support from the National Science Foundation (ECS-0523620). Isabel Ribeiro was supported by the FCT in the frame of ISR-Lisboa pluriannual funding. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

References

Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. Proc. 12th Int. Conf. Machine Learning (pp. 30-37).

Benveniste, A., Métivier, M., & Priouret, P. (1990). Adaptive algorithms and stochastic approximations, vol. 22. Springer-Verlag.

Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-dynamic programming. Athena Scientific.

Borkar, V. (1997). Stochastic approximation with two time scales. Systems & Control Letters, 29, 291-294.

Borkar, V. (2000). A learning algorithm for discrete-time stochastic control. Probability in the Engineering and Informational Sciences, 14, 243-258.

de Farias, D., & Van Roy, B. (2000). On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105, 589-608.

Diaconis, P., & Saloff-Coste, L. (1996). Logarithmic Sobolev inequalities for finite Markov chains. Annals of Applied Probability, 6, 695-750.

Gordon, G. (1995). Stable function approximation in dynamic programming (Technical Report CMU-CS-95-103). School of Computer Science, Carnegie Mellon University.

Gordon, G. (1996). Chattering in SARSA($\lambda$) (Technical Report). CMU Learning Lab Internal Report.

Meyn, S., & Tweedie, R. (1993). Markov chains and stochastic stability. Springer-Verlag.

Meyn, S., & Tweedie, R. (1994). Computable bounds for geometric convergence rates of Markov chains. Annals of Applied Probability, 4, 981-1011.
Ormoneit, D., & Sen, S. (2002). Kernel-based reinforcement learning. Machine Learning, 49, 161-178.

Perkins, T., & Pendrith, M. (2002). On the existence of fixed-points for Q-learning and SARSA in partially observable domains. Proc. 19th Int. Conf. Machine Learning (pp. 490-497).

Perkins, T., & Precup, D. (2003). A convergent form of approximate policy iteration. Adv. Neural Information Proc. Systems (pp. 1595-1602).

Precup, D., Sutton, R., & Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. Proc. 18th Int. Conf. Machine Learning (pp. 417-424).

Ribeiro, C., & Szepesvári, C. (1996). Q-learning combined with spreading: Convergence and results. Proc. ISRF-IEE Int. Conf. Intelligent and Cognitive Systems (pp. 32-36).

Rosenthal, J. (2002). Quantitative convergence rates of Markov chains: A simple account. Electronic Communications in Probability, 7, 123-128.

Singh, S., Jaakkola, T., & Jordan, M. (1994). Reinforcement learning with soft state aggregation. Adv. Neural Information Proc. Systems (pp. 361-368).

Sutton, R. (1999). Open theoretical questions in reinforcement learning. Lecture Notes in Computer Science, 1572, 11-17.

Szepesvári, C., & Smart, W. (2004). Interpolation-based Q-learning. Proc. 21st Int. Conf. Machine Learning (pp. 100-107).

Tadić, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42, 241-267.

Tsitsiklis, J., & Van Roy, B. (1996a). An analysis of temporal-difference learning with function approximation. IEEE Trans. Automatic Control, 42, 674-690.

Tsitsiklis, J., & Van Roy, B. (1996b). Feature-based methods for large scale dynamic programming. Machine Learning, 22, 59-94.

Watkins, C. (1989). Learning from delayed rewards. Doctoral dissertation, King's College, University of Cambridge.