Fitted Q-iteration in continuous action-space MDPs

András Antos
Computer and Automation Research Inst. of the Hungarian Academy of Sciences
Kende u. 13-17, Budapest 1111, Hungary
antos@sztaki.hu

Rémi Munos
SequeL team, INRIA Futurs, University of Lille
59653 Villeneuve d'Ascq, France
remi.munos@inria.fr

Csaba Szepesvári*
Department of Computing Science, University of Alberta
Edmonton T6G 2E8, Canada
szepesva@cs.ualberta.ca

Abstract

We consider continuous-state, continuous-action batch reinforcement learning, where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems.

1 Preliminaries

We will build on the results from [1, 2, 3] and for this reason we use the same notation as these papers. The unattributed results cited in this section can be found in the book [4].

A discounted MDP is defined by a quintuple $(\mathcal{X}, \mathcal{A}, P, S, \gamma)$, where $\mathcal{X}$ is the (possibly infinite) state space, $\mathcal{A}$ is the set of actions, $P : \mathcal{X} \times \mathcal{A} \to M(\mathcal{X})$ is the transition probability kernel with $P(\cdot|x,a)$ defining the next-state distribution upon taking action $a$ from state $x$, $S(\cdot|x,a)$ gives the corresponding distribution of immediate rewards, and $\gamma \in (0,1)$ is the discount factor. Here $M(\mathcal{X})$ denotes the space of probability measures over $\mathcal{X}$. We start with the following mild assumption on the MDP:

Assumption A1 (MDP Regularity) $\mathcal{X}$ is a compact subset of the $d_X$-dimensional Euclidean space and $\mathcal{A} \subset [-A_\infty, A_\infty]^{d_A}$. We assume that the random immediate rewards are bounded by $\hat{R}_{\max}$ and that the expected immediate reward function, $r(x,a) = \int r \, S(dr|x,a)$, is uniformly bounded by $R_{\max}$: $\|r\|_\infty \le R_{\max}$, where $\|\cdot\|_\infty$ denotes the supremum norm.

A policy determines the next action given the past observations. Here we shall deal with stationary (Markovian) policies which choose an action in a stochastic way based on the last observation only. The value of a policy $\pi$ when it is started from a state $x$ is defined as the total expected discounted reward that is encountered while the policy is executed: $V^\pi(x) = \mathbb{E}^\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t \,\big|\, X_0 = x\right]$. Here $R_t \sim S(\cdot|X_t, A_t)$ is the reward received at time step $t$, and the state $X_t$ evolves according to $X_{t+1} \sim P(\cdot|X_t, A_t)$, where $A_t$ is sampled from the distribution determined by $\pi$. We use $Q^\pi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ to denote the action-value function of policy $\pi$: $Q^\pi(x,a) = \mathbb{E}^\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t \,\big|\, X_0 = x, A_0 = a\right]$.

The goal is to find a policy that attains the best possible values, $V^*(x) = \sup_\pi V^\pi(x)$, at all states $x \in \mathcal{X}$. Here $V^*$ is called the optimal value function and a policy that satisfies $V^\pi(x) = V^*(x)$ for all $x \in \mathcal{X}$ is called optimal. The optimal action-value function is $Q^*(x,a) = \sup_\pi Q^\pi(x,a)$. We say that a (deterministic stationary) policy $\pi$ is greedy w.r.t. an action-value function $Q \in B(\mathcal{X} \times \mathcal{A})$, and we write $\pi = \hat{\pi}(\cdot\,; Q)$, if, for all $x \in \mathcal{X}$, $\pi(x) \in \operatorname{argmax}_{a \in \mathcal{A}} Q(x,a)$. Under mild technical assumptions, such a greedy policy always exists. Further, the greedy policy w.r.t. $Q^*$ is optimal.

* Also with: Computer and Automation Research Inst. of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary.
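As a small aside, the greedy-policy definition above is easy to illustrate in code. The following Python sketch (an illustration added here, not part of the paper) computes an approximately greedy action by maximizing $Q(x,\cdot)$ over a finite grid of candidate actions; in a truly continuous action space such a grid only approximates the argmax, which is exactly the difficulty the paper addresses. The names `Q_toy`, `grid` and `greedy_action` are hypothetical.

    import numpy as np

    def greedy_action(Q, x, candidate_actions):
        """Return an (approximately) greedy action for state x.

        Q: callable (state, action) -> float, an action-value function.
        candidate_actions: array of shape (m, d_A), a finite grid standing in
            for the continuous action set A; the argmax over A is approximated.
        """
        values = np.array([Q(x, a) for a in candidate_actions])
        return candidate_actions[np.argmax(values)]

    # Toy usage with a quadratic action-value function (purely illustrative):
    Q_toy = lambda x, a: -np.sum((a - 0.1 * x) ** 2)
    grid = np.linspace(-1.0, 1.0, 21).reshape(-1, 1)   # d_A = 1
    a_star = greedy_action(Q_toy, np.array([0.5]), grid)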
For a deterministic stationary policy $\pi$, we define its evaluation operator, $T^\pi : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X} \times \mathcal{A})$, by $(T^\pi Q)(x,a) = r(x,a) + \gamma \int Q(y, \pi(y)) \, P(dy|x,a)$. It is known that $Q^\pi = T^\pi Q^\pi$. Further, if we let the Bellman operator, $T : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X} \times \mathcal{A})$, be defined by $(TQ)(x,a) = r(x,a) + \gamma \int \sup_{a' \in \mathcal{A}} Q(y, a') \, P(dy|x,a)$, then $Q^* = T Q^*$. It is known that $V^*$ and $Q^*$ are bounded by $R_{\max}/(1-\gamma)$, just like $Q^\pi$ and $V^\pi$. Throughout the paper $\mathcal{F} \subset \{ f : \mathcal{X} \times \mathcal{A} \to \mathbb{R} \}$ will denote a subset of real-valued functions over the state-action space $\mathcal{X} \times \mathcal{A}$ and $\Pi \subset \mathcal{A}^{\mathcal{X}}$ will be a set of policies.

2 Fitted Q-iteration with approximate policy maximization

We assume that we are given a finite trajectory, $\{(X_t, A_t, R_t)\}_{1 \le t \le N}$, generated by some stochastic stationary policy $\pi_b$, called the behavior policy: $A_t \sim \pi_b(\cdot|X_t)$, $X_{t+1} \sim P(\cdot|X_t, A_t)$, $R_t \sim S(\cdot|X_t, A_t)$, where $\pi_b(\cdot|x)$ is a density with $\pi_0 \stackrel{\mathrm{def}}{=} \inf_{(x,a) \in \mathcal{X} \times \mathcal{A}} \pi_b(a|x) > 0$.

The generic recipe for fitted Q-iteration (FQI) [5] is

  $Q_{k+1} = \mathrm{Regress}(D_k(Q_k))$,   (1)

where Regress is an appropriate regression procedure and $D_k(Q_k)$ is a dataset defining a regression problem in the form of a list of data-point pairs:

  $D_k(Q_k) = \left\{ \left( (X_t, A_t),\; R_t + \gamma \max_{b \in \mathcal{A}} Q_k(X_{t+1}, b) \right) : 1 \le t \le N \right\}$.¹

¹ Since the designer controls $Q_k$, we may assume that it is continuous, hence the maximum exists.

Fitted Q-iteration can be viewed as approximate value iteration applied to action-value functions. To see this, note that value iteration would assign the value $(TQ_k)(x,a) = r(x,a) + \gamma \int \max_{b \in \mathcal{A}} Q_k(y, b) \, P(dy|x,a)$ to $Q_{k+1}(x,a)$ [6]. Now, remember that the regression function for the jointly distributed random variables $(Z, Y)$ is defined by the conditional expectation of $Y$ given $Z$: $m(Z) = \mathbb{E}[Y|Z]$. Since for any fixed function $Q$, $\mathbb{E}\left[R_t + \gamma \max_{b \in \mathcal{A}} Q(X_{t+1}, b) \,\big|\, X_t, A_t\right] = (TQ)(X_t, A_t)$, the regression function corresponding to the dataset $D_k(Q)$ is indeed $TQ$, and hence if FQI could solve the regression problem defined by $Q_k$ exactly, it would simulate value iteration exactly.

However, this argument itself does not directly lead to a rigorous analysis of FQI: since $Q_k$ is obtained based on the data, it is itself a random function. Hence, after the first iteration, the "target" function in FQI becomes random. Furthermore, this function depends on the same data that is used to define the regression problem. Will FQI still work despite these issues? To illustrate the potential difficulties, consider a dataset where $X_1, \ldots, X_N$ is a sequence of independent random variables, all distributed uniformly at random in $[0,1]$. Further, let $M$ be a random integer greater than $N$ which is independent of the dataset $(X_t)_{t=1}^N$. Let $U$ be another random variable, uniformly distributed in $[0,1]$. Now define the regression problem by $Y_t = f_{M,U}(X_t)$, where $f_{M,U}(x) = \mathrm{sgn}(\sin(2^M 2\pi (x+U)))$. Then it is not hard to see that no matter how big $N$ is, no procedure can estimate the regression function $f_{M,U}$ with a small error (in expectation, or with high probability), even if the procedure could exploit the knowledge of the specific form of $f_{M,U}$. On the other hand, if we restricted $M$ to a finite range, then the estimation problem could be solved successfully. The example shows that if the complexity of the random functions defining the regression problem is uncontrolled, then successful estimation might be impossible.
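Before continuing, it may help to see the generic recipe (1) in code. The following Python sketch (an illustration added here, not part of the paper) performs a single FQI step: it builds the regression dataset $D_k(Q_k)$ from the observed transitions and hands it to an arbitrary regression procedure. The helper names `transitions`, `regress` and `candidate_actions` are hypothetical, and the maximum over the continuous action set is approximated on a finite grid of candidate actions.

    def fqi_step(transitions, Q_k, regress, candidate_actions, gamma):
        """One step of the generic fitted Q-iteration recipe, Equation (1).

        transitions: list of (x, a, r, x_next) tuples from the behavior policy.
        Q_k: current action-value estimate, a callable (state, action) -> float.
        regress: any regression procedure mapping (inputs, targets) to a fitted
            function Q_{k+1}; it plays the role of Regress in Equation (1).
        candidate_actions: finite grid standing in for the continuous action set A,
            so the max over A is only approximated here.
        gamma: discount factor in (0, 1).
        """
        inputs, targets = [], []
        for (x, a, r, x_next) in transitions:
            # Regression target: R_t + gamma * max_b Q_k(X_{t+1}, b).
            backup = r + gamma * max(Q_k(x_next, b) for b in candidate_actions)
            inputs.append((x, a))
            targets.append(backup)
        return regress(inputs, targets)   # Q_{k+1} = Regress(D_k(Q_k))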
Amongst the many regression methods, in this paper we have chosen to work with least-squares methods. In this case Equation (1) takes the form

  $Q_{k+1} = \operatorname{argmin}_{Q \in \mathcal{F}} \sum_{t=1}^{N} \frac{1}{\pi_b(A_t|X_t)} \left( Q(X_t, A_t) - \left[ R_t + \gamma \max_{b \in \mathcal{A}} Q_k(X_{t+1}, b) \right] \right)^2$.   (2)

We call this method the least-squares fitted Q-iteration (LSFQI) method. Here we introduced the weighting $1/\pi_b(A_t|X_t)$ since we do not want to give more weight to those actions that are preferred by the behavior policy. Other weighting factors, expressing one's beliefs about what the behavior policy is, are also possible. Besides this weighting, the only parameter of the method is the function set $\mathcal{F}$. This function set should be chosen carefully, to keep a balance between the representation power and the number of samples. As a specific example for $\mathcal{F}$, consider neural networks with some fixed architecture. In this case the function set is generated by assigning weights in all possible ways to the neural net, and the above minimization becomes the problem of tuning the weights. Another example is to use linearly parameterized function approximation methods with appropriately selected basis functions. In this case the weight tuning problem would be less demanding. Yet another possibility is to let $\mathcal{F}$ be an appropriate restriction of a Reproducing Kernel Hilbert Space (e.g., a ball in it). In this case the training procedure becomes similar to LS-SVM training [7].

As indicated above, the analysis of this algorithm is complicated by the fact that the new dataset is defined in terms of the previous iterate, which is already a function of the dataset. Another complication is that the samples in a trajectory are in general correlated and that the bias introduced by the imperfections of the approximation architecture may lead to an explosion of the error of the procedure, as documented in a number of cases in, e.g., [8].

Nevertheless, at least for finite action sets, the tools developed in [1, 3, 2] look suitable to show that under appropriate conditions these problems can be overcome if the function set is chosen in a judicious way. However, the results of these works would become essentially useless in the case of an infinite number of actions, since these previous bounds grow to infinity with the number of actions. Actually, we believe that this is not an artifact of the proof techniques of these works, as suggested by the counterexample that involved random targets. The following result elaborates this point further:

Proposition 2.1. Let $\mathcal{F} \subset B(\mathcal{X} \times \mathcal{A})$. Then even if the pseudo-dimension of $\mathcal{F}$ is finite, the fat-shattering function of

  $\mathcal{F}^\vee_{\max} = \left\{ V_Q : V_Q(\cdot) = \max_{a \in \mathcal{A}} Q(\cdot, a),\; Q \in \mathcal{F} \right\}$

can be infinite over $(0, 1/2)$.²

² The proof of this and the other results are given in the appendix, available in the extended version of this paper, downloadable from http://hal.inria.fr/inria-00185311/en/.

Without going into further details, let us just note that the finiteness of the fat-shattering function is a necessary and sufficient condition for learnability, and the finiteness of the fat-shattering function is implied by the finiteness of the pseudo-dimension [9]. The above proposition thus shows that without imposing further special conditions on $\mathcal{F}$, the learning problem may become infeasible.

One possibility is of course to discretize the action space, e.g., by using a uniform grid. However, if the action space has a really high dimensionality, this approach becomes infeasible (even enumerating $2^{d_A}$ points could be impossible when $d_A$ is large). Therefore we prefer alternate solutions. Another possibility is to make the functions in $\mathcal{F}$, e.g., uniformly Lipschitz in their state coordinates. Then the same property will hold for the functions in $\mathcal{F}^\vee_{\max}$ and hence, by a classical result, we can bound the capacity of this set (cf. pp. 353-357 of [10]).
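Returning to the least-squares update (2): when $\mathcal{F}$ is linearly parameterized, $Q(x,a) = \phi(x,a)^\top \theta$, the minimization in (2) is a weighted least-squares problem with weights $1/\pi_b(A_t|X_t)$ and has a closed-form solution. The Python sketch below is an illustration added here under that assumption; `phi` (the feature map), `behavior_density` and `candidate_actions` are hypothetical names, and the maximum over the continuous action set is again approximated on a finite grid.

    import numpy as np

    def lsfqi_step(transitions, behavior_density, phi, theta_k, candidate_actions, gamma):
        """One LSFQI step, Equation (2), with Q(x, a) = phi(x, a) . theta."""
        features, targets, weights = [], [], []
        for (x, a, r, x_next) in transitions:
            # Target: R_t + gamma * max_b Q_k(X_{t+1}, b), max taken over the grid.
            target = r + gamma * max(np.dot(phi(x_next, b), theta_k)
                                     for b in candidate_actions)
            features.append(phi(x, a))
            targets.append(target)
            weights.append(1.0 / behavior_density(a, x))   # weight 1 / pi_b(A_t | X_t)
        Phi = np.array(features)
        y = np.array(targets)
        sqrt_w = np.sqrt(np.array(weights))
        # Weighted least squares: minimize sum_t w_t (phi(X_t, A_t) . theta - y_t)^2.
        theta_next, *_ = np.linalg.lstsq(Phi * sqrt_w[:, None], y * sqrt_w, rcond=None)
        return theta_next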
One potential problem with the latter approach (restricting $\mathcal{F}$ to functions that are uniformly Lipschitz in their state coordinates) is that this way it might be difficult to get a fine control of the capacity of the resulting set.

In the approach explored here, we modify the fitted Q-iteration algorithm by introducing a policy set $\Pi$ and a search over this set for an approximately greedy policy, in a sense that will be made precise in a minute. Our algorithm thus has four parameters: $\mathcal{F}$, $\Pi$, $K$, $Q_0$. Here $\mathcal{F}$ is as before, $\Pi$ is a user-chosen set of policies (mappings from $\mathcal{X}$ to $\mathcal{A}$), $K$ is the number of iterations and $Q_0$ is an initial value function (a typical choice is $Q_0 \equiv 0$). The algorithm computes a sequence of iterates $(Q_k, \hat{\pi}_k)$, $k = 0, \ldots, K$, defined by the following equations:

  $\hat{\pi}_0 = \operatorname{argmax}_{\pi \in \Pi} \sum_{t=1}^{N} Q_0(X_t, \pi(X_t))$,

  $Q_{k+1} = \operatorname{argmin}_{Q \in \mathcal{F}} \sum_{t=1}^{N} \frac{1}{\pi_b(A_t|X_t)} \left( Q(X_t, A_t) - \left[ R_t + \gamma\, Q_k(X_{t+1}, \hat{\pi}_k(X_{t+1})) \right] \right)^2$,   (3)

  $\hat{\pi}_{k+1} = \operatorname{argmax}_{\pi \in \Pi} \sum_{t=1}^{N} Q_{k+1}(X_t, \pi(X_t))$.   (4)

Thus, (3) is similar to (2), while (4) defines the policy search problem. The policy search will generally be solved by a gradient procedure or some other appropriate method. The cost of this step will be primarily determined by how well-behaved the iterates $Q_{k+1}$ are in their action arguments. For example, if they were quadratic and if the policies in $\Pi$ were linear, then the problem would be a quadratic optimization problem. However, except for special cases³ the action-value functions will be more complicated, in which case this step can be expensive. Still, this cost could be similar to that of searching for the maximizing actions for each $t = 1, \ldots, N$ if the approximately maximizing actions are similar across similar states.

³ Linear quadratic regulation is such a nice case. It is interesting to note that in this special case the obvious choices for $\mathcal{F}$ and $\Pi$ yield zero error in the limit, as can be proven based on the main result of this paper.

This algorithm will be shown to overcome the above mentioned complexity control problem provided that the complexity of $\Pi$ is controlled appropriately. Indeed, in this case the set of possible regression problems is determined by the set

  $\mathcal{F}_\Pi = \{ f : f(\cdot) = Q(\cdot, \pi(\cdot)),\; Q \in \mathcal{F},\; \pi \in \Pi \}$,

and the proof will rely on controlling the complexity of $\mathcal{F}_\Pi$ by selecting $\mathcal{F}$ and $\Pi$ appropriately.
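Before turning to the analysis, a minimal Python sketch of the iteration (3)-(4) may help fix ideas (an illustration added here, not part of the paper). It treats $\Pi$ as a finite collection of candidate policies, so that the policy-search step (4) can be solved by enumeration; with a parameterized $\Pi$ the same step would typically be carried out by gradient ascent. The name `fit_q` stands for any implementation of the regression step (3), for instance the weighted least-squares step sketched earlier.

    def fqi_with_policy_search(transitions, policy_set, fit_q, Q0, K, gamma):
        """Fitted Q-iteration with approximate policy maximization, Eqs. (3)-(4).

        policy_set: finite collection Pi of candidate policies, each a callable
            mapping a state to an action.
        fit_q: regression routine implementing step (3); given the transitions,
            the current Q_k and the current policy pi_hat_k, it returns Q_{k+1}.
        Q0: initial action-value function, a callable (state, action) -> float.
        """
        states = [x for (x, a, r, x_next) in transitions]

        def empirical_value(Q, pi):
            # Objective of the policy-search step (4): sum_t Q(X_t, pi(X_t)).
            return sum(Q(x, pi(x)) for x in states)

        pi = max(policy_set, key=lambda p: empirical_value(Q0, p))     # pi_hat_0
        Q = Q0
        for k in range(K):
            Q = fit_q(transitions, Q, pi, gamma)                       # step (3)
            pi = max(policy_set, key=lambda p: empirical_value(Q, p))  # step (4)
        return Q, pi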
3 The main theoretical result

3.1 Outline of the analysis

In order to gain some insight into the behavior of the algorithm, we provide a brief summary of its error analysis. The main result will be presented subsequently.

For $f, Q \in \mathcal{F}$ and a policy $\hat{\pi}$, we define the $t$-th TD-error as follows:

  $d_t(f; Q, \hat{\pi}) = R_t + \gamma Q(X_{t+1}, \hat{\pi}(X_{t+1})) - f(X_t, A_t)$.

Further, we define the empirical loss function by

  $\hat{L}_N(f; Q, \hat{\pi}) = \frac{1}{N} \sum_{t=1}^{N} \frac{d_t^2(f; Q, \hat{\pi})}{\lambda(\mathcal{A})\, \pi_b(A_t|X_t)}$,   (5)

where the normalization with $\lambda(\mathcal{A})$ is introduced for mathematical convenience. Then (3) can be written compactly as

  $Q_{k+1} = \operatorname{argmin}_{f \in \mathcal{F}} \hat{L}_N(f; Q_k, \hat{\pi}_k)$.

Let $\|Q\|_\nu^2 = \int Q^2(x,a) \, d\nu(x) \, d\lambda_{\mathcal{A}}(a)$, where $\nu$ is the steady-state distribution of $\{X_t\}$ and $\lambda_{\mathcal{A}}$ is the uniform distribution over $\mathcal{A}$. The algorithm can then be motivated by the observation that for any $f$, $Q$ and $\hat{\pi}$, $\hat{L}_N(f; Q, \hat{\pi})$ is an unbiased estimate of

  $L(f; Q, \hat{\pi}) \stackrel{\mathrm{def}}{=} \|f - T^{\hat{\pi}} Q\|_\nu^2 + L^*(Q, \hat{\pi})$,   (6)

where the first term is the error we are interested in and the second term captures the variance of the random samples:

  $L^*(Q, \hat{\pi}) = \int_{\mathcal{A}} \mathbb{E}\left[ \operatorname{Var}\left[ R_1 + \gamma Q(X_2, \hat{\pi}(X_2)) \,\big|\, X_1, A_1 = a \right] \right] d\lambda_{\mathcal{A}}(a)$.

This result is stated formally in the following lemma:

Lemma 3.1 (Unbiased Loss Approximation). Assume that $\pi_0 > 0$. Then for any $f, Q \in \mathcal{F}$ and any policy $\hat{\pi}$, $\hat{L}_N(f; Q, \hat{\pi})$ as defined by (5) provides an unbiased estimate of $L(f; Q, \hat{\pi})$:

  $\mathbb{E}\left[ \hat{L}_N(f; Q, \hat{\pi}) \right] = L(f; Q, \hat{\pi})$.   (7)

Since the variance term in (6) is independent of $f$, $\operatorname{argmin}_{f \in \mathcal{F}} L(f; Q, \hat{\pi}) = \operatorname{argmin}_{f \in \mathcal{F}} \|f - T^{\hat{\pi}} Q\|_\nu^2$. Thus, if $\hat{\pi}_k$ were greedy w.r.t. $Q_k$, then $\operatorname{argmin}_{f \in \mathcal{F}} L(f; Q_k, \hat{\pi}_k) = \operatorname{argmin}_{f \in \mathcal{F}} \|f - T Q_k\|_\nu^2$. Hence we can still think of the procedure as approximate value iteration over the space of action-value functions, projecting $T Q_k$ in an approximate manner using empirical risk minimization on the space $\mathcal{F}$ w.r.t. the $\|\cdot\|_\nu$ distance. Since $\hat{\pi}_k$ is only approximately greedy, we will have to deal with both the error coming from the approximate projection and the error coming from the choice of $\hat{\pi}_k$. To make this clear, we write the iteration in the form

  $Q_{k+1} = T^{\hat{\pi}_k} Q_k + \varepsilon'_k = T Q_k + \varepsilon'_k + (T^{\hat{\pi}_k} Q_k - T Q_k) = T Q_k + \varepsilon_k$,   (8)

where $\varepsilon'_k$ is the error committed while computing $T^{\hat{\pi}_k} Q_k$, $\varepsilon''_k \stackrel{\mathrm{def}}{=} T^{\hat{\pi}_k} Q_k - T Q_k$ is the error committed because the greedy policy is computed approximately, and $\varepsilon_k = \varepsilon'_k + \varepsilon''_k$ is the total error of step $k$. Hence, in order to show that the procedure is well behaved, one needs to show that both errors are controlled and that when the errors are propagated through these equations, the resulting error stays controlled, too. Since we are ultimately interested in the performance of the policy obtained, we will also need to show that small action-value approximation errors yield small performance losses. For these we need a number of assumptions that concern either the training data, the MDP, or the function sets used for learning.

3.2 Assumptions

3.2.1 Assumptions on the training data

We shall assume that the data is rich, is in a steady state, and is fast-mixing, where, informally, mixing means that the future depends only weakly on the past.

Assumption A2 (Sample Path Properties) Assume that $\{(X_t, A_t, R_t)\}_{t=1,\ldots,N}$ is the sample path of $\pi_b$, a stochastic stationary policy. Further, assume that $\{X_t\}$ is strictly stationary ($X_t \sim \nu \in M(\mathcal{X})$) and exponentially $\beta$-mixing with the actual rate given by the parameters $(\bar{\beta}, b, \kappa)$.⁴ We further assume that the sampling policy $\pi_b$ satisfies $\pi_0 \stackrel{\mathrm{def}}{=} \inf_{(x,a) \in \mathcal{X} \times \mathcal{A}} \pi_b(a|x) > 0$.

⁴ For the definition of $\beta$-mixing, see, e.g., [2].

The $\beta$-mixing property will be used to establish tail inequalities for certain empirical processes. Note that the mixing coefficients do not need to be known. In the case when no mixing condition is satisfied, learning might be impossible. To see this, just consider the case when $X_1 = X_2 = \ldots = X_N$. In this case the learner has many copies of the same random variable and successful generalization is thus impossible. We believe that the assumption that the process is in a steady state is not essential for our result: when the process reaches its steady state quickly, then (at the price of a more involved proof) the result would still hold.

3.2.2 Assumptions on the MDP

In order to prevent the uncontrolled growth of the errors as they are propagated through the updates, we shall need some assumptions on the MDP. A convenient assumption is the following one [11]:

Assumption A3 (Uniformly stochastic transitions) For all $x \in \mathcal{X}$ and $a \in \mathcal{A}$, assume that $P(\cdot|x,a)$ is absolutely continuous w.r.t. $\nu$ and that the Radon-Nikodym derivative of $P$ w.r.t. $\nu$ is bounded uniformly with bound $C_\nu$: $C_\nu \stackrel{\mathrm{def}}{=} \sup_{x \in \mathcal{X}, a \in \mathcal{A}} \left\| \frac{dP(\cdot|x,a)}{d\nu} \right\|_\infty < +\infty$.

Note that by the definition of measure differentiation, Assumption A3 means that $P(\cdot|x,a) \le C_\nu \nu(\cdot)$. This assumption essentially requires the transitions to be noisy.
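To get a feel for why bounded per-step errors remain harmless when propagated through the updates (the concern behind the decomposition (8) and behind the assumptions of this subsection), the following small, self-contained Python experiment, added here as a toy illustration on a finite MDP and not part of the paper's analysis, runs value iteration with bounded noise injected at every step and compares the result with the (numerically computed) optimal action-value function. By the standard contraction argument, the final sup-norm error is at most of the order of $\varepsilon_{\max}/(1-\gamma)$ plus a $\gamma^K$ term.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma, K, eps_max = 20, 5, 0.9, 60, 0.05

    # A random finite MDP: transition kernel P[a, x, :] and expected rewards r[x, a].
    P = rng.random((n_actions, n_states, n_states))
    P /= P.sum(axis=2, keepdims=True)
    r = rng.random((n_states, n_actions))

    def bellman(Q):
        # (T Q)(x, a) = r(x, a) + gamma * sum_y P(y | x, a) * max_b Q(y, b)
        V = Q.max(axis=1)
        return r + gamma * np.einsum('axy,y->xa', P, V)

    # Exact value iteration, run long enough to approximate Q*.
    Q_star = np.zeros((n_states, n_actions))
    for _ in range(2000):
        Q_star = bellman(Q_star)

    # Noisy value iteration: Q_{k+1} = T Q_k + eps_k with ||eps_k||_inf <= eps_max,
    # mimicking the total per-step error eps_k of Equation (8).
    Q = np.zeros((n_states, n_actions))
    for _ in range(K):
        Q = bellman(Q) + rng.uniform(-eps_max, eps_max, size=Q.shape)

    observed_error = np.abs(Q - Q_star).max()
    error_bound = eps_max / (1 - gamma) + gamma ** K * np.abs(Q_star).max()
    print(observed_error, error_bound)   # the observed error stays below the bound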
We will also prove (weaker) results under the following, weaker assumption:

Assumption A4 (Discounted-average concentrability of future-state distributions) Given $\rho$, $\nu$, $m \ge 1$ and an arbitrary sequence of stationary policies $\{\pi_m\}_{m \ge 1}$, assume that the future-state distribution $\rho P^{\pi_1} P^{\pi_2} \cdots P^{\pi_m}$ is absolutely continuous w.r.t. $\nu$. Assume that $c(m) \stackrel{\mathrm{def}}{=} \sup_{\pi_1, \ldots, \pi_m} \left\| \frac{d(\rho P^{\pi_1} P^{\pi_2} \cdots P^{\pi_m})}{d\nu} \right\|_\infty$ satisfies $C_{\rho,\nu} \stackrel{\mathrm{def}}{=} (1-\gamma)^2 \sum_{m \ge 1} m \gamma^{m-1} c(m) < +\infty$. We shall call $C_{\rho,\nu}$ the discounted-average concentrability coefficient of the future-state distributions.

The number $c(m)$ measures how much $\rho$ can get amplified in $m$ steps as compared to the reference distribution $\nu$. Hence, in general we expect $c(m)$ to grow with $m$. In fact, the condition that $C_{\rho,\nu}$ is finite is a growth-rate condition on $c(m)$. Thanks to discounting, $C_{\rho,\nu}$ is finite for a reasonably large class of systems (see the discussion in [11]).

A related assumption is needed in the error analysis of the approximate greedy step of the algorithm:

Assumption A5 (The random policy "goes everywhere") Consider the distribution $\mu = (\nu \times \lambda_{\mathcal{A}}) P$, which is the distribution of a state that results from sampling an initial state according to $\nu$ and then executing an action which is selected uniformly at random.⁵ Then $\Gamma_\nu \stackrel{\mathrm{def}}{=} \| d\mu / d\nu \|_\infty < +\infty$.

Note that under Assumption A3 we have $\Gamma_\nu \le C_\nu$. This (very mild) assumption means that, after one step, the state distribution obtained by starting from $\nu$ and acting uniformly at random cannot concentrate on a small part of the state space.

Besides, we assume that $\mathcal{A}$ has the following regularity property. Let

  $\mathrm{Py}(a, h, \rho) = \left\{ (a', v) : a' \in \mathbb{R}^{d_A},\; \|a - a'\|_1 \le \rho,\; 0 \le v/h \le 1 - \|a - a'\|_1 / \rho \right\}$

denote the pyramid with height $h$ and center $a$, whose base is the $\ell^1$-ball $B(a, \rho) = \{ a' \in \mathbb{R}^{d_A} : \|a - a'\|_1 \le \rho \}$.

Assumption A6 (Regularity of the action space) Let $\lambda$ denote the Lebesgue measure. We assume that there exists $\alpha > 0$ such that for all $a \in \mathcal{A}$ and for all $\rho > 0$,

  $\frac{\lambda\!\left(\mathrm{Py}(a, 1, \rho) \cap (\mathcal{A} \times \mathbb{R})\right)}{\lambda\!\left(\mathrm{Py}(a, 1, \rho)\right)} \ge \min\left( \alpha,\; \frac{\lambda(\mathcal{A})}{\lambda(B(a, \rho))} \right)$.

For example, if $\mathcal{A}$ is an $\ell^1$-ball itself, then this assumption is satisfied with $\alpha = 2^{-d_A}$.

Without assuming any smoothness of the MDP, learning in infinite MDPs looks hard (see, e.g., [12, 13]). Here we employ the following extra condition:

Assumption A7 (Lipschitzness of the MDP in the actions) Assume that the transition probabilities and rewards are Lipschitz w.r.t. their action variable, i.e., there exist $L_P, L_r > 0$ such that for all $(x, a, a') \in \mathcal{X} \times \mathcal{A} \times \mathcal{A}$ and all measurable sets $B$ of $\mathcal{X}$,

  $|P(B|x,a) - P(B|x,a')| \le L_P \|a - a'\|_1$,
  $|r(x,a) - r(x,a')| \le L_r \|a - a'\|_1$.

Note that previously Lipschitzness w.r.t. the state variables was used, e.g., in [11] to construct consistent planning algorithms.

3.2.3 Assumptions on the function sets used by the algorithm

These assumptions are less demanding since they are under the control of the user of the algorithm. However, the choice of these function sets will greatly influence the performance of the algorithm, as we shall see from the bounds. The first assumption concerns the class $\mathcal{F}$:

Assumption A8 (Lipschitzness of candidate action-value functions) Assume $\mathcal{F} \subset B(\mathcal{X} \times \mathcal{A})$ and that any element of $\mathcal{F}$ is uniformly Lipschitz in its action argument in the sense that $|Q(x,a) - Q(x,a')| \le L_A \|a - a'\|_1$ holds for any $x \in \mathcal{X}$, $a, a' \in \mathcal{A}$ and $Q \in \mathcal{F}$.

We shall also need to control the capacity of our function sets. We assume that the reader is familiar with the concept of VC-dimension.⁶ Here we use the pseudo-dimension of function sets, which builds upon the concept of VC-dimension.

⁵ Remember that $\lambda_{\mathcal{A}}$ denotes the uniform distribution over the action set $\mathcal{A}$.
⁶ Readers not familiar with VC-dimension are suggested to consult a book, such as the one by Anthony and Bartlett [14].

Definition 3.2 (Pseudo-dimension). The pseudo-dimension $V_{\mathcal{F}^+}$ of $\mathcal{F}$ is defined as the VC-dimension of the subgraphs of functions in $\mathcal{F}$ (hence it is also called the VC-subgraph dimension of $\mathcal{F}$).

Since $\mathcal{A}$ is multidimensional, we define $V_{\Pi^+}$ to be the sum of the pseudo-dimensions of the coordinate projection spaces $\Pi_k$ of $\Pi$:

  $V_{\Pi^+} = \sum_{k=1}^{d_A} V_{\Pi_k^+}$,   $\Pi_k = \{ \pi_k : \mathcal{X} \to \mathbb{R} \,:\, \pi = (\pi_1, \ldots, \pi_k, \ldots, \pi_{d_A}) \in \Pi \}$.

Now we are ready to state our assumptions on our function sets:

Assumption A9 (Capacity of the function and policy sets) Assume that $\mathcal{F} \subset B(\mathcal{X} \times \mathcal{A}; Q_{\max})$ for $Q_{\max} > 0$ and $V_{\mathcal{F}^+} < +\infty$. Also, $V_{\Pi^+} < +\infty$.

Besides their capacity, one shall also control the approximation power of the function sets involved. Let us first consider the policy set $\Pi$. Define the operator $E^\vee$ by $(E^\vee Q)(x) = \max_{a \in \mathcal{A}} Q(x,a)$, the operator $E^\pi$ by $(E^\pi Q)(x) = \int Q(x,a) \, d\pi(a|x)$, and let $\nu V = \int V(x) \, d\nu(x)$. Introduce

  $e^\vee(\mathcal{F}, \Pi) = \sup_{Q \in \mathcal{F}} \inf_{\pi \in \Pi} \nu\!\left( E^\vee Q - E^\pi Q \right)$.

Note that $\inf_{\pi \in \Pi} \nu(E^\vee Q - E^\pi Q)$ measures the quality of approximating $E^\vee Q$ by $E^\pi Q$. Hence, $e^\vee(\mathcal{F}, \Pi)$ measures the worst-case approximation error of $E^\vee Q$ as $Q$ is changed within $\mathcal{F}$. This can be made small by choosing $\Pi$ large.

Another related quantity is the one-step Bellman-error of $\mathcal{F}$ w.r.t. $\Pi$. This is defined as follows: for a fixed policy $\hat{\pi}$, the one-step Bellman-error of $\mathcal{F}$ w.r.t. $T^{\hat{\pi}}$ is defined as

  $E_1(\mathcal{F}; \hat{\pi}) = \sup_{Q \in \mathcal{F}} \inf_{Q' \in \mathcal{F}} \left\| Q' - T^{\hat{\pi}} Q \right\|_\nu$.

Taking again a pessimistic approach, the one-step Bellman-error of $\mathcal{F}$ w.r.t. $\Pi$ is defined as

  $E_1(\mathcal{F}, \Pi) = \sup_{\hat{\pi} \in \Pi} E_1(\mathcal{F}; \hat{\pi})$.

Typically, by increasing $\mathcal{F}$, $E_1(\mathcal{F}, \Pi)$ can be made smaller (this is discussed at some length in [3]). However, it also holds for both $\Pi$ and $\mathcal{F}$ that making them bigger will increase their capacity (pseudo-dimensions), which leads to an increase of the estimation errors. Hence, $\mathcal{F}$ and $\Pi$ must be selected to balance the approximation and estimation errors, just like in supervised learning.

For $p \ge 1$, let $\|f\|_{p,\rho}^p = \int_{\mathcal{X}} |f(x)|^p \, d\rho(x)$. Our main result is the following theorem:

Theorem 3.3. Under Assumptions A1, A2, and A5-A9, for all $\delta > 0$ we have, with probability at least $1 - \delta$: given Assumption A3 (respectively A4), $\|V^* - V^{\pi_K}\|_\infty$ (resp. $\|V^* - V^{\pi_K}\|_{1,\rho}$) is bounded by

  $C \left( E_1(\mathcal{F}, \Pi) + e^\vee(\mathcal{F}, \Pi) + \left[ \frac{\log N + \log(K/\delta)}{N} \right]^{\frac{1}{4(d_A+1)}} + \gamma^K \right)$,

where $C$ depends on $d_A$, $V_{\mathcal{F}^+}$, $(V_{\Pi_k^+})_{k=1}^{d_A}$, $\gamma$, $\kappa$, $b$, $\bar{\beta}$, $C_\nu$ (resp. $C_{\rho,\nu}$), $\Gamma_\nu$, $L_A$, $L_P$, $L_r$, $\alpha$, $\lambda(\mathcal{A})$, $\pi_0$, $Q_{\max}$, $R_{\max}$, $\hat{R}_{\max}$, and $A_\infty$. In particular, $C$ scales with $V^{\frac{1}{4(d_A+1)}}$, where $V = 2 V_{\mathcal{F}^+} + V_{\Pi^+}$ plays the role of the "combined effective" dimension of $\mathcal{F}$ and $\Pi$.

4 Discussion

We have presented what we believe is the first finite-time bound for continuous-state and continuous-action RL that uses value functions. Further, this is the first analysis of fitted Q-iteration, an algorithm that has proved to be useful in a number of cases, even when used with non-averagers for which no previous theoretical analysis existed (e.g., [15, 16]). In fact, our main motivation was to show that there is a systematic way of making these algorithms work and, at the same time, to point at possible problem sources. We discussed why it can be difficult to make these algorithms work in practice. We suggested that either the set of action-value candidates has to be carefully controlled (e.g., by assuming uniform Lipschitzness w.r.t. the state variables), or a policy search step is needed, just like in actor-critic algorithms. The bound in this paper is similar in many respects to a previous bound for a Bellman-residual minimization algorithm [2].
It appears that the techniques developed here can be used to obtain results for that algorithm when it is applied to continuous action spaces. Finally, although we have not explored them here, consistency results for FQI can be obtained from our results using standard methods, like the method of sieves. We believe that the methods developed here will eventually lead to algorithms where the function approximation methods are chosen based on the data (similarly to adaptive regression methods) so as to optimize performance, which in our opinion is one of the biggest open questions in RL. Currently we are exploring this possibility.

Acknowledgments

András Antos would like to acknowledge support for this project from the Hungarian Academy of Sciences (Bolyai Fellowship). Csaba Szepesvári gratefully acknowledges the support received from the Alberta Ingenuity Fund, NSERC, and the Computer and Automation Research Institute of the Hungarian Academy of Sciences.

References

[1] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In COLT-19, pages 574-588, 2006.
[2] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2007. (accepted).
[3] A. Antos, Cs. Szepesvári, and R. Munos. Value-iteration based fitted policy iteration: learning with a single trajectory. In IEEE ADPRL, pages 330-337, 2007.
[4] D.P. Bertsekas and S.E. Shreve. Stochastic Optimal Control (The Discrete Time Case). Academic Press, New York, 1978.
[5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503-556, 2005.
[6] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Bradford Book. MIT Press, 1998.
[7] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines (and Other Kernel-Based Learning Methods). Cambridge University Press, 2000.
[8] J.A. Boyan and A.W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In NIPS-7, pages 369-376, 1995.
[9] P.L. Bartlett, P.M. Long, and R.C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52:434-452, 1996.
[10] A.N. Kolmogorov and V.M. Tihomirov. ε-entropy and ε-capacity of sets in functional space. American Mathematical Society Translations, 17(2):277-364, 1961.
[11] R. Munos and Cs. Szepesvári. Finite time bounds for sampling based fitted value iteration. Technical report, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary, 2006.
[12] A.Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 406-415, 2000.
[13] P.L. Bartlett and A. Tewari. Sample complexity of policy search with known dynamics. In NIPS-19. MIT Press, 2007.
[14] M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[15] M. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317-328, 2005.
[16] S. Kalyanakrishnan and P. Stone. Batch reinforcement learning in a complex domain. In AAMAS-07, 2007.