Finite-Sample Analysis of LSTD

Alessandro Lazaric alessandro.lazaric@inria.fr
Mohammad Ghavamzadeh mohammad.ghavamzadeh@inria.fr
Rémi Munos remi.munos@inria.fr
INRIA Lille - Nord Europe, Team SequeL, 40 avenue Halley, 59650 Villeneuve d'Ascq, France

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

In this paper we consider the problem of policy evaluation in reinforcement learning, i.e., learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning algorithm. We report a finite-sample analysis of LSTD. We first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing.

1. Introduction

Least-squares temporal-difference (LSTD) learning (Bradtke & Barto, 1996; Boyan, 1999) is a widely used algorithm for prediction in general and, in the context of reinforcement learning (RL), for learning the value function V^π of a given policy π. LSTD has been successfully applied to a number of problems, especially after the development of the least-squares policy iteration (LSPI) algorithm (Lagoudakis & Parr, 2003), which extends LSTD to control problems. More precisely, LSTD computes the fixed point of the operator ΠT^π, where T^π is the Bellman operator of policy π and Π is the projection operator onto a linear function space. Although LSTD and LSPI have been widely used in the RL community, a finite-sample analysis of LSTD (i.e., performance bounds in terms of the number of samples) is still lacking. Most of the theoretical work analyzing LSTD has
been focused on the model-based case, where explicit models of the reward function and the dynamics are available. In particular, Tsitsiklis & Van Roy (1997) showed that the distance between the LSTD solution and the value function V^π is bounded by the distance between V^π and its closest approximation in the linear space, multiplied by a constant which increases as the discount factor γ approaches 1. In this bound, it is assumed that the Markov chain possesses a stationary distribution ν, and the distances are measured according to ν. Bertsekas (2001) reported a similar analysis for the empirical version of LSTD. His analysis reveals a critical dependency on the inverse of the smallest eigenvalue of LSTD's A matrix (note that the LSTD solution is obtained by solving a system of linear equations Ax = b). Nonetheless, Bertsekas (2001) does not provide a finite-sample analysis of the algorithm. On the other hand, Antos et al. (2008) analyzed the modified Bellman residual (MBR) minimization algorithm for a finite number of samples, bounded function spaces, and a µ-norm where µ may differ from the norm induced by the stationary distribution ν. Although MBR minimization was shown to reduce to LSTD in the case of linear spaces, it is not straightforward how the finite-sample bounds derived by Antos et al. (2008) can be extended to the unbounded linear spaces considered by LSTD. In this paper, we report a finite-sample analysis of LSTD. To the best of our knowledge, this is the first complete finite-sample analysis of this widely used algorithm. Our analysis is for a specific implementation of LSTD that we call pathwise LSTD. Pathwise LSTD has two specific characteristics: 1) it takes as input a single trajectory generated by the Markov chain induced by policy π, and 2) it uses the pathwise Bellman operator (precisely defined later), which is constructed to be a γ-contraction w.r.t. the empirical norm.
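As background for the analysis below, the standard sample-based form of LSTD accumulates a matrix A and a vector b along the observed transitions and then solves Aθ = b. The following minimal sketch is our own illustration, not code from the paper; the names `lstd` and `phi` are hypothetical.

```python
import numpy as np

def lstd(states, rewards, phi, gamma, d):
    """Minimal batch LSTD sketch: accumulate A and b over the observed
    transitions of a single trajectory, then solve A theta = b.
    `phi` maps a state to a d-dimensional feature vector."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    for t in range(len(states) - 1):
        f, f_next = phi(states[t]), phi(states[t + 1])
        A += np.outer(f, f - gamma * f_next)
        b += rewards[t] * f
    return np.linalg.solve(A, b)  # assumes A is invertible (cf. Bertsekas' caveat)

# Illustration: a deterministic 2-state cycle with one-hot features, where
# LSTD recovers the exact values V(0) = 4/3 and V(1) = 2/3 for gamma = 0.5.
theta = lstd([0, 1, 0, 1, 0], [1.0, 0.0, 1.0, 0.0], lambda s: np.eye(2)[s], 0.5, 2)
```

With one-hot (tabular) features the linear space contains V exactly, so the solution matches the true value function; with general features LSTD instead returns the fixed point of the projected Bellman operator.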
We first derive a bound on the performance of the pathwise LSTD solution in a setting that we call Markov design. In this setting, the performance is evaluated at the points used by the algorithm to learn an estimate of V. This bound is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. Then, in the case where the Markov chain admits a stationary distribution ν and is β-mixing, we derive generalization bounds w.r.t. the norm induced by ν. Besides providing a full finite-sample analysis of LSTD, the major insights gained by the analysis in the paper can be summarized as follows. The first result is about the existence of the LSTD solution and its performance. In Theorem 1 we show that with a slight modification of the empirical Bellman operator T̂ (leading to the definition of pathwise LSTD), the operator Π̂T̂ (where Π̂ is an empirical projection operator) always has a fixed point v̂, even when the sample-based Gram matrix is not invertible and the Markov chain does not admit a stationary distribution. In this very general setting, it is still possible to derive a bound for the performance of v̂ evaluated on the states of the trajectory, and an analysis of the bound reveals a critical dependency on the smallest strictly-positive eigenvalue ν_n of the sample-based Gram matrix. Then, in the case in which the Markov chain has a stationary distribution ν, it is possible to relate the value of ν_n to the smallest eigenvalue of the Gram matrix defined according to ν. Furthermore, it is possible to generalize the previous performance bound over the entire state space under the measure ν, when the samples are drawn from a stationary β-mixing process (Theorem 2). It is important to note that the asymptotic bound obtained by taking the number of samples n to infinity is equal (up to constants) to the bound of Tsitsiklis & Van Roy (1997) for model-based LSTD.
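To fix ideas before the formal development, the pathwise variant analyzed below can be sketched in a few lines. This is our own illustrative implementation under simplifying assumptions (one-hot features on a toy two-state chain); the helper names are hypothetical, and the pathwise operator it implements is defined precisely in Section 3.

```python
import numpy as np

def pathwise_lstd(Phi, r, gamma):
    """Pathwise LSTD on a single trajectory.  Phi: n x d feature matrix at
    the visited states; r: rewards.  Returns (theta, nu_n)."""
    n, d = Phi.shape
    # Applying the shift operator P_hat to each column: move entries up by one
    # step and zero the last one, matching the truncated pathwise operator.
    P_hat_Phi = np.vstack([Phi[1:], np.zeros((1, d))])
    A = Phi.T @ (Phi - gamma * P_hat_Phi)
    b = Phi.T @ r
    theta = np.linalg.pinv(A) @ b                # minimal-norm solution A^+ b
    eigs = np.linalg.eigvalsh(Phi.T @ Phi / n)   # spectrum of the Gram matrix
    nu_n = eigs[eigs > 1e-12].min()              # smallest strictly positive eigenvalue
    return theta, nu_n

def fixed_point(Phi, r, gamma, iters=500):
    """Independent check: iterate v <- Pi_hat(T_hat v); by the gamma-contraction
    of T_hat and the non-expansiveness of Pi_hat, this converges to v_hat."""
    v = np.zeros(len(r))
    for _ in range(iters):
        y = r.copy()
        y[:-1] += gamma * v[1:]                  # T_hat v (bootstrap truncated at the end)
        coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        v = Phi @ coef                           # Pi_hat(T_hat v)
    return v

Phi = np.eye(2)[[0, 1, 0, 1, 0]]                 # one-hot features on a 2-state path
r = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
theta, nu_n = pathwise_lstd(Phi, r, 0.5)
```

On this toy path the minimal-norm solution is θ̂ = (1.2, 0.6) and ν_n = 0.4; the iterative fixed-point computation returns the same vector v̂ = Φθ̂, illustrating that the direct solve and the contraction argument agree.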
Furthermore, a comparison with the bounds in Antos et al. (2008) shows that we successfully leverage the specific setting of LSTD: 1) the space of functions is linear, and 2) the distribution used to evaluate the performance is the stationary distribution of the Markov chain induced by the policy. In particular, we obtain a better bound both in terms of estimation error, a rate of order O(1/n) instead of O(1/√n) for the squared error, and in terms of approximation error, the minimal distance between the value function V and the space F instead of the inherent Bellman error of F. Finally, the extension in Theorem 3 to the case in which the samples belong to a trajectory generated by a fast-mixing Markov chain shows that it is possible to achieve the same performance as in the case of stationary β-mixing processes.

2. Preliminaries

For a measurable space with domain X, we let S(X) and B(X; L) denote the set of probability measures over X and the space of bounded measurable functions with domain X and bound 0 < L < ∞, respectively. For a measure ν ∈ S(X) and a measurable function f: X → R, we define the ℓ2(ν)-norm of f, ||f||_ν, and for a set of n states X_1, ..., X_n ∈ X, we define the empirical norm ||f||_n, as

||f||_ν² = ∫ f(x)² ν(dx)   and   ||f||_n² = (1/n) Σ_{t=1}^n f(X_t)².

The supremum norm of f is defined as ||f||_∞ = sup_{x ∈ X} |f(x)|. We consider the standard RL framework (Sutton & Barto, 1998) in which a learning agent interacts with a stochastic environment by following a policy π, and this interaction is modeled as a discrete-time discounted Markov chain (MC). A discounted MC is a tuple M^π = ⟨X, R^π, P^π, γ⟩, where the state space X is a subset of a Euclidean space, the reward function R^π: X → R is uniformly bounded by Rmax, the transition kernel P^π is such that for all x ∈ X, P^π(·|x) is a distribution over X, and γ ∈ (0, 1) is a discount factor. The value function of a policy π, V^π, is the unique fixed point of the Bellman operator T^π: B(X; Vmax = Rmax/(1−γ)) → B(X; Vmax) defined by

(T^π V)(x) = R^π(x) + γ ∫_X P^π(dy|x) V(y).
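On a finite state space the operator above reduces to (T V)(x) = R(x) + γ Σ_y P(y|x) V(y), and since T is a γ-contraction in sup norm, its fixed point can be computed by simple iteration. The sketch below is our own toy illustration; the names are hypothetical.

```python
import numpy as np

def bellman_fixed_point(R, P, gamma, tol=1e-12):
    """Iterate V <- R + gamma * P V until convergence; T is a gamma-contraction
    in sup norm, so the iterates converge to the unique V with V = T V."""
    V = np.zeros_like(R)
    while True:
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# 2-state cycle: R = (1, 0), deterministic transitions 0 -> 1 -> 0, gamma = 0.5,
# giving V = (4/3, 2/3).
V = bellman_fixed_point(np.array([1.0, 0.0]), np.array([[0.0, 1.0], [1.0, 0.0]]), 0.5)
```

The example also respects the range of T stated above: with Rmax = 1 and γ = 0.5, every value is bounded by Vmax = Rmax/(1−γ) = 2.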
(To simplify the notation, from now on we remove the dependency on the policy and use M, R, P, V, and T instead of M^π, R^π, P^π, V^π, and T^π.)

To approximate the value function V, we use a linear approximation architecture with parameters α ∈ R^d and basis functions φ_j ∈ B(X; L), j = 1, ..., d. We denote by φ: X → R^d, φ(·) = (φ_1(·), ..., φ_d(·))^⊤, the feature vector, and by F the linear function space spanned by the basis functions φ_j. Thus F = {f_α, α ∈ R^d}, where f_α(·) = φ(·)^⊤α. Let (X_1, ..., X_n) be a sample path (or trajectory) of size n generated by the Markov chain M. Let v ∈ R^n and r ∈ R^n, with v_t = V(X_t) and r_t = R(X_t), be the value vector and the reward vector, respectively. Also, let Φ = [φ(X_1)^⊤; ...; φ(X_n)^⊤] be the feature matrix defined at the states, and F_n = {Φα, α ∈ R^d} ⊂ R^n be the corresponding vector space. We denote by Π̂: R^n → F_n the orthogonal projection onto F_n, defined as Π̂y = arg min_{z ∈ F_n} ||y − z||_n, where ||y||_n² = (1/n) Σ_{t=1}^n y_t². Note that the orthogonal projection Π̂y exists and is unique for any y ∈ R^n.

3. Pathwise LSTD

Pathwise LSTD is a version of LSTD which takes as input a single path X_1, ..., X_n and returns the fixed point of the empirical operator Π̂T̂, where T̂: R^n → R^n is the pathwise Bellman operator defined as

(T̂y)_t = r_t + γ y_{t+1} for 1 ≤ t < n, and (T̂y)_n = r_n.

Note that by defining the operator P̂: R^n → R^n as (P̂y)_t = y_{t+1} for 1 ≤ t < n and (P̂y)_n = 0, we have T̂y = r + γP̂y. The motivation for using the pathwise Bellman operator is that it is a γ-contraction in the empirical ℓ2-norm, i.e., for any y, z ∈ R^n,

||T̂y − T̂z||_n² = γ² ||P̂(y − z)||_n² ≤ γ² ||y − z||_n².

Moreover, the orthogonal projection Π̂ is non-expansive: since ||Π̂y||_n² = ⟨Π̂y, y⟩_n ≤ ||Π̂y||_n ||y||_n by the Cauchy-Schwarz inequality, we obtain ||Π̂y||_n ≤ ||y||_n. Therefore, by the Banach fixed-point theorem, there exists a unique fixed point v̂ of the mapping Π̂T̂, i.e., v̂ = Π̂T̂v̂. We call v̂ the pathwise LSTD solution. Note that the unicity of v̂ does not imply that there exists a unique parameter θ̂ such that v̂ = Φθ̂.

4. Markov Design Bound

Theorem 1. Let X_1, ..., X_n be a trajectory of the Markov chain, and v, v̂ ∈ R^n be the vectors whose components are the value function and the pathwise LSTD solution at {X_i}_{i=1}^n, respectively. Then with probability 1 − δ (w.r.t. the random trajectory), we have

||v̂ − v||_n ≤ (1/(1−γ)) ||Π̂v − v||_n + (1/(1−γ)) γ Vmax L √(d/ν_n) ( √(8 log(2d/δ)/n) + 1/n ),   (1)

where the random variable ν_n is the smallest strictly-positive eigenvalue of the sample-based Gram matrix (1/n) Φ^⊤Φ.

Remark 1. When the eigenvalues of the sample-based Gram matrix (1/n) Φ^⊤Φ are all non-zero, Φ^⊤Φ is invertible, and thus Π̂ = Φ(Φ^⊤Φ)^{-1}Φ^⊤. In this case, the unicity of v̂ implies the unicity of θ̂, since

v̂ = Φθ̂  ⟹  Φ^⊤v̂ = Φ^⊤Φθ̂  ⟹  θ̂ = (Φ^⊤Φ)^{-1}Φ^⊤v̂.   (2)

Since v̂ is the unique fixed point of Π̂T̂, it can be replaced by Π̂T̂v̂ in Eq. 2. Using the definitions of Π̂ and T̂, we obtain Φ^⊤(I − γP̂)Φθ̂ = Φ^⊤r. By defining A = Φ^⊤(I − γP̂)Φ and b = Φ^⊤r, θ̂ can be seen as the unique solution of the d × d system of linear equations Aθ = b.

Remark 2. Note that if there exists a constant ν > 0 such that, with probability 1 − δ′, all the eigenvalues of the sample-based Gram matrix are lower bounded by ν, then Eq. 1 (with ν_n replaced by ν) holds with probability at least 1 − (δ + δ′).

Remark 3. When the sample-based Gram matrix (1/n) Φ^⊤Φ is not invertible, the unicity of v̂ does not imply the unicity of a solution θ̂ to the system Φθ = v̂. However, since v̂ is the unique fixed point of Π̂T̂, the vector v̂ − T̂v̂ is perpendicular to the space F_n, and thus Φ^⊤(v̂ − T̂v̂) = 0. By replacing v̂ with Φθ̂, we obtain Φ^⊤Φθ̂ = Φ^⊤(r + γP̂Φθ̂), and then Φ^⊤(I − γP̂)Φθ̂ = Φ^⊤r. Therefore, we still have the same system of equations Aθ = b as in Remark 1, with the exact same A and b, but now the system may have many solutions (note that since the fixed point v̂ exists, the system always has at least one solution). Among all possible solutions, one may choose the one with minimal norm: θ̂ = A⁺b, where A⁺ is the Moore-Penrose pseudo-inverse of A.

Remark 4. Theorem 1 provides a bound without any reference to the stationary distribution of the Markov chain. In fact, the bound of Eq. 1 holds even when the chain does not possess a stationary distribution. For example, consider a Markov chain on the real line where the transitions always move the states to the right, i.e., P(X_{t+1} ∈ dy | X_t = x) = 0 for y ≤ x. For simplicity, assume that the value function V is bounded and belongs to F. This Markov chain is not recurrent, and thus does not have a stationary distribution. We also assume that the feature vectors φ(X_1), ..., φ(X_n) are sufficiently independent, so that the eigenvalues of (1/n) Φ^⊤Φ are greater than some ν > 0. Then, according to Theorem 1, pathwise LSTD is able to estimate the value function at the states at a rate O(1/√n). This may seem surprising, because at each state X_t the algorithm is only provided with a noisy estimate of the expected value of the next state. However, the estimates are unbiased conditioned on the current state, and we will see in the proof that, using a concentration inequality for martingales, pathwise LSTD is able to learn a good estimate of the value function at a state X_t using noisy pieces of information at other states that may be far away from X_t. In other words, learning the value function at a given state does not require averaging over many samples close to that state. This implies that LSTD does not require the Markov chain to possess a stationary distribution.

Remark 5. The most critical part of the bound in Eq. 1 is the inverse dependency on the smallest strictly-positive eigenvalue ν_n. A similar dependency is shown in the LSTD analysis of Bertsekas (2001).
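A small numerical experiment (our own setup, not from the paper) previews how ν_n behaves when a stationary distribution exists, anticipating Section 5.1: the smallest eigenvalue of the sample-based Gram matrix concentrates around the smallest eigenvalue ω of the model-based Gram matrix G, here computed in closed form for monomial features under a uniform distribution.

```python
import numpy as np

# Hypothetical setup: d = 3 monomial features 1, x, x^2 on X = [0, 1] with
# nu = uniform, so G_ij = int_0^1 x^(i+j) dx = 1/(i+j+1) is the 3x3 Hilbert matrix.
rng = np.random.default_rng(1)
G = np.array([[1.0 / (i + j + 1) for j in range(3)] for i in range(3)])
omega = np.linalg.eigvalsh(G)[0]              # smallest eigenvalue of G (positive)
X = rng.uniform(0.0, 1.0, size=50_000)        # stand-in for a stationary sample
Phi = np.stack([np.ones_like(X), X, X ** 2], axis=1)
nu_n = np.linalg.eigvalsh(Phi.T @ Phi / len(X))[0]
# nu_n concentrates around omega as n grows, so it stays strictly positive
# once n is large enough relative to 1/omega.
```

The more ill-conditioned G is (small ω), the larger n must be before ν_n is safely bounded away from zero, which is exactly the dependence quantified in Section 5.1.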
The main difference is that here we have a more complete finite-sample analysis, with an explicit dependency on the number of samples and the other characteristic parameters of the problem. Furthermore, if the Markov chain admits a stationary distribution ν, we are able to relate the existence of the LSTD solution to the smallest eigenvalue of the Gram matrix defined according to ν (see Section 5.1).

In order to prove Theorem 1, we first introduce the model of regression with Markov design, and then state and prove a lemma about this model.

Definition 1. The model of regression with Markov design is a regression problem in which the data (X_t, Y_t)_{1≤t≤n} are generated according to the following model: X_1, ..., X_n is a sample path generated by a Markov chain, Y_t = f(X_t) + ξ_t, where f is the target function, and the noise term ξ_t is a random variable which is adapted to the filtration generated by X_1, ..., X_{t+1} and is such that

|ξ_t| ≤ C  and  E[ξ_t | X_1, ..., X_t] = 0.   (3)

Lemma 1 (Regression bound for the Markov design setting). Consider the model of regression with Markov design of Definition 1. Let ŵ ∈ F_n be the least-squares estimate of the (noisy) values Y = {Y_t}_{t=1}^n, i.e., ŵ = Π̂Y, and w ∈ F_n be the least-squares estimate of the (noiseless) values Z = {Z_t}_{t=1}^n = {f(X_t)}_{t=1}^n, i.e., w = Π̂Z. Then for any δ > 0, with probability at least 1 − δ (w.r.t. the random sample path X_1, ..., X_n), we have

||ŵ − w||_n ≤ CL √( 2d log(2d/δ) / (ν_n n) ),   (4)

where ν_n is the smallest strictly-positive eigenvalue of the sample-based Gram matrix (1/n) Φ^⊤Φ.

Proof of Lemma 1. We define ξ ∈ R^n to be the vector with components ξ_t, and ξ̂ = ŵ − w = Π̂(Y − Z) = Π̂ξ. Since the projection Π̂ is orthogonal, we have ⟨ξ̂, ξ⟩_n = ||ξ̂||_n². Since ξ̂ ∈ F_n, there exists at least one α ∈ R^d such that ξ̂ = Φα, so by the Cauchy-Schwarz inequality we have

||ξ̂||_n² = ⟨ξ̂, ξ⟩_n = (1/n) Σ_{i=1}^d α_i Σ_{t=1}^n ξ_t φ_i(X_t) ≤ (1/n) ||α||_2 [ Σ_{i=1}^d ( Σ_{t=1}^n ξ_t φ_i(X_t) )² ]^{1/2}.   (5)

Now, among the vectors α such that ξ̂ = Φα, we define α̂ to be the one with minimal ℓ2-norm, i.e., α̂ = Φ⁺ξ̂. Let K denote the null space of Φ, which is also the null space of (1/n) Φ^⊤Φ. Then α̂ can be decomposed as α̂ = α̂_K + α̂_{K⊥}, where α̂_K ∈ K and α̂_{K⊥} ∈ K⊥, and because the decomposition is orthogonal, we have ||α̂||_2² = ||α̂_K||_2² + ||α̂_{K⊥}||_2². Since α̂ is of minimal norm among all vectors α such that ξ̂ = Φα, its component in K must be zero, thus α̂ ∈ K⊥. The Gram matrix (1/n) Φ^⊤Φ is positive-semidefinite; its eigenvectors corresponding to zero eigenvalues generate K and the remaining eigenvectors generate its orthogonal complement K⊥. Therefore, since α̂ ∈ K⊥ and the smallest strictly-positive eigenvalue of (1/n) Φ^⊤Φ is ν_n, we deduce

||ξ̂||_n² = (1/n) α̂^⊤Φ^⊤Φα̂ ≥ ν_n ||α̂||_2².   (6)

By using the result of Eq. 6 in Eq. 5, we obtain

||ξ̂||_n ≤ (1/(n √ν_n)) [ Σ_{i=1}^d ( Σ_{t=1}^n ξ_t φ_i(X_t) )² ]^{1/2}.   (7)

Now, from Eq. 3 we have E[ξ_t φ_i(X_t) | X_1, ..., X_t] = φ_i(X_t) E[ξ_t | X_1, ..., X_t] = 0, thus {ξ_t φ_i(X_t)}_t is a martingale difference sequence w.r.t. the filtration generated by the Markov chain, and one may apply Azuma's inequality to deduce that with probability 1 − δ,

| Σ_{t=1}^n ξ_t φ_i(X_t) | ≤ CL √(2n log(2/δ)).

By a union bound over the features, we have that, with probability 1 − δ, for all i = 1, ..., d,

| Σ_{t=1}^n ξ_t φ_i(X_t) | ≤ CL √(2n log(2d/δ)).   (8)

The result follows by combining Eq. 8 with Eq. 7.

Remark about Lemma 1. In the Markov design model considered in this lemma, the states {X_t}_{t=1}^n are random variables generated according to the Markov chain, and the noise term ξ_t may depend on the next state X_{t+1} (but must be centered conditioned on the past X_1, ..., X_t). This lemma will be used to prove Theorem 1, where we replace the target function f with the value function V and the noise term ξ_t with the temporal difference r(X_t) + γV(X_{t+1}) − V(X_t). Note that this lemma is an extension of the bound for the model of regression with deterministic design, in which the states {X_t}_{t=1}^n are fixed and the noise terms ξ_t are independent. In the setting of deterministic design, usual concentration results provide high-probability bounds similar to Eq. 4, but without the dependence on ν_n. An open question is whether it is possible to remove ν_n from the bound in the Markov design regression setting.

Figure 1. This figure represents the space R^n, the linear vector subspace F_n, and some of the vectors used in the proof of Theorem 1 (v, T̂v, Π̂T̂v, Π̂v, and v̂ = Π̂T̂v̂).

Proof of Theorem 1. Step 1: Using the triangle inequality (see Figure 1), we have

||v̂ − v||_n ≤ ||v̂ − Π̂T̂v||_n + ||Π̂T̂v − Π̂v||_n + ||Π̂v − v||_n.   (9)

From the γ-contraction of the mapping Π̂T̂ and the fact that v̂ is its unique fixed point, we obtain

||v̂ − Π̂T̂v||_n = ||Π̂T̂v̂ − Π̂T̂v||_n ≤ γ ||v̂ − v||_n.   (10)

Thus, from Eqs. 9 and 10, we have

||v̂ − v||_n ≤ (1/(1−γ)) ( ||Π̂v − v||_n + ||Π̂T̂v − Π̂v||_n ).   (11)

Step 2: We now provide a high-probability bound on ||Π̂T̂v − Π̂v||_n. This is a consequence of Lemma 1 applied to the vectors Y = T̂v and Z = v. Since v is the value function at the points {X_t}_{t=1}^n, from the definition of the pathwise Bellman operator we have, for 1 ≤ t ≤ n−1,

ξ_t = y_t − v_t = r(X_t) + γV(X_{t+1}) − V(X_t) = γ ( V(X_{t+1}) − ∫ P(dy|X_t) V(y) ),

and ξ_n = y_n − v_n = −γ ∫ P(dy|X_n) V(y). Thus, Eq. 3 holds for 1 ≤ t ≤ n−1. Here we may choose C = 2γVmax as a bound on the ξ_t, 1 ≤ t ≤ n−1, and C = γVmax as a bound on ξ_n. Azuma's inequality may only be applied to the sequence of the first n−1 terms (the n-th term adds a contribution to the bound), thus instead of Eq. 8 we obtain, with probability 1 − δ, for all 1 ≤ i ≤ d,

| Σ_{t=1}^n ξ_t φ_i(X_t) | ≤ γVmax L ( 2√(2n log(2d/δ)) + 1 ).

Combining with Eq. 7, we deduce that with probability 1 − δ,

||Π̂T̂v − Π̂v||_n ≤ γVmax L √(d/ν_n) ( √(8 log(2d/δ)/n) + 1/n ).   (12)

The claim follows by combining Eqs. 12 and 11.

Remark 6. In addition to Eq. 1, one may easily deduce a tighter bound (useful when γ is close to 1):

||v̂ − v||_n ≤ (1/√(1−γ²)) ||Π̂v − v||_n + (1/(1−γ)) γVmax L √(d/ν_n) ( √(8 log(2d/δ)/n) + 1/n ),

by using Pythagoras' theorem in Step 1, i.e., ||v̂ − v||_n² ≤ ( ||v̂ − Π̂T̂v||_n + ||Π̂T̂v − Π̂v||_n )² + ||Π̂v − v||_n² instead of Eq. 9.

5. Generalization Bounds

The generality of Theorem 1 comes at the cost that the performance is evaluated only at the states visited by the Markov chain. The reason is that no assumption about the existence of a stationary distribution of the Markov chain is made. However, in many problems of interest the Markov chain has a stationary distribution ν, and thus the performance can be generalized to the whole state space under the measure ν. Moreover, if ν exists, it is possible to derive a condition for the existence of the pathwise LSTD solution depending on the number of samples and on the smallest eigenvalue of the Gram matrix defined according to ν:

G ∈ R^{d×d},  G_ij = ∫ φ_i(x) φ_j(x) ν(dx).

In this section, we assume that the Markov chain M is exponentially fast β-mixing with parameters β̄, b, κ, i.e., its β-mixing coefficients satisfy β_i ≤ β̄ exp(−b i^κ) (see e.g., Sections 7.2 and 7.3 of Lazaric et al. 2010 for a more detailed definition of β-mixing processes). Before stating the main results of this section, we introduce some notation. If ν is the stationary distribution of the Markov chain, we define the orthogonal projection operator Π: B(X; Vmax) → F as

ΠV = arg min_{f ∈ F} ||V − f||_ν.   (13)

Furthermore, in the rest of the paper, with a slight abuse of notation, we replace the empirical norm ||v||_n defined on the states X_1, ..., X_n by ||V||_n, where V ∈ B(X; Vmax) is such that V(X_t) = v_t. Finally, we should guarantee that the pathwise LSTD solution V̂ is uniformly bounded on X. For this reason, we move from F to the truncated space F̃, where a function f̃ ∈ F̃ is defined as

f̃(x) = f(x) if |f(x)| ≤ Vmax, and f̃(x) = sgn(f(x)) Vmax otherwise.   (14)

In the next sections, we present conditions for the existence of the pathwise LSTD solution and derive generalization bounds under different assumptions on the way the samples X_1, ..., X_n are generated.

5.1. Existence of the Pathwise LSTD Solution

In this section, we assume that all the eigenvalues of G are strictly positive and derive a condition guaranteeing that the sample-based Gram matrix (1/n) Φ^⊤Φ is invertible. In particular, we show that if a large enough number of samples (depending on the smallest eigenvalue of G) is available, then the smallest eigenvalue of (1/n) Φ^⊤Φ is strictly positive with high probability.

Lemma 2. Let G be the Gram matrix defined according to the distribution ν and ω > 0 be its smallest eigenvalue. Let X_1, ..., X_n be a path of length n of a stationary β-mixing process with stationary distribution ν. If the number of samples n satisfies the condition

Λ(n, δ) max{ Λ(n, δ)/b, 1 }^{1/κ} < (ω n) / (288 L²),   (15)

where Λ(n, δ) = log(e/δ) + log(max{6, n β̄}), then with probability 1 − δ the family of features (φ_1, ..., φ_d) is linearly independent on the states X_1, ..., X_n (i.e., ||f_α||_n = 0 implies α = 0) and the smallest eigenvalue ν_n of the sample-based Gram matrix (1/n) Φ^⊤Φ satisfies

√ν_n ≥ √ω/2 − √( (72 L² Λ(n, δ)/n) max{ Λ(n, δ)/b, 1 }^{1/κ} ) > 0.   (16)

Proof. From the definition of the Gram matrix and the fact that ω is its smallest eigenvalue, for any function f_α ∈ F we have ||f_α||_ν² = α^⊤Gα ≥ ω ||α||_2². Using a concentration inequality (see Corollary 4 of Lazaric et al. 2010) and the fact that the basis functions φ_j are bounded by L, so that ||f_α||_∞ ≤ L||α||_2, we have, with probability 1 − δ,

||f_α||_ν ≤ 2||f_α||_n + ε(n, δ), where ε(n, δ) = √( (288 L² Λ(n, δ)/n) max{ Λ(n, δ)/b, 1 }^{1/κ} ) ||α||_2.   (17)

Thus we obtain √ω ||α||_2 ≤ ||f_α||_ν ≤ 2||f_α||_n + ε(n, δ). Let α be such that ||f_α||_n = 0; then from Eq. 17 and the condition of Eq. 15 we deduce that α = 0. Thus ν_n > 0, and the inequality in Eq. 16 is obtained by choosing α to be the eigenvector of (1/n) Φ^⊤Φ corresponding to the smallest eigenvalue ν_n: for this value of α we have ||f_α||_n = √ν_n ||α||_2, and the claim follows using Eq. 17.

Remark 1. If Λ(n, δ)/b > 1 and n β̄ ≥ 6, the condition on the number of samples can be rewritten as

n / log(e n β̄/δ)^{1 + 1/κ} ≥ 288 L² / (ω b^{1/κ}).

As can be seen, the number of samples needed to have strictly positive eigenvalues in the sample-based Gram matrix has an inverse dependency on the smallest eigenvalue of G. As a consequence, the more ill-conditioned G is, the more samples are needed for the sample-based Gram matrix (1/n) Φ^⊤Φ to be invertible.

5.2. Generalization Bounds for Stationary β-mixing Processes

In this section, we show how Theorem 1 can be generalized to the entire state space X when the Markov chain M has a stationary distribution ν. In particular, we consider the case in which the samples X_1, ..., X_n are obtained by following a single trajectory in the stationary regime of M, i.e., when X_1 is drawn from ν.

Theorem 2. Let X_1, ..., X_n be a path generated by a stationary β-mixing process with stationary distribution ν. Let ω be the smallest eigenvalue of the Gram matrix defined according to ν, and let n satisfy the condition in Eq. 15. Let Ṽ be the truncation (using Eq. 14) of the pathwise LSTD solution; then, with probability 1 − δ,

||Ṽ − V||_ν ≤ (2/(1−γ)) [ 2√2 ||V − ΠV||_ν + ε₂ + γVmax L √(d/ω̃) ( √(8 log(8d/δ)/n) + 1/n ) ] + ε₁,   (18)

where ω̃ is the lower bound on the eigenvalues of the sample-based Gram matrix given by Eq. 16 with δ replaced by δ/4, i.e.,

√ω̃ = √ω/2 − √( (72 L² Λ(n, δ/4)/n) max{ Λ(n, δ/4)/b, 1 }^{1/κ} ),

ε₁ = √( ( Λ₁(n, d, δ/4) / (n C₂) ) max{ Λ₁(n, d, δ/4)/b, 1 }^{1/κ} ),

with Λ₁(n, d, δ/4) = 2(d+1) log n + log(4e/δ) + log⁺( max{18(C₁C₂)^{2(d+1)}, β̄} ), C₁ = 6912 e Vmax², and C₂ = (1152 Vmax²)^{−1}, and

ε₂ = √( 288 ( Vmax + L||α*||_2 )² ( Λ₂(n, δ/4)/n ) max{ Λ₂(n, δ/4)/b, 1 }^{1/κ} ),

where Λ₂(n, δ/4) = log(4e/δ) + log(max{6, n β̄}) and α* is such that f_{α*} = ΠV.

Proof.
This result is a consequence of applying generalization bounds to both sides of Eq. 1 (Theorem 1). We first bound the left-hand side:

||Ṽ − V||_ν − ε₁ ≤ 2||Ṽ − V||_n ≤ 2||V̂ − V||_n

with probability 1 − δ′. The second inequality follows from the definition of the truncation operator, while the first is a straightforward application of Corollary 3 of Lazaric et al. (2010). We now bound the term ||Π̂v − v||_n in Eq. 1:

||Π̂v − v||_n ≤ ||ΠV − V||_n ≤ 2√2 ||V − ΠV||_ν + ε₂

with probability 1 − δ′. The first inequality follows from the definition of the operator Π̂, while the second is an application of the inequality of Corollary 4 of Lazaric et al. (2010) to the function ΠV − V. From Theorem 1, the two generalization bounds, and the lower bound on ν_n, each holding with probability 1 − δ′, the statement of the theorem (Eq. 18) holds with probability 1 − δ by setting δ′ = δ/4.

Remark 1. Rewriting the bound in terms of the approximation and estimation error terms (up to constants and logarithmic factors), we obtain

||Ṽ − V||_ν ≤ O( (1/(1−γ)) ||V − ΠV||_ν + (1/(1−γ)) (1/√n) ).

While the first term (the approximation error) only depends on the target function V and the function space F, the second term (the estimation error) primarily depends on the number of samples. Thus, when n goes to infinity, the estimation error goes to zero and we obtain the same performance bound (up to a 4√2 constant) as for the model-based case reported by Tsitsiklis & Van Roy (1997).

Remark 2. Antos et al. (2008) reported a sample-based analysis of the modified Bellman residual (MBR) minimization algorithm. They consider a general setting in which the function space F is bounded and the performance of the algorithm is evaluated according to an arbitrary measure µ (possibly different from the stationary distribution ν of the Markov chain). Since Antos et al. (2008) showed that the MBR minimization algorithm is equivalent to LSTD when F is a linearly parameterized space, it is interesting to compare Theorem 2 to the bound in Lemma 11 of Antos et al. (2008). In Theorem 2, similar to Antos et al. (2008), samples are drawn from a stationary β-mixing process; however, F is a linear space and ν is the stationary distribution of the Markov chain. It is interesting to note the impact of these two differences on the final bound. The use of linear spaces has a direct effect on the estimation error and leads to a better convergence rate, due to the use of improved functional concentration inequalities (Lemma 5 of Lazaric et al. 2010). In fact, while in Antos et al. (2008) the estimation error for the squared error is of order O(1/√n), here we achieve a faster convergence rate of order O(1/n). The use of ν instead of an arbitrary measure µ has a significant impact on the approximation error. The approximation error in Eq. 18, ||V − ΠV||_ν, only depends on how well the space F can approximate the value function V. On the other hand, the approximation error of Antos et al. (2008) contains terms related to more complex properties of the space, such as its capability to approximate any function obtained by applying the Bellman operator T to any function in F. This term, called the inherent Bellman error, can be shown to be small only for specific classes of MDPs (e.g., Lipschitz MDPs). Finally, it is interesting to notice that although the solution of MBR minimization reduces to LSTD, its sample-based analysis cannot be directly used for LSTD: in Antos et al. (2008) the function space F is assumed to be bounded, while general linear spaces cannot be bounded. Whether the analysis of Antos et al. (2008) can be extended to the truncated solution Ṽ of LSTD is an open question that requires further investigation.

5.3. Generalization Bounds for Markov Chains

The main assumption of the previous section is that X_1, ..., X_n is generated by a stationary β-mixing process with stationary distribution ν. This is possible if we consider samples of a Markov chain during its stationary regime, i.e., X_1 ∼ ν. In practice, however, ν is not known; the first sample X_1 is usually drawn from a given initial distribution, and the rest of the sequence is obtained by following the Markov chain from X_1 on. As a result, the sequence X_1, ..., X_n is no longer a realization of a stationary β-mixing process. Nonetheless, under suitable conditions, after ñ < n steps the distribution of X_ñ approaches the stationary distribution ν. In fact, according to the convergence theorem for fast-mixing Markov chains (see e.g., Proposition 3 of Lazaric et al. 2010), for any initial distribution ρ ∈ S(X) we have

|| ∫_X ρ(dx) P^ñ(·|x) − ν(·) ||_TV ≤ β̄ exp(−b ñ^κ).

We now derive a bound for a modification of pathwise LSTD in which the first ñ samples (used to burn in the chain) are discarded and the remaining n − ñ samples are used as training samples by the algorithm.

Theorem 3. Let X_1, ..., X_n be a trajectory generated by a β-mixing Markov chain with stationary distribution ν. Let ñ (1 ≤ ñ < n) be such that n − ñ satisfies the condition in Eq. 15, and let X_{ñ+1}, ..., X_n be the samples actually used by the algorithm. Let ω be the smallest eigenvalue of the Gram matrix defined according to ν, and let α* ∈ R^d be such that f_{α*} = ΠV. Let Ṽ be the truncation (using Eq. 14) of the pathwise LSTD solution. Then, by setting ñ = ( (1/b) log(2e β̄ n/δ) )^{1/κ}, we have, with probability 1 − δ,

||Ṽ − V||_ν ≤ (2/(1−γ)) [ 2√2 ||V − ΠV||_ν + ε₂ + γVmax L √(d/ω̃) ( √(8 log(8d/δ)/(n−ñ)) + 1/(n−ñ) ) ] + ε₁,   (19)

where ε₁ and ε₂ are defined as in Theorem 2 (with n − ñ as the number of training samples). The proof of this result is a simple consequence of applying Lemma 8 of Lazaric et al. (2010) to Theorem 2.

Remark 1. The bound in Eq. 19 indicates that, in the case of β-mixing Markov chains, a performance similar to the one for stationary β-mixing processes is obtained by discarding the first ñ = O(log n) samples.

6. Conclusions

In this paper we presented a finite-sample analysis of a natural version of LSTD, called pathwise LSTD. We first considered a general setting in which we do not make any assumption about the Markov chain. We derived an empirical performance bound indicating how close the LSTD solution is to the value function V at the states generated by the Markov chain. The bound is expressed in terms of the best possible approximation of V in the linear approximation space F (the approximation error) and an estimation error term which depends on the number of samples (the error scales as O(n^{−1/2})) and on the smallest strictly-positive eigenvalue of the sample-based Gram matrix. We then showed that when the Markov chain possesses a stationary distribution, one can deduce generalization performance bounds using the stationary distribution of the chain as the generalization measure. In particular, we considered the cases in which the sample trajectory is generated by stationary and non-stationary β-mixing Markov chains and derived the corresponding bounds. This work raises two open questions: 1) Is it possible to remove the dependence on ν_n in the bound of Theorem 1? 2) Is it possible to extend the current analysis to the general case of LSTD(λ)?

Acknowledgments

This work was supported by the French National Research Agency (ANR) under project EXPLO-RA no. ANR-08-COSI-004.

References

Antos, A., Szepesvári, Cs., and Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning Journal, 71:89-129, 2008.
Bertsekas, D. Dynamic Programming and Optimal Control. Athena Scientific, 2001.
Boyan, J. Least-squares temporal difference learning. In Proceedings of the 16th International Conference on Machine Learning, pp. 49-56, 1999.
Bradtke, S. and Barto, A. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33-57, 1996.
Lagoudakis, M. and Parr, R. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003.
Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of LSTD. Technical Report inria-00482189, INRIA, 2010.
Sutton, R. and Barto, A. Reinforcement Learning: An Introduction. MIT Press, 1998.
Tsitsiklis, J. and Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674-690, 1997.