Least-Squares λ Policy Iteration: Bias-Variance Trade-off in Control Problems

Christophe Thiery (thierych@loria.fr), Bruno Scherrer (scherrer@loria.fr)
LORIA - INRIA Lorraine, Campus Scientifique, BP 239, 54506 Vandœuvre-lès-Nancy CEDEX, FRANCE

Abstract

In the context of large-space MDPs with linear value function approximation, we introduce a new approximate version of λ-Policy Iteration (Bertsekas & Ioffe, 1996), a method that generalizes Value Iteration and Policy Iteration with a parameter λ ∈ [0, 1]. Our approach, called Least-Squares λ Policy Iteration, generalizes LSPI (Lagoudakis & Parr, 2003), which makes efficient use of training samples compared to classical temporal-difference methods. The motivation of our work is to exploit the λ parameter within the least-squares context, without having to generate new samples at each iteration or to know a model of the MDP. We provide a performance bound that shows the soundness of the algorithm. We show empirically, on a simple chain problem and on the Tetris game, that this λ parameter acts as a bias-variance trade-off that may improve the convergence and the performance of the policy obtained.

1. Introduction

We consider the question of solving Markov Decision Processes (MDPs) approximately, in the case where the value function is estimated by a linear architecture and using sampling, typically when the state space is too large for an exact solution. TD(λ) with linear approximation (Sutton & Barto, 1998) is a fundamental algorithm of the reinforcement learning community. It estimates the value function by using temporal differences based on a sample trajectory, and the parameter λ controls the length of the backups made to update the value function.

Least-squares (LS) methods such as LSTD(0) (Bradtke & Barto, 1996), LSTD(λ) (Boyan, 2002) and LSPE(λ) (Nedić & Bertsekas, 2003; Yu & Bertsekas, 2009) are alternative approaches for estimating the value function. They explicitly build a low-dimensional linear system that characterizes the solution TD(λ) would converge to, but they use simulation data more efficiently, and they usually do not need a decreasing step-size parameter, which is often hard to tune. Moreover, they converge faster in practice, though each iteration has a complexity of O(p²) instead of O(p), where p is the dimension of the linear architecture. It was shown empirically (see Boyan (2002); Yu & Bertsekas (2009); Lagoudakis et al. (2002)) that LS methods are more stable and can succeed much better than TD(λ).

LSTD(λ) and LSPE(λ) focus on the problem of evaluating a policy. In this article, we are interested in learning a policy. The λ-Policy Iteration algorithm (λPI), proposed by Bertsekas & Ioffe (1996), iterates over policies and uses a parameter λ that introduces a bias-variance trade-off similar to the one of TD(λ) when sampling is used to evaluate the current policy. λPI generalizes the classical Dynamic Programming (DP) algorithms Value Iteration and Policy Iteration to learn a policy: λ represents the size of the step made toward the value function of the policy (λ = 0 corresponds to Value Iteration and λ = 1 corresponds to Policy Iteration). On the one hand, when λ is close to 1, the variance of the value function estimation may deteriorate the eventual performance of the obtained policy. On the other hand, reducing λ introduces some kind of optimism, since we do not try to compute completely the value of the current policy before we switch to a new policy, and such a bias can degrade the convergence speed. Thus, this parameter creates a bias-variance trade-off similar to that of TD(λ), and in practice, an intermediate value of λ between 0 and 1 can give the best results (Bertsekas & Ioffe, 1996).
When they use a trade-off parameter, the approaches mentioned above make an on-policy estimation of the value function, that is, they need samples generated by the policy under evaluation. Off-policy evaluation, in contrast, can be used in a policy iteration context without having to generate new samples when the policy changes. This idea is developed in LSPI (Lagoudakis & Parr, 2003). The difference with our approach is that LSPI does not make an optimistic evaluation of the policy using a λ parameter.

In this article, we introduce Least-Squares λ Policy Iteration (LSλPI for short), a generalization of LSPI that adds the λ parameter of λPI. LSλPI is to our knowledge the first algorithm that includes all of the interesting characteristics discussed above: the bias-variance trade-off, the LS framework, the optimistic evaluation of the value function, and off-policy evaluation. Figure 1 lists the related works and summarizes their characteristics to show the main contributions of LSλPI. Overall, LSλPI can be seen as an optimistic generalization of LSPI with a λ parameter, as an off-policy LS implementation of λPI, or as an off-policy, optimistic policy iteration variant of LSPE(λ).

Figure 1. Main characteristics of related works: a trade-off parameter λ, the LS context, the optimistic evaluation of the value function, and off-policy evaluation. [Table with rows TD(λ) (Sutton & Barto, 1998), LSTD(0) (Bradtke & Barto, 1996), LSTD(λ) (Boyan, 2002), LSPE(λ) (Yu & Bertsekas, 2009), λPI (Bertsekas & Ioffe, 1996), LSPI (Lagoudakis & Parr, 2003) and LSλPI, and columns bias-variance trade-off, sample efficient (LS), optimistic evaluation, and off-policy; LSλPI is the only entry combining all four characteristics.]

The paper is organized as follows. Section 2 presents the notations and the λPI algorithm in the exact case. In Section 3, we consider the approximate case and detail LSλPI. Section 4 finally gives some experimental results showing the interest of this parameter λ.

2. Exact case: λPI

We introduce here the notations we will use and present an overview of λPI in the exact case. The reader can refer to Bertsekas & Ioffe (1996) for a more detailed description of the λPI method.

We define an MDP as a tuple (S, A, P, R), where S is the state space, A is the action space, P is the transition function (P(s, a, s') is the probability of getting to state s' after taking action a from state s) and R is the reward function: R(s, a, s') ∈ ℝ is the reward received in state s' after taking action a from state s. We will use the simplified notation R(s, a) to denote the expected reward of a state-action pair: R(s, a) = Σ_{s'∈S} P(s, a, s') R(s, a, s'). A policy π is a mapping from S to A. The value function of a policy π is the function Q^π : S × A → ℝ that associates to each state-action pair (s, a) the expected value of the discounted, cumulative reward one can get from state s, taking action a and following policy π afterwards:

  Q^π(s, a) = E_π [ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a ],

where E_π denotes the expected value with respect to the policy π, r_t is the reward received at time t, and γ ∈ (0, 1) is a discount factor.¹

¹ In this paper, we only consider value functions defined over the state-action space because we are interested in learning a policy without a model of the MDP; however, our description of λPI remains valid for value functions defined over the state space.

For any vector Q, let B_π be the Bellman operator defined by B_π Q = R + γ P_π Q, where R is the reward vector and P_π is the |S||A| × |S||A| transition matrix induced by the first action choice and the policy π thereafter. It is well known (see for instance Puterman, 1994) that this operator is a contraction mapping of modulus γ and that its only fixed point is Q^π. Thus, the value function Q^π is the solution of the Bellman equation Q = B_π Q. Let us also denote by Q* the optimal value function, that is, Q*(s, a) = max_π Q^π(s, a). If the optimal value function is known, it is straightforward to deduce an optimal policy by taking a greedy policy with respect to Q*: π*(s) ∈ arg max_a Q*(s, a).
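To make these definitions concrete, here is a minimal NumPy sketch (ours, not from the paper) of the Bellman operator B_π and of greedy policy extraction for a small tabular MDP; the array shapes and names (P, R, Q, policy) are illustrative assumptions.

```python
import numpy as np

def bellman_operator(Q, P, R, policy, gamma):
    """Apply B_pi to a tabular state-action value function.

    Q: array (S, A); P: array (S, A, S) of transition probabilities;
    R: array (S, A) of expected rewards; policy: array (S,) of actions.
    Returns (B_pi Q)(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') Q(s', pi(s')).
    """
    next_values = Q[np.arange(Q.shape[0]), policy]  # Q(s', pi(s')) for every s'
    return R + gamma * P.dot(next_values)           # result has shape (S, A)

def greedy_policy(Q):
    """Greedy policy pi(s) in argmax_a Q(s, a)."""
    return Q.argmax(axis=1)
```

With these two helpers, Q^π is the fixed point reached by iterating bellman_operator with a fixed policy, and an optimal policy is obtained as greedy_policy(Q*).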
We now present λPI, a method introduced by Bertsekas & Ioffe (1996) that generalizes the standard DP algorithms Value Iteration (VI) and Policy Iteration (PI). A parameter λ ∈ [0, 1] specifies whether the algorithm is closer to PI (λ = 1) or VI (λ = 0). At each iteration, λPI (see Algorithm 1) computes a value function Q_{k+1} from the greedy policy π_{k+1} with respect to the previous value function Q_k. The value function update corresponds to a λ-geometric average of the terms (B_{π_{k+1}})^i Q_k. The bigger λ, the closer the value function gets to Q^{π_{k+1}}. An intuitive view of the value function updates made by VI, PI and λPI is given by the greedy partition represented in Figure 2.

Algorithm 1 λ-Policy Iteration
  loop
    π_{k+1} ← greedy(Q_k)
    Q_{k+1} ← (1 − λ) Σ_{i≥1} λ^{i−1} (B_{π_{k+1}})^i Q_k
  end loop

Figure 2. Visualizing λPI on the greedy partition sketch. Following Bertsekas & Tsitsiklis (1996, page 226), one can decompose the value space as a collection of polyhedra, such that each polyhedron corresponds to a region where the greedy policy remains the same. PI generates a one-step move toward Q^{π_{k+1}}, whereas VI makes small steps toward Q^{π_{k+1}}. λPI is intermediate between PI and VI: it makes a λ-adjustable step toward Q^{π_{k+1}}.

Bertsekas & Ioffe (1996) introduced an operator, denoted M_k, that gives an alternative formulation of the algorithm. At each iteration k, this operator is defined as follows:

  M_k Q = (1 − λ) B_{π_{k+1}} Q_k + λ B_{π_{k+1}} Q.    (1)

Intuitively, M_k can be seen as a damped application of B_{π_{k+1}}. They showed that M_k is a contraction mapping of modulus λγ and that its only fixed point is the value Q_{k+1} computed by λPI. Thus, to calculate Q_{k+1} in practice, one can apply this M_k operator repeatedly until the fixed point is obtained. Bertsekas & Ioffe (1996) also showed that the value function update can be seen as an incremental update Q_{k+1} ← Q_k + Δ_k, where

  Δ_k(s_0, a_0) = E_{π_{k+1}} [ Σ_{n=0}^{∞} (λγ)^n δ_k(s_n, a_n, s_{n+1}) ]    (2)

with δ_k(s, a, s') = R(s, a, s') + γ Q_k(s', π_{k+1}(s')) − Q_k(s, a).

Equation (2) gives some insight into the variance reduction property of small values of λ. Indeed, one can see that the policy evaluation phase may be easier when λ < 1 because the horizon of the sum being estimated is reduced. However, λPI is known to converge with an asymptotic speed that depends on λ (Bertsekas & Ioffe, 1996): small values of λ deteriorate the asymptotic convergence speed because they introduce a bias in the evaluation (the algorithm no longer estimates the true value function of the policy). This drawback is less important when using approximation and simulation, since the asymptotic convergence speed is then not a crucial factor.
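As an illustration only (not code from the paper), one exact λPI iteration can be written by improving the policy and then iterating the damped operator M_k of Equation (1) until its fixed point is approximately reached; this reuses the hypothetical bellman_operator and greedy_policy helpers sketched above.

```python
import numpy as np

def lambda_pi_iteration(Q_k, P, R, gamma, lam, tol=1e-10, max_sweeps=10000):
    """One exact lambda-PI step: pi_{k+1} = greedy(Q_k), then Q_{k+1} is the
    fixed point of M_k Q = (1 - lam) * B_{pi_{k+1}} Q_k + lam * B_{pi_{k+1}} Q."""
    pi_next = greedy_policy(Q_k)                        # policy improvement
    B_Qk = bellman_operator(Q_k, P, R, pi_next, gamma)  # B_{pi_{k+1}} Q_k, computed once
    Q = Q_k.copy()
    for _ in range(max_sweeps):                         # M_k is a (lam*gamma)-contraction
        Q_new = (1.0 - lam) * B_Qk + lam * bellman_operator(Q, P, R, pi_next, gamma)
        done = np.max(np.abs(Q_new - Q)) < tol
        Q = Q_new
        if done:
            break
    return pi_next, Q
```

With lam = 0 this reduces to a single Value Iteration sweep, and with lam = 1 it amounts to a full evaluation of π_{k+1}, matching the two extremes described above.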
3. Approximate case: LSλPI

We have presented λPI in the exact case. However, λPI was designed for a context of approximation and sampling. The algorithm we introduce now, called LSλPI, is an approximate version of λPI which reduces to LSPI (Lagoudakis & Parr, 2003) when λ = 1. LSλPI is an iterative, sampling-based algorithm that computes at each iteration an estimation of the value function of the current policy, and then uses this approximation to improve the policy. As it uses value functions defined over the state-action space, knowing a model of the MDP is not required. Like LSPI, LSλPI is an off-policy algorithm: the policy under evaluation may be different from the one used to generate samples. Thus, a unique sample set can be reused for all iterations even if the policy changes.

Approximation architecture

LSλPI relies on a standard linear representation of the value function. At each iteration k, we consider a policy π_k and an approximate value function Q̂_k that estimates the Q_k that the exact version of λPI would compute. The new policy π_{k+1} is the greedy policy with respect to Q̂_k, and we approximate Q_{k+1} by a linear combination of basis functions:

  Q̂_{k+1}(s, a) = Σ_{i=1}^{p} φ_i(s, a) w_i.

The φ_i terms are p arbitrary basis functions and the w_i are the parameters of this architecture. As in general p ≪ |S||A| when the state space is large, such a value function requires far fewer resources than an exact (tabular) representation. If we denote by φ(s, a) the column vector of size p whose elements are the basis functions applied to (s, a), and by Φ the |S||A| × p matrix composed of the transposes of all these vectors, we can write Q̂_{k+1} = Φ w_{k+1}, where w_{k+1} is the parameter vector representing the approximate value function Q̂_{k+1}.

To compute w_{k+1}, LSλPI can use two standard methods. Both methods are described for instance in Lagoudakis & Parr (2003), and we generalize them to LSλPI. They both try to compute an approximate value function Q̂_{k+1} which approaches the vector Q_{k+1}; in other words, they look for a Q̂_{k+1} such that M_k Q̂_{k+1} ≈ Q̂_{k+1}. When λ = 1, we have M_k = B_{π_{k+1}} and the resulting algorithm, which approximates the Bellman equation's solution, matches LSPI.

3.1. Fixed-point method (FP)

As the fixed-point equation Q̂_{k+1} = M_k Q̂_{k+1} has no solution in general, since M_k Q̂_{k+1} is not necessarily in the space spanned by the basis functions, the fixed-point method (FP) applies an orthogonal projection to it. This means that we seek an approximate value function Q̂_{k+1} that verifies

  Q̂_{k+1} = Φ (Φᵀ D_μ Φ)⁻¹ Φᵀ D_μ M_k Q̂_{k+1}.    (3)

D_μ represents the diagonal matrix of size |S||A| whose terms are the projection weights, denoted by μ(s, a), where μ is a probability distribution over S × A. By developing Equation (3), it can be verified that w_{k+1} is the solution of the linear system Aw = b of size p × p with

  A = Φᵀ D_μ (Φ − λγ P_{π_{k+1}} Φ)  and  b = Φᵀ D_μ (R + (1 − λ)γ P_{π_{k+1}} Φ w_k).

When the number of states is large, A and b cannot be computed directly, even if a model of the MDP is available. We can estimate them by using a set of L samples of the form (s, a, r', s') where (s, a) ∼ μ, s' ∼ P(s, a, ·) and r' = R(s, a, s'). It can be seen that

  A = E[ φ(s, a) (φ(s, a) − λγ φ(s', π_{k+1}(s')))ᵀ ]  and  b = E[ φ(s, a) (r' + (1 − λ)γ φ(s', π_{k+1}(s'))ᵀ w_k) ].

Let us denote by Â and b̂ the sampling-based estimations of A and b. For each sample (s, a, r', s') considered, we can update Â and b̂ as follows:

  Â ← Â + (1/L) φ(s, a) (φ(s, a) − λγ φ(s', π_{k+1}(s')))ᵀ,
  b̂ ← b̂ + (1/L) φ(s, a) (r' + (1 − λ)γ φ(s', π_{k+1}(s'))ᵀ w_k).

To simplify these update rules, as we intend to solve the linear system Âw = b̂, we can remove the 1/L factor in the Â and b̂ updates without changing the solution. Once we have estimated Â and b̂ from a sample source, we solve the p × p linear system Âw = b̂ to obtain the new parameter vector w_{k+1}. As in Lagoudakis & Parr (2003), it is also possible to update Â⁻¹ incrementally by using the Sherman-Morrison formula (starting from a small multiple of the identity matrix), instead of solving the system after each sample:

  Â⁻¹ ← Â⁻¹ − (Â⁻¹ u vᵀ Â⁻¹) / (1 + vᵀ Â⁻¹ u)    (4)

where u = φ(s, a) and v = φ(s, a) − λγ φ(s', π_{k+1}(s')). The weight vector w_{k+1} is then computed as Â⁻¹ b̂, which reduces the complexity to O(p²) per sample instead of O(p³).
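The FP estimation just described can be sketched as follows (an illustrative NumPy fragment under our own conventions, not the authors' code): phi is assumed to be a function returning the length-p feature vector φ(s, a), policy_next implements π_{k+1}, and samples is an iterable of (s, a, r, s') tuples drawn as above. The 1/L factor is dropped, as in the text.

```python
import numpy as np

def fp_weights(samples, phi, policy_next, w_k, gamma, lam, p):
    """Fixed-point (FP) estimate of w_{k+1}: accumulate A_hat and b_hat over
    the samples and solve the p x p linear system A_hat w = b_hat."""
    A_hat = np.zeros((p, p))
    b_hat = np.zeros(p)
    for (s, a, r, s_next) in samples:
        u = phi(s, a)
        phi_next = phi(s_next, policy_next(s_next))  # phi(s', pi_{k+1}(s'))
        v = u - lam * gamma * phi_next
        A_hat += np.outer(u, v)                      # matches the A update above
        b_hat += u * (r + (1.0 - lam) * gamma * phi_next.dot(w_k))
    return np.linalg.solve(A_hat, b_hat)
```

An incremental variant would instead maintain Â⁻¹ directly and apply the Sherman-Morrison update of Equation (4) after each sample, at a cost of O(p²) per sample.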
3.2. Bellman residual minimization (BR)

To calculate the approximate value function Q̂_{k+1}, an alternative to FP is the Bellman residual minimization method (BR). Let us consider the (generalized) Bellman equation Q̂_{k+1} = M_k Q̂_{k+1} and the corresponding residual defined by Q̂_{k+1} − M_k Q̂_{k+1}. We seek a value function Q̂_{k+1} that minimizes the quadratic, weighted norm of this residual, ‖Q̂_{k+1} − M_k Q̂_{k+1}‖_{μ,2}, where ‖·‖_{μ,2} denotes the quadratic norm weighted by the distribution μ: ‖Q‖²_{μ,2} = Σ_{s,a} μ(s, a) Q(s, a)². It can be verified that the norm to minimize equals

  ‖(Φ − λγ P_{π_{k+1}} Φ) w_{k+1} − R − (1 − λ)γ P_{π_{k+1}} Φ w_k‖_{μ,2}.

Therefore, by a standard least-squares analysis, the parameter vector w_{k+1} that minimizes the Bellman residual is the solution of the p × p linear system Aw = b with

  A = (Φ − λγ P_{π_{k+1}} Φ)ᵀ D_μ (Φ − λγ P_{π_{k+1}} Φ)  and  b = (Φ − λγ P_{π_{k+1}} Φ)ᵀ D_μ (R + (1 − λ)γ P_{π_{k+1}} Φ w_k).

As in the case of FP, A and b can be estimated from a set of samples whose distribution corresponds to μ. However, for each sample, BR requires the generation of two independent possible next states instead of only one. This requirement is known for λ = 1 (Sutton & Barto, 1998) and also applies for any λ > 0. Let us denote by (s, a, r', s', s'') a sample, where s' and s'' are the results of two independent realizations of the action a from the state s, and r' = R(s, a, s') (the reward of s'' is not needed). Again, if (s, a) is drawn from μ and s' and s'' are drawn independently from P(s, a, ·), we have

  A = E[ (φ(s, a) − λγ φ(s'', π_{k+1}(s''))) (φ(s, a) − λγ φ(s', π_{k+1}(s')))ᵀ ]

and

  b = E[ (φ(s, a) − λγ φ(s'', π_{k+1}(s''))) (r' + (1 − λ)γ φ(s', π_{k+1}(s'))ᵀ w_k) ].

As in FP, it is straightforward from these equations to build estimations Â and b̂ from the samples, by using similar incremental update rules. We can solve the p × p linear system Âw = b̂ to get the new value function after each sample, or incrementally update Â⁻¹ using the Sherman-Morrison formula: this update is identical to that of FP (Equation (4)) except that in this case u = φ(s, a) − λγ φ(s'', π_{k+1}(s'')).
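For comparison, here is the corresponding BR accumulation under the same illustrative conventions (again a sketch of ours, not the authors' code); the only changes with respect to the FP fragment are that each sample carries two independent successors s' and s'' and that the left factor u now also involves a successor feature vector.

```python
import numpy as np

def br_weights(samples, phi, policy_next, w_k, gamma, lam, p):
    """Bellman-residual (BR) estimate of w_{k+1}. samples is an iterable of
    (s, a, r, s1, s2) tuples where s1 and s2 are two independent successors
    of (s, a) and r = R(s, a, s1)."""
    A_hat = np.zeros((p, p))
    b_hat = np.zeros(p)
    for (s, a, r, s1, s2) in samples:
        phi_sa = phi(s, a)
        phi_1 = phi(s1, policy_next(s1))
        phi_2 = phi(s2, policy_next(s2))
        u = phi_sa - lam * gamma * phi_2    # uses the second, independent successor
        v = phi_sa - lam * gamma * phi_1
        A_hat += np.outer(u, v)
        b_hat += u * (r + (1.0 - lam) * gamma * phi_1.dot(w_k))
    return np.linalg.solve(A_hat, b_hat)
```

Using two different successors in u and in the (v, r) terms is what keeps the estimates of A and b unbiased; with a single successor the two factors of each product would be correlated.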
3.3. Discussion

For both methods, if a model of the MDP is available, we can integrate it in the update rule: the samples are then reduced to state-action pairs (s, a), and we replace the successor terms φ(s', π_{k+1}(s')) and r' by their respective expected values Σ_{s'∈S} P(s, a, s') φ(s', π_{k+1}(s')) and R(s, a). Note that for BR, when there is a model of the MDP, the double sampling issue discussed above disappears.

FP and BR find different solutions in general, though when λ = 0, they are equivalent (this can be seen in the formulas above) and they reduce to Fitted Value Iteration (Szepesvári & Munos, 2005). When λ = 1, according to the literature (Lagoudakis & Parr, 2003), FP seems to give better results. One can find examples where BR does not compute the right solution while FP does (Sutton et al., 2009). However, Schoknecht (2002) showed that FP can suffer from numerical instability (the matrix A for FP is not always invertible).

LSλPI iterates over policies. State-of-the-art LS policy evaluation approaches such as LSTD(λ) (Boyan, 2002) and LSPE(λ) (Yu & Bertsekas, 2009) could also be used in a policy iteration context to address control problems. The major difference is that LSλPI is an optimistic algorithm: our algorithm switches to the next policy before the current policy is completely evaluated. In LSTD(λ), the value is evaluated completely and λ controls the depth of the backups made from the sampled trajectories. LSPE(λ) is an approximation of λPI applied to the evaluation of a fixed policy: it makes multiple λ-steps towards the value of the policy using an LS method until the policy is evaluated completely, whereas LSλPI makes only one λ-step and then changes the policy. Another difference between LSPE(λ) and LSλPI is the fact that they estimate different terms to make the approximate λ-step: LSPE(λ) uses one or several trajectories generated with the current policy and relies on the equation Q_{k+1} = Q_k + Δ_k (see Equation (2)), whereas LSλPI considers the equation Q̂_{k+1} = M_k Q̂_{k+1} and can use off-policy samples.

3.4. Performance bound

Bertsekas & Tsitsiklis (1996) provided a performance bound for the policy sequence generated by approximate policy/value iteration algorithms (the cases λ = 0 and λ = 1). It turns out that a similar bound applies to a large class of approximate optimistic policy iteration algorithms, of which LSλPI is a specific case. For all these algorithms, if the approximation error is bounded at each iteration, then the asymptotic distance to the optimal policy is bounded:

Theorem 1 (Performance bound for optimistic PI) Let (λ_n)_{n≥1} be a sequence of positive weights such that Σ_{n≥1} λ_n = 1. Let Q_0 be an arbitrary initialization. We consider an iterative algorithm that generates the sequence (π_k, Q_k)_{k≥1} with

  π_{k+1} ← greedy(Q_k),   Q_{k+1} ← Σ_{n≥1} λ_n (B_{π_{k+1}})^n Q_k + ε_{k+1},

where ε_{k+1} is the approximation error made when estimating the next value function. Let ε be a uniform bound on that error, i.e. ‖ε_k‖ ≤ ε for all k. Then

  lim sup_{k→∞} ‖Q* − Q^{π_k}‖ ≤ 2γε / (1 − γ)².

The proof, which is omitted for lack of space, can be found in Thiery & Scherrer (2010). Approximate versions of λPI such as LSλPI correspond to the case where λ_n = (1 − λ) λ^{n−1} for all n, but the result remains valid for any choice of coefficients λ_n that sum to 1. Thus, the bound also applies to Modified Policy Iteration (Puterman, 1994), which consists in applying the Bellman operator m times for a fixed m (i.e. we take λ_m = 1 and λ_n = 0 for all n ≠ m).

4. Experiments

We now present an experimental validation of LSλPI on two problems addressed by Lagoudakis et al. (2002; 2003) for LSPI: a simple chain MDP where the exact optimal value function is known, and the more challenging Tetris problem.

4.1. Chain problem

We consider the exact same chain problem as in Lagoudakis & Parr (2003), that is, a chain of 20 states with two possible actions: left (L) or right (R). Each action goes in the intended direction with probability 0.9, and in the opposite direction with probability 0.1. When the agent gets to the left or right end of the chain, a reward of 1 is obtained. In all other cases, the reward is 0. The optimal value function can be computed exactly.
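For concreteness, the chain MDP can be built explicitly as below (our sketch, not the authors' code). We assume states are indexed 0..19 with actions 0 = left and 1 = right, that an agent trying to move off either end of the chain stays in place, and that the reward of 1 is received upon reaching state 0 or state 19, as described above.

```python
import numpy as np

N_STATES, LEFT, RIGHT = 20, 0, 1

def build_chain(success_prob=0.9):
    """Transition tensor P[s, a, s'] and reward tensor R[s, a, s'] of the chain:
    each action moves in the intended direction with probability 0.9 (and in
    the opposite one otherwise); arriving at either end yields a reward of 1."""
    P = np.zeros((N_STATES, 2, N_STATES))
    R = np.zeros((N_STATES, 2, N_STATES))
    for s in range(N_STATES):
        left_state = max(s - 1, 0)
        right_state = min(s + 1, N_STATES - 1)
        P[s, LEFT, left_state] += success_prob
        P[s, LEFT, right_state] += 1.0 - success_prob
        P[s, RIGHT, right_state] += success_prob
        P[s, RIGHT, left_state] += 1.0 - success_prob
    R[:, :, 0] = 1.0
    R[:, :, N_STATES - 1] = 1.0
    return P, R

P, R = build_chain()
R_expected = (P * R).sum(axis=2)  # expected rewards R(s, a), as used by the earlier sketches
```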
In our experiments, we compare at each iteration the current value function Q̂_{k+1} to the optimal one, and the (exact) value of the current policy to the optimal value. We do not use the knowledge of the model (transitions and rewards) of the MDP, and we represent the state space with the set of polynomial or Gaussian basis functions proposed by Lagoudakis & Parr (2003).

4.1.1. Influence of λ

We naturally observe that the value function convergence is harder when the number of samples is low, or when γ is high (i.e. the horizon is longer). When this is the case, λ has an influence. Figure 3 represents the distance between the approximate value function and the optimal value function for several values of λ, with γ = 0.95, averaged over 10 executions with different samples. The left graph corresponds to FP and the right graph to BR. As expected, we observe that for λ < 1, the approximation is better because the sampling variance is reduced, and more iterations are needed to obtain this better approximation. For λ = 1, after only a few iterations, the value function stops improving and oscillates between values that are relatively far from the optimal ones, compared to lower values of λ. Thus, intermediate values of λ seem to offer the best trade-off between approximation quality and convergence speed. We actually observe, especially for BR, that it might be interesting to use a decreasing value of λ: indeed, in the first iterations, values of λ close to 1 approach the optimal value function faster, and then smaller values of λ lead to better asymptotic convergence. One can notice that these curves are similar to those of Kearns & Singh (2000), who provide a formal analysis of the bias-variance trade-off of TD(λ).

Figure 3. Chain problem. Evolution of ‖Q̂_k − Q*‖, the distance between the current approximate value function and the optimal value function, for several values of λ, with γ = 0.95 and the Gaussian basis function set. Average curves of 10 runs, each run using a different set of episodes of 200 samples.

On most of the experiments we ran, FP and BR give similar performance, with a slight advantage for FP. We observe on this typical experiment that for FP, there is a larger range of λ values that make the value function converge to a good approximation. This confirms the discussion of Section 3.3: in practice, FP seems to give better results than BR, even though numerical stability problems can occur in theory.

Another observation is that when the value function does not converge, the policy oscillates with a frequency that increases with λ. This happens when there is a cycle in the sequence of policies. It can be seen on Figure 5, which draws ‖Q^{π_k} − Q*‖ for the FP experiment of Figure 3. For small values of λ, the policy oscillates slowly because LSλPI makes small steps. When λ increases, the oscillations are faster as the steps are bigger (recall the intuitive view of Figure 2). For large values of λ, the policy does not converge anymore and oscillates with a higher frequency. For λ = 0.7, we observed that the policy converged: this is all the more interesting as, in such a convergent case, the performance guarantee of Theorem 1 can be slightly refined (Thiery & Scherrer, 2010).

Figure 5. Chain problem. Evolution of ‖Q^{π_k} − Q*‖, the distance between the value of the current policy and the optimal value, for the FP experiment of Figure 3.

Figure 4. Chain problem. Result of FP applied with γ = 0.9 for several values of λ and the polynomial basis function set. Left: evolution of ‖Q̂_k − Q*‖. Right: evolution of ‖Q^{π_k} − Q*‖. Average curves of 10 runs, each run using a different set of episodes of 200 samples.

We observe that when the policy is the same for different values of λ, the value function seems to converge to a limit that does not depend on λ. We verified this property, which does not apply to BR.
4.1.2. Convergence of FP

Figure 4 plots an experiment with FP where convergence is easier than in the previous example, as we set γ = 0.9 (the other parameters are unchanged). The right graph shows the quality of the policy: it seems to converge except when λ = 1. The left graph represents the distance between the current approximate value and the optimal value. After about 40 iterations, the policy becomes the same for all converging values of λ. Then, it seems that the value function converges to the same vector for all these values of λ. Indeed, we can verify that for FP, if the policy and the value function both converge, then the limit of the value function does not depend on λ (this is in general not the case for BR). If we denote by Π the projection Φ(Φᵀ D_μ Φ)⁻¹ Φᵀ D_μ, then by using the definition of M_k (Equation (1)) and the facts that Q̂_{k+1} = Q̂_k and π_{k+1} = π_k, we have

  Q̂_{k+1} = Π M_k Q̂_{k+1} = Π ((1 − λ) B_{π_{k+1}} Q̂_{k+1} + λ B_{π_{k+1}} Q̂_{k+1}) = Π B_{π_{k+1}} Q̂_{k+1},

so the limit is a fixed point of Π B_{π_{k+1}}, a characterization in which λ does not appear.

4.2. Tetris

Tetris is a popular video game that consists in moving and rotating different shapes to fill as many rows as possible in a 10 × 20 grid. We reproduced the experimental settings of Lagoudakis et al. (2002). Though our initial policy is the same as theirs (personal communication), the scores cannot be compared easily: the initial policy scores about 250 rows per game on our implementation, whereas they report an initial mean score of 600 rows. This may be due to Tetris-specific implementation differences (see the review of Thiery & Scherrer, 2009).

We first ran LSλPI on a set of 10000 samples, as Lagoudakis et al. (2002) did for λ = 1. We observed that diminishing λ did not improve the performance (it just made the convergence slower). We think that the sample set was too large for λ to be useful. We then tried a smaller set of training data (1000 samples instead of 10000) in order to make the convergence more difficult. Figure 6 represents the performance of the learned policies for various values of λ. When λ = 1, the algorithm diverges because of the small number of samples: the score switches between 0 and about 600 for FP, and falls to 0 for BR. Better scores are achieved for other values of λ. As in the chain walk experiment, λ has more influence on the performance with BR. After convergence, the best performance is reached by BR with λ = 0.9, and the corresponding policy scores about 800 rows per game.

Figure 6. Average score of 100 Tetris games for several values of λ at each iteration of LSλPI. Due to the low number of training samples (1000), the algorithm diverges when λ = 1.0 for both methods. When convergence is obtained, the best performance is reached by BR with λ = 0.9.

Conclusion

We introduced LSλPI, an implementation of λPI (Bertsekas & Ioffe, 1996) within the LS context. Our algorithm generalizes LSPI (Lagoudakis & Parr, 2003) by adding the optimistic evaluation of λPI. LSλPI is to our knowledge the first method that has the following four characteristics: it uses a λ parameter to make a bias-variance trade-off in the value estimation phase, makes an optimistic evaluation of the policy, fits within the LS framework to make efficient use of training data, and iterates over policies in an off-policy way, while most LS works make on-policy evaluations. It shares the virtues of LSPI: a model of the MDP is not needed (though it can be integrated when available), and as the value function is estimated off-policy, generating new trajectories or samples with each policy is not required. We provided an analytical argument showing that optimistic policy iteration algorithms such as LSλPI are sound, and we ran experiments that emphasize the influence of λ in two control problems: a simple chain walk problem and the Tetris game. Our empirical results show that intermediate values of λ can give the best results in practice when the number of samples is low. This may be of particular interest in online, off-policy learning applications.
References

Bertsekas, D. and Ioffe, S. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical report, MIT, 1996.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, 1996.

Boyan, J. A. Technical update: Least-squares temporal difference learning. Machine Learning, 49:233-246, 2002.

Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33-57, 1996.

Kearns, M. and Singh, S. Bias-variance error bounds for temporal difference updates. In Proceedings of the 13th Annual Conference on Computational Learning Theory, pp. 142-147, 2000.

Lagoudakis, M. G. and Parr, R. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003.

Lagoudakis, M. G., Parr, R., and Littman, M. L. Least-squares methods in reinforcement learning for control. In SETN'02: Proceedings of the Second Hellenic Conference on AI, pp. 249-260. Springer-Verlag, 2002.

Nedić, A. and Bertsekas, D. P. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13(1-2):79-110, 2003.

Puterman, M. Markov Decision Processes. Wiley, New York, 1994.

Schoknecht, R. Optimality of reinforcement learning algorithms with linear function approximation. In NIPS, pp. 1555-1562, 2002.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. Bradford Book. The MIT Press, 1998.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In ICML'09: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 993-1000, 2009.

Szepesvári, C. and Munos, R. Finite time bounds for sampling based fitted value iteration. In ICML'05: Proceedings of the 22nd International Conference on Machine Learning, pp. 880-887. ACM, 2005.

Thiery, C. and Scherrer, B. Building Controllers for Tetris. International Computer Games Association Journal, 32:3-11, 2009.

Thiery, C. and Scherrer, B. Performance bound for Approximate Optimistic Policy Iteration. Technical report, 2010. URL http://hal.inria.fr/inria-00480952.

Yu, H. and Bertsekas, D. P. Convergence results for some temporal difference methods based on least squares. IEEE Transactions on Automatic Control, 54:1515-1531, 2009.