Preconditioned Temporal Difference Learning

Hengshuai Yao, Zhi-Qiang Liu
School of Creative Media, City University of Hong Kong, Hong Kong, China
hengshuai@gmail.com, zq.liu@cityu.edu.hk

Abstract

This paper extends many of the recent popular policy evaluation algorithms to a generalized framework that includes least-squares temporal difference (LSTD) learning, least-squares policy evaluation (LSPE) and a variant of incremental LSTD (iLSTD). The basis of this extension is a preconditioning technique that solves a stochastic model equation. The paper also studies three significant issues of the new framework: it presents a new rule of step-size that can be computed online, provides an iterative way to apply preconditioning, and reduces the complexity of the related algorithms to near that of temporal difference (TD) learning.

1. Introduction

In Reinforcement Learning (RL), a primary concern is how to reuse experience in an intelligent and fast way. To achieve this we must consider two major issues, namely data efficiency and computational efficiency. Recently these two issues have been widely studied in research on temporal difference (TD) learning. TD is a classical algorithm well suited for policy evaluation (Sutton, 1988), and it has achieved great success through its wide applications in control and AI games (Sutton & Barto, 1998). One of its significant advantages is its superior computational efficiency.
If the feature vector has K components, TD requires only O(K) computation per time step. However, previous research shows that TD uses experience inefficiently (Lin & Mitchell, 1992)(Geramifard et al., 2006a): TD throws the transition information away after using it for a single update of the weights.

One way to reuse this information is to accumulate it into a data set once it has been experienced; TD methods are then repeatedly applied to the data set. This pattern is known as experience replay (Lin & Mitchell, 1992) and uses experience efficiently because each transition is exploited to the maximum extent. However, this method may be inefficient to perform online because the data set can grow extremely large if the exploration process runs for a long time.¹ Another way is to extract a data structure from the sequence of experience and update the weights with the help of this structure. This is more desirable because the data structure requires much less memory than the data set, and it leads to recent popular algorithms such as least-squares temporal difference (LSTD) learning (Boyan, 1999) and least-squares policy evaluation (LSPE) (Nedić & Bertsekas, 2003). Compared to TD, the two algorithms are more data efficient but, like experience replay, still computationally expensive. LSTD inverts an accumulated matrix at every time step and generally requires O(K^3) operations. Recursive LSTD (RLSTD) computes the inverse of LSTD's matrix iteratively using the Sherman-Morrison formula and reduces the complexity to O(K^2) (Bradtke & Barto, 1996)(Xu et al., 2002). LSPE is similar to LSTD and will be examined later. Recently, incremental LSTD (iLSTD) was proposed to strike a balance between LSTD and TD (Geramifard et al., 2006a)(Geramifard et al., 2006b): its data efficiency is almost as good as LSTD's, while its computational cost is very near that of TD. However, iLSTD still requires tuning the step-size manually, as TD does. In contrast, LSTD has no parameter to tune.

¹ Lin avoided this problem by using only a window of the most recent experience (Lin & Mitchell, 1992). This, however, loses information, and it is in general difficult to prespecify the window size.

The aim of this paper is to explore the relations among recent popular policy evaluation algorithms. A framework of policy evaluation algorithms called preconditioned temporal difference (PTD) learning is introduced, which includes LSTD, LSPE and a variant of iLSTD. Furthermore, we maintain LSTD's property of ease of tuning: we propose an adaptive step-size that can be computed online by all PTD algorithms. To reduce the O(K^2) complexity that PTD algorithms incur from inverting the preconditioner matrix, we develop an efficient incremental process for applying preconditioning, which leads to a set of incremental PTD algorithms. Incremental PTD algorithms take advantage of the sparse nature of RL tasks, storing and manipulating only the valid experience in a condensed structure at every time step. We also present an efficient adaptive step-size for incremental PTD algorithms. Results on the Boyan chain example show that PTD algorithms using the adaptive step-size converge much faster than with the previous rule of step-size, and that incremental PTD algorithms with the condensed implementation combine high data efficiency with a low complexity similar to that of iLSTD.

1.1. Stationary Model Equation

Given a state space S = {1, 2, ..., N}, the problem of policy evaluation is to predict the long-term reward of a policy for each state s:

    J(s) = Σ_{t=0}^{∞} γ^t r(s_t, s_{t+1}),  s_0 = s,  0 < γ ≤ 1,

where γ is the discount factor and r(s_t, s_{t+1}) is the reward received by the agent at time t. Given K (K ≤ N) feature functions φ_k(·): S → R, k = 1, ..., K, the feature of state s_t is φ_t = [φ_1(s_t), φ_2(s_t), ..., φ_K(s_t)]'. The reward vector J can then be approximated by Ĵ = Φw, where w is the weight vector and Φ is the feature matrix with entries Φ(j, k) = φ_k(j), k = 1, ..., K; j = 1, ..., N.

For an ergodic Markov chain with steady-state probabilities π(1), π(2), ..., π(N), (Tsitsiklis & Van Roy, 1997) proved that TD(λ) eventually finds a weight vector w* that satisfies the linear system Aw* = -b, where

    A = Φ' D (γP - I) Σ_{k=0}^{∞} (γλP)^k Φ,   b = Φ' D Σ_{k=0}^{∞} (γλP)^k r̄,    (1)

where D is the diagonal matrix with diagonal entries π(i), i = 1, ..., N; λ ∈ [0, 1] is the eligibility trace factor; P is the transition probability matrix; I is the identity matrix; and r̄ is the vector with components r̄_i = Σ_{j=1}^{N} P_{i,j} r(i, j), i = 1, ..., N. For each λ ∈ [0, 1], w* is also the limit point of LSTD(λ), LSPE(λ) and iLSTD(λ). Equation (1) is useful only for analysis and is not applicable in practice; it will be called the stationary model equation.

1.2. Law of Large Numbers

A common structure grown by LSTD(λ), LSPE(λ) and iLSTD(λ) is updated incrementally. If the current transition from s_t to s_{t+1} incurs a reward r_t, then a matrix and a vector are updated by

    Ã_{t+1} = Ã_t + z_t (γφ_{t+1} - φ_t)',   b̃_{t+1} = b̃_t + z_t r_t,

where z_t is the eligibility trace, computed recursively by z_{t+1} = γλ z_t + φ_{t+1}. Because the components of Ã_{t+1} and b̃_{t+1} can grow without bound, it is better to work with well-defined terms. For infinite-horizon problems, (Tadić, 2001)(Nedić & Bertsekas, 2003) used the structure

    A_{t+1} = (1/(t+1)) Ã_{t+1},   b_{t+1} = (1/(t+1)) b̃_{t+1},    (2)

which satisfies the law of large numbers. However, (2) no longer gives consistent estimates of A and b for absorbing Markov chains such as the Boyan chain example. In Section 2.1 such an extension is proposed.
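To make the accumulation above concrete, the following minimal numpy sketch grows the (Ã, b̃) structure over one trajectory and returns the averaged estimates of (2). The function name, the (phi, r, phi_next) tuple format and the defaults are our own illustrative choices, not the authors' code.

```python
import numpy as np

def accumulate_structure(transitions, K, gamma=1.0, lam=0.0):
    """Grow the common LSTD/LSPE/iLSTD structure over one trajectory.

    `transitions` is a non-empty list of (phi, r, phi_next) triples, with
    each feature vector a length-K numpy array.  Returns the raw sums
    (A_tilde, b_tilde) and the averaged estimates as in (2).
    """
    A_tilde = np.zeros((K, K))
    b_tilde = np.zeros(K)
    z = np.zeros(K)                                      # eligibility trace
    for phi, r, phi_next in transitions:
        z = gamma * lam * z + phi                        # trace update z_t = gamma*lambda*z_{t-1} + phi_t
        A_tilde += np.outer(z, gamma * phi_next - phi)   # A~ += z_t (gamma*phi_{t+1} - phi_t)'
        b_tilde += z * r                                 # b~ += z_t r_t
    n = len(transitions)
    return A_tilde, b_tilde, A_tilde / n, b_tilde / n    # last two correspond to (2)
```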
1.3. Related Algorithms

At time t, the rule of LSTD(λ) for updating the weights can be written as w_{t+1} = -Ã_{t+1}^{-1} b̃_{t+1}. In practice Ã_0 can be singular, so a perturbation that sets Ã_0 = -δ_0 I (δ_0 > 0) should be used.

LSPE(λ) was proposed for infinite-horizon problems (Nedić & Bertsekas, 2003). If the current step-size is γ_t, LSPE updates w by

    w_{t+1} = w_t + γ_t D_{t+1}^{-1} (A_{t+1} w_t + b_{t+1}),    (3)

where

    D_{t+1} = (1/(t+1)) ( Σ_{k=0}^{t} φ_k φ_k' + δ_0 I ),  δ_0 > 0.

In the long run, D_{t+1} converges to Φ'DΦ.

1.4. Preconditioning

Generally, solutions to a linear system like (1) fall into two classes. The first is the direct methods (Saad, 2003), which factorize A into easily invertible matrices; they include Gaussian elimination and LU/QR factorization. However, the cost of factorizing A is not practical for large-scale systems. The second class, the iterative methods, scales well with problem size and is very efficient for large, sparse linear systems.

According to the literature on iterative solutions, preconditioning is especially effective for symmetric systems (Saad, 2003), but for RL tasks the matrix A is usually not symmetric. Therefore the original stationary model equation is first transformed into the symmetric form

    A' A w = -A' b,    (4)

which can be solved by Richardson's iteration (Saad, 2003)

    w_{τ+1} = w_τ - α A' (A w_τ + b),    (5)

where α is a positive scalar that should satisfy ρ(I - αA'A) < 1. Preconditioning refers to a general technique that transforms a system before solving it. For example, preconditioning (4) gives the preconditioned symmetric model equation

    C^{-1} A' A w = -C^{-1} A' b,

where C is an invertible matrix, usually called the preconditioner. The model equation is then solved by the iteration

    w_{τ+1} = w_τ - α C^{-1} A' (A w_τ + b).    (6)

The convergence rate of (6) is governed by the spectral radius of I - αC^{-1}A'A: the smaller the radius, the faster the iteration converges (Saad, 2003). A good preconditioner should therefore make the preconditioned radius smaller than the original one, i.e., ρ(I - αC^{-1}A'A) < ρ(I - αA'A).
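The following sketch shows Richardson's iteration (5) and its preconditioned form (6) in numpy. The step-size alpha, the iteration count and the function name are our own choices; alpha must satisfy the spectral-radius condition stated above for the loop to converge.

```python
import numpy as np

def preconditioned_richardson(A, b, C=None, alpha=0.1, n_iters=200):
    """Iterate w <- w - alpha * C^{-1} A'(A w + b).

    With C=None this is plain Richardson's iteration (5); passing any
    invertible preconditioner C gives iteration (6).  The fixed point
    solves A'Aw = -A'b.
    """
    K = A.shape[1]
    w = np.zeros(K)
    C = np.eye(K) if C is None else C
    for _ in range(n_iters):
        w = w - alpha * np.linalg.solve(C, A.T @ (A @ w + b))
    return w
```

Whether a candidate C actually helps can be checked numerically by comparing the spectral radius of I - αC^{-1}A'A against that of I - αA'A for the problem at hand.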
2. The Generalized Framework

We first give consistent estimates of A and b for absorbing Markov chains, and then show how to apply them together with preconditioning to policy evaluation.

2.1. Robbins-Monro for Absorbing Chains

A trajectory of an absorbing Markov chain is a finite sequence s_0, ..., s_q, where s_q is the absorbing state. Given trajectories 1, ..., M, where the m-th trajectory has length L_m, consistent estimates of A and b are

    A_M = (1/T) Σ_{m=1}^{M} Σ_{t=0}^{L_m} z_t^m (γφ_{t+1}^m - φ_t^m)',    (7)

    b_M = (1/T) Σ_{m=1}^{M} Σ_{t=0}^{L_m} z_t^m r_t^m,    (8)

where T is the number of all observed state visits in the M trajectories and z_t is the eligibility trace. Similarly, for absorbing Markov chains, LSPE should use the following structure to estimate D:

    D_M = (1/T) Σ_{m=1}^{M} Σ_{t=0}^{L_m} φ_t^m (φ_t^m)'.    (9)

On a transition from s_t to s_{t+1}, the estimates (7), (8) and (9) can be updated incrementally by a Robbins-Monro (RM) procedure:

    A_{t+1} = A_t + (1/T) ( z_t (γφ_{t+1} - φ_t)' - A_t ),    (10)

    b_{t+1} = b_t + (1/T) ( z_t r_t - b_t ),    (11)

    D_{t+1} = D_t + (1/T) ( φ_t φ_t' - D_t ),    (12)

where T is incremented by T = T + 1 after the three estimates are updated. The convergence of the RM procedure can be proved in a manner similar to the infinite-horizon case (Tadić, 2001)(Nedić & Bertsekas, 2003).

2.2. The Framework

Given A_{t+1} and b_{t+1} estimated by RM, we can define a stochastic model equation A_{t+1} w = -b_{t+1}. Because the RM estimates contain some error, the stochastic model equation is not satisfied exactly, and there is a nonzero residual vector

    e_{t+1}(w) = A_{t+1} w + b_{t+1}.    (13)

A natural idea is to improve the current weights by minimizing the residual error ||e_{t+1}(w)||^2, which produces a gradient descent algorithm

    w_{t+1} = w_t - α_t A_{t+1}' (A_{t+1} w_t + b_{t+1}),

where α_t is a positive step-size. The gradient descent algorithm is a stochastic form of iteration (5). General preconditioned temporal difference (PTD) learning applies the preconditioning technique to improve the convergence rate of gradient descent. Given a chosen preconditioner C_{t+1}, the rule of PTD can be cast as

    w_{t+1} = w_t - α_t C_{t+1}^{-1} A_{t+1}' (A_{t+1} w_t + b_{t+1}),    (14)

where α_t is a scalar that is not necessarily positive. With the rule proposed in Section 3, the step-size guarantees the convergence of PTD algorithms and makes them more flexible in stochastic environments than (6).

The choice of preconditioner is a key issue. Generally, the preconditioner should decrease the spectral radius of gradient descent:

    ρ(I - α_t C_{t+1}^{-1} A_{t+1}' A_{t+1}) < ρ(I - α_t A_{t+1}' A_{t+1}).

Gradient descent performs no preconditioning because it chooses the identity matrix as its preconditioner. Good examples of preconditioners can be found in recent popular policy evaluation algorithms.

2.3. Relations to Previous Algorithms

LSTD, LSPE and iLSTD are all special forms of applying a preconditioner to the gradient descent algorithm; the three choices below are illustrated by the sketch that follows them.

C_{t+1} = -A_{t+1}' D_{t+1}, where D_{t+1} is defined in (12). One can easily verify that this is a variant of LSPE(λ).

C_{t+1} = A_{t+1}' A_{t+1}. This is an extended form of LSTD: w_{t+1} = (1 - α_t) w_t + α_t (-A_{t+1}^{-1} b_{t+1}). Using 1 as the step-size, we get exactly LSTD(λ). Later we will see that LSTD's choice of step-size is optimal because a certain residual error is minimized.

C_{t+1} = -A_{t+1}'. This is a variant of iLSTD(λ) (Yao & Liu, 2008).
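Assuming the RM estimates A_t, b_t (and D_t for the LSPE case) are available as numpy arrays, the three cases above reduce to a single update. This is only a sketch of rule (14) with our own argument names, not the authors' implementation.

```python
import numpy as np

def ptd_step(w, A_t, b_t, D_t=None, preconditioner="lstd", alpha=1.0):
    """One application of the PTD rule (14): w <- w - alpha*C^{-1}A'(Aw + b)."""
    K = len(w)
    residual = A_t @ w + b_t                   # e_{t+1}(w), eq. (13)
    if preconditioner == "grad":               # C = I: plain gradient descent
        C = np.eye(K)
    elif preconditioner == "lspe":             # C = -A'D: a variant of LSPE
        C = -A_t.T @ D_t
    elif preconditioner == "lstd":             # C = A'A: extended LSTD; alpha = 1 recovers LSTD
        C = A_t.T @ A_t
    elif preconditioner == "ilstd":            # C = -A': a variant of iLSTD
        C = -A_t.T
    else:
        raise ValueError(preconditioner)
    return w - alpha * np.linalg.solve(C, A_t.T @ residual)
```

With preconditioner="lstd" and alpha=1 the update collapses to w = -A_t^{-1} b_t, which is exactly the LSTD(λ) solution described above.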
3. The Rule of Step-size

This section presents an adaptive process for computing the step-size online for the gradient descent algorithm, LSTD, iLSTD and LSPE. All four algorithms use a preconditioner of the form A_{t+1}' B_{t+1}, which is assumed to be the form used by general PTD algorithms. The update direction of PTD is given by a preconditioned residual (p-residual) vector

    ε_t = B_{t+1}^{-1} e_{t+1}(w_t),    (15)

where e_{t+1} is the residual vector defined in (13). This p-residual is an "old" one, because it is obtained before the weight update. After the weight update, the p-residual vector changes to ε_{t+1} = B_{t+1}^{-1} e_{t+1}(w_{t+1}). From (14) and (15), ε_{t+1} can be rewritten as

    ε_{t+1} = B_{t+1}^{-1} ( A_{t+1}(w_t - α_t ε_t) + b_{t+1} ) = ε_t - α_t B_{t+1}^{-1} A_{t+1} ε_t.

Because ε_{t+1} stands for an improved difference between the two sides of the preconditioned stochastic model equation, we naturally hope the new p-residual error is smaller than the old one: ||ε_{t+1}||^2 < ||ε_t||^2. This can be guaranteed by requiring that ε_{t+1} be orthogonal to ε_{t+1} - ε_t. Accordingly, we obtain a new rule of step-size

    α_t = ε_t' B_{t+1}^{-1} A_{t+1} ε_t / ( (B_{t+1}^{-1} A_{t+1} ε_t)' (B_{t+1}^{-1} A_{t+1} ε_t) ).    (16)

This step-size is the value that minimizes the new p-residual error over all real step-sizes, i.e., α_t = arg min_α ||ε_{t+1}||^2. The step-size is obviously positive when B_{t+1}^{-1} A_{t+1} is positive definite, which holds for the gradient descent algorithm, iLSTD, and LSTD.

It is interesting that in (16), if B_{t+1} = A_{t+1}, the step-size equals 1. This indicates that LSTD's choice of step-size is optimal in the sense that the p-residual error ||(1 - α_t)(w_t + A_{t+1}^{-1} b_{t+1})||^2 is minimized. When B_{t+1} = -I, the residual error ||(I + α_t A_{t+1}) e_{t+1}(w_t)||^2 is minimized; for ease of later comparison with the previous step-size of iLSTD, this variant will be called the Minimal Residual (MR) algorithm.

To compute the p-residual vector and the step-size, PTD algorithms have to carry out a matrix inversion, which requires O(K^3). The Sherman-Morrison formula can reduce this complexity to O(K^2). Another efficient solution is to apply preconditioning incrementally and take advantage of the sparse nature of RL tasks.

4. Incremental PTD

The key to incremental preconditioning is to approximate the p-residual and the adaptive step-size iteratively.

4.1. Iterative P-residual and Approximated Step-size

Let ξ_t be the error between the residual and the current iterative p-residual, defined by

    ξ_t = e_{t+1}(w_t) - B_{t+1} ε̂_t.    (17)

The estimate of the p-residual can be improved by

    ε̂_{t+1} = ε̂_t - β_t ξ_t,    (18)

where β_t is computed by

    β_t = - (B_{t+1} ξ_t)' ξ_t / ( (B_{t+1} ξ_t)' (B_{t+1} ξ_t) ).    (19)

Substituting the p-residual ε_t in (14) with the iterative p-residual ε̂_t, we get the general form of incremental PTD algorithms

    w_{t+1} = w_t - α̂_t ε̂_t,    (20)

where α̂_t is computed by the following steps. Given the iterative p-residual ε̂_t, steps (21a)-(21d) compute a vector v that approximates B_{t+1}^{-1} A_{t+1} ε̂_t; the approximated step-size is then computed by (21e):

    u_t = A_{t+1} ε̂_t,    (21a)
    ζ_t = u_t - B_{t+1} v_t,    (21b)
    μ_t = - (B_{t+1} ζ_t)' ζ_t / ( (B_{t+1} ζ_t)' (B_{t+1} ζ_t) ),    (21c)
    v_{t+1} = v_t - μ_t ζ_t,    (21d)
    α̂_t = ε̂_t' v_{t+1} / (v_{t+1}' v_{t+1}).    (21e)

If B_{t+1} = -I, the iterative p-residual reproduces the p-residual exactly and we get MR(λ); if B_{t+1} = -D_{t+1}, we get an incremental form of LSPE(λ), called iLSPE(λ), that applies preconditioning via the iterative p-residual.

4.2. Incremental PTD Using CSR

If the function approximation used is sparse, then the matrix Φ is sparse. While this seems a restrictive condition, several popular linear function approximation schemes, such as lookup tables, Boyan's linear interpolation approximation and tile coding (Sutton & Barto, 1998), are indeed sparse. If the transition matrix is also sparse, the matrices A_{t+1} and B_{t+1} will both have many zero entries, implying that "no experience is available for the states related to these entries". It is therefore better to remove the void experience and store only the valid experience in a condensed structure. Here the Compressed Sparse Row (CSR) format (Saad, 2003) is used. Let Q_t stand for A_t or B_t. The CSR format is a triplet Z_{Q_t}(a_t, c_t, d_t), where a_t is a real array containing the values of the nonzero elements of Q_t; c_t is an integer array containing the column indices of the elements stored in a_t; and d_t is an integer array containing pointers to the beginning of each row in a_t and c_t.

When Q_{t+1} is sparse, fast matrix-vector multiplication is where CSR fits in. The details are shown in Algorithm 1, whose complexity is O(l_{t+1}), where l_{t+1} is the number of nonzero entries in Q_{t+1}.

Algorithm 1 Efficient matrix-vector multiplication using CSR
  Input: Z_{Q_{t+1}}(a, c, d) and a vector φ_t
  Output: A vector o_t = Q_{t+1} φ_t
  for k = 1 to K do
    k1 = d_{t+1}(k)
    k2 = d_{t+1}(k+1) - 1
    o_t(k) = a_{t+1}(k1 : k2)' φ_t(c_{t+1}(k1 : k2))
  end for

Now (17), (19), (21a), (21b) and (21c) can each make a call to Algorithm 1, and the complexity of incremental PTD is given by the following theorem, which is proved in (Yao & Liu, 2008).

Theorem 4.2.1 (Complexity of incremental PTD). The per-time-step complexity of incremental PTD using CSR is O(qK), where q is a small positive real related to the sparsity of matrix A.
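Algorithm 1 translates almost line for line into Python. The sketch below uses 0-based indexing, assumes the CSR arrays are numpy arrays, and is ours rather than the authors' code.

```python
import numpy as np

def csr_matvec(a, c, d, phi):
    """Algorithm 1: o = Q*phi for Q stored as a CSR triplet
    (values a, column indices c, row pointers d), all numpy arrays."""
    K = len(d) - 1                       # d holds K+1 row pointers
    o = np.zeros(K)
    for k in range(K):
        start, end = d[k], d[k + 1]      # nonzeros of row k sit in a[start:end]
        o[k] = a[start:end] @ phi[c[start:end]]
    return o
```

The cost is proportional to the number of stored nonzeros, which is what Theorem 4.2.1 exploits; the same product can also be formed with scipy's csr_matrix type.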
5. Boyan Chain Example

The Boyan chain and its features are shown in Figure 1. The transition from state N to N+1 incurs a reward of -2; transitions from state N+1 incur 0; all other transitions incur -3. The discount factor is set to 1.

Figure 1. Boyan chain example with N+1 states. The transition probabilities are marked on the arcs.

The first experiment is another implementation of experience replay. Since the RM procedure is able to extract compressed experience information through its estimates of A and b, it is natural to ask whether experience can be well replayed by repeatedly presenting RM's estimates to PTD algorithms. Two questions arise for this approach. Will it lead to convergence? What is the role of λ for PTD(λ)? The RM(λ) procedure was first run and averaged over 10 sets of 10000 trajectories, and the estimated structures were then repeatedly presented to gradient descent (GRAD(λ)), iLSTD(λ), LSPE(λ) and LSTD(λ).

All compared algorithms used the adaptive step-size derived in Section 3 and converged to satisfactory solutions for a variety of problem sizes and all λ ∈ [0, 1]. The case of N = 12 is shown in Figure 2. It is very interesting that for all PTD algorithms a smaller λ gives better performance; λ = 0 performs best and λ = 1 performs worst, exactly the role λ plays for TD(λ) under the repeated presentation training paradigm (Sutton, 1988). An explanation can be given if we view TD as a model exploration process: although TD does not use the model A and b explicitly, its learning requires exploring and sampling temporal values of the model. It appears that both TD and PTD algorithms rely on the model data reflected by the sets of trajectories. To explore λ's effect on the algorithms, we therefore only have to study its role for the model data, which is extracted by the RM procedure. Results are shown in Figure 3, where the model errors are measured by ||A_T - A||_2 and ||b_T - b||_2, averaged over 10 sets of 10000 trajectories. It can be observed that a smaller λ gives smaller modeling errors for A and b; the role of λ for RM(λ) is thus consistent with its role for TD(λ) and PTD(λ).

Figure 2. Effects of λ: the (same) RMS errors by GRAD(λ), MR(λ), LSPE(λ) and LSTD(λ).

Figure 3. Model errors by RM(λ) procedures.

Although all PTD algorithms converge to the same solution, their rates of convergence are quite different. The case of λ = 1 is shown in Figure 4. MR, LSPE and LSTD are faster than gradient descent because they make use of preconditioning and their spectral radii are smaller than that of gradient descent. Figure 5 compares the spectral radii ρ(I - α_t C_t^{-1} A_t' A_t) of the different algorithms; a short sketch of this quantity appears after the figure captions below.

Figure 4. The role of the preconditioner (λ = 1). Algorithms are stopped once the RMS error falls below 0.01.

Figure 5. Spectral radius comparisons (λ = 1). LSTD's spectral radius is 0 permanently and is therefore not shown.
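As a small illustration of the quantity plotted in Figure 5, the helper below (our own, not the authors' code) computes the iteration radius for a given preconditioner.

```python
import numpy as np

def ptd_radius(A, C, alpha):
    """Spectral radius rho(I - alpha * C^{-1} A'A), which governs the
    convergence rate of the corresponding PTD iteration."""
    K = A.shape[1]
    M = np.eye(K) - alpha * np.linalg.solve(C, A.T @ A)
    return np.max(np.abs(np.linalg.eigvals(M)))

# With the LSTD preconditioner C = A'A and step-size 1, M is the zero
# matrix, which is why LSTD's radius in Figure 5 is identically 0.
```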
Algorithms were also compared under the same learning paradigm as Boyan (Boyan, 1999), where the weights are updated immediately after the RM estimates at each time step. All compared algorithms used the adaptive rule, except that iLSPE used the approximate step-size developed in Section 4.1. For both the adaptive step-size and the approximated step-size, satisfactory convergence was obtained. Results are shown in Figure 6 and Figure 7, where each point is the RMS error averaged over 10 sets of data. It is clear that some intermediate value of λ performs best in both learning error and convergence rate for all algorithms. Generally, the four preconditioned algorithms learn faster than the gradient descent algorithm. However, the convergence rate advantages of MR(λ), iLSPE(λ), LSPE(λ) and LSTD(λ) over gradient descent become smaller as λ increases. The reason may be that a larger λ causes a larger model error and deteriorates the effect of preconditioning.

Figure 6. The RMS errors at the 90000th state visit by GRAD(λ), MR(λ), iLSPE(λ), LSPE(λ) and LSTD(λ).

Figure 7. Comparison of convergence rate in terms of RMS errors at the 900th state visit.

An experiment was also run to compare the adaptive step-size with the rule used by (Geramifard et al., 2006a), which takes α_t = c_0 (c_1 + 1) / (traj# + c_1), where c_0 was chosen from {0.01, 0.1, 1} and c_1 from {100, 1000, 10^6}. The best performance over all nine combinations of the two constants was experimentally chosen for iLSTD. Figure 8 shows that the RMS error of MR (adaptive step-size) decreases faster than that of iLSTD. From Figure 8 we can also observe that PTD's predictions (such as those given by LSTD and LSPE) have larger variations than incremental PTD's (such as those given by MR and iLSPE). The reason is that PTD algorithms are based on the inversion of the preconditioner, which is not well conditioned at the beginning stage of learning, whereas incremental PTD algorithms avoid this numerical instability via the iterative p-residual.

Figure 8. Averaged RMS errors (over 10 × 10000 trajectories) of iLSTD, MR, iLSPE, LSPE and LSTD in Boyan's setting (λ = 0). Perturbation factors of LSTD and LSPE were -0.01 and 0.01 respectively. Weights for all algorithms were initialized to [0.1, 0.1, 0.1, 0.1]' except for LSTD.

Table 1 compares the per-time-step running time of PTD and incremental PTD algorithms, where CSR is used for MR (one CSR for A_t) and for iLSPE (one CSR for A_t and one for D_t). Incremental PTD algorithms have a clear computational advantage over PTD algorithms, because CSR lets incremental PTD manipulate a much smaller amount of data than PTD. Table 2 shows the relative memory requirements of CSR and a full matrix for a variety of problem sizes via the ratio l/K^2, where l is the number of nonzero entries of A. The larger the state space, the greater the advantage gained by using CSR.

Table 1. Comparison of per-time-step running time (ms) of PTD and incremental PTD algorithms for K = 401 (λ = 0). The machine used is a Pentium(R) 4 PC (CPU 3.00GHz; RAM 1.00GB).

    LSTD    LSPE    iLSTD   MR      iLSPE
    72.3    120.3   5.9     10.6    21.9

Table 2. Comparison of memory requirements for A using CSR and a full matrix for a variety of problem sizes (λ = 0), measured by the ratio l/K^2.

    N       12      100      400     800      1200     1600
    l/K^2   0.75    0.1479   0.04    0.0198   0.0132   0.01

6. Conclusion

In this paper we proposed two general frameworks, PTD and incremental PTD, which are more data efficient than TD. PTD approaches such as LSTD and LSPE are generally computationally expensive, whereas incremental PTD algorithms such as MR and iLSPE take advantage of the sparse nature of RL tasks and have complexity near that of iLSTD and TD. We also developed an adaptive process for computing the step-size online for PTD algorithms, and an approximated process for computing the step-size for incremental PTD algorithms.

Acknowledgement

We are thankful to Lihong Li, George Konidaris and Andrew Barto for helpful discussions on a draft of this paper. We appreciate the suggestions of the four reviewers, which helped improve this paper in many aspects. This research has been partially supported by RGC CERG grant No. CityU 1178/06 (9041147) from Hong Kong UGC.

References
Boyan, J. A. (1999). Least-squares temporal difference learning. Proceedings of the Sixteenth International Conference on Machine Learning (pp. 49-56). Morgan Kaufmann.

Bradtke, S., & Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22, 33-57.

Geramifard, A., Bowling, M., & Sutton, R. S. (2006a). Incremental least-squares temporal difference learning. Twenty-First National Conference on Artificial Intelligence (AAAI-06) (pp. 356-361). AAAI Press.

Geramifard, A., Bowling, M., Zinkevich, M., & Sutton, R. S. (2006b). iLSTD: Eligibility traces and convergence analysis. Advances in Neural Information Processing Systems 19 (pp. 441-448).

Lin, L.-J., & Mitchell, T. M. (1992). Memory approaches to reinforcement learning in non-Markovian domains (Technical Report CMU-CS-92-138). Carnegie Mellon University, Pittsburgh, PA 15213.

Nedić, A., & Bertsekas, D. P. (2003). Least-squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13, 79-110.

Saad, Y. (2003). Iterative methods for sparse linear systems. SIAM.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.

Tadić, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42, 241-267.

Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42, 674-690.

Xu, X., He, H., & Hu, D. (2002). Efficient reinforcement learning using recursive least-squares methods. Journal of Artificial Intelligence Research, 16, 259-292.

Yao, H., & Liu, Z. (2008). Preconditioned temporal difference learning (Technical Report CityU-SCMMCG-0408). City University of Hong Kong.