Convergence of Least Squares Temporal Difference Methods Under General Conditions

Huizhen Yu
Department of Computer Science, University of Helsinki, Finland
janey.yu@cs.helsinki.fi

Abstract

We consider approximate policy evaluation for finite state and action Markov decision processes (MDP) in the off-policy learning context and with the simulation-based least squares temporal difference algorithm, LSTD(λ). We establish for the discounted cost criterion that the off-policy LSTD(λ) converges almost surely under mild, minimal conditions. We also analyze other convergence and boundedness properties of the iterates involved in the algorithm, and based on them, we suggest a modification in its practical implementation. Our analysis uses theories of both finite space Markov chains and Markov chains on topological spaces.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

1. Overview

We consider approximate policy evaluation for finite state and action Markov decision processes (MDP) in an exploration-enhanced learning context, called "off-policy" learning. In this context, we employ a certain policy called the "behavior policy" to adequately explore the state and action space, and using the observations of costs and transitions generated under the behavior policy, we may approximately evaluate any suitable "target policy" of interest. This differs from the standard policy evaluation case, "on-policy" learning, where the behavior policy always coincides with the policy to be evaluated. The dichotomy between off-policy and on-policy learning stems from the exploration-exploitation tradeoff in practical model-free/simulation-based methods for policy search. With their flexibility, off-policy methods form an important part of the model-free learning methodology (Sutton & Barto, 1998) and have been suggested as important simulation-based methods for large-scale dynamic programming (Glynn & Iglehart, 1989). The algorithm we consider in this paper, the off-policy least squares temporal difference (TD) algorithm, LSTD(λ), is one of the exploration-enhanced methods for policy evaluation.

More specifically, we consider discounted total cost problems with discount factor α < 1. We evaluate the so-called Q-factors of the target policy, which are essential for policy iteration, and which are simply the costs of the policy in an equivalent MDP that has as its states the joint state-action pairs of the original MDP (see Footnote 1) [see e.g., (Bertsekas & Tsitsiklis, 1996)]. This MDP will be the focus of our discussion, and we will refer to Q-factors as costs for simplicity. Let I = {1, 2, ..., n} be the set of state-action pairs indexed by integers from 1 to n. We assume that the behavior policy induces an irreducible Markov chain on the space I of state-action pairs with transition matrix P, and that the target policy we aim to evaluate would induce a Markov chain with transition matrix Q. We require naturally that for all states, possible actions of the target policy are also possible actions of the behavior policy. This condition, denoted Q ≺ P, can be written as

    p_ij = 0  ⟹  q_ij = 0,   for all i, j ∈ I.    (1)

Let g be the vector of expected one-stage costs g(i) under the target policy. The cost J* of the target policy satisfies the Bellman equation

    J* = g + αQJ*.    (2)

Footnote 1: The equivalent MDP on the space of state-action pairs can be defined as follows. Consider any two state-action pairs i = (s, u) and j = (ŝ, v). Suppose a transition from s to ŝ under action u incurs the cost c(s, u, ŝ) in the original MDP. Then the cost of transition from i to j in the equivalent MDP can be defined as g(i, j) = c(s, u, ŝ). The transition probability from i to j under a policy which takes action v at state ŝ with probability µ(v | ŝ) is given by p(ŝ | s, u) µ(v | ŝ).
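To make the state-action construction of Footnote 1 and condition (1) concrete, here is a minimal Python sketch. It is an illustration only, not part of the paper: the small MDP, the policies, and all variable names are hypothetical. It builds the transition matrices P and Q of the state-action chain from a transition kernel and a behavior/target policy pair, checks the absolute-continuity condition (1), and shows that where p_ij > 0 the ratio q_ij/p_ij reduces to a ratio of action probabilities.

```python
import numpy as np

# Hypothetical small MDP: 2 states, 2 actions (all numbers are illustrative only).
n_s, n_a = 2, 2
rng = np.random.default_rng(0)
trans = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # trans[s, a, s'] = p(s' | s, a)
mu_beh = np.array([[0.5, 0.5], [0.5, 0.5]])            # behavior policy, explores all actions
mu_tar = np.array([[0.9, 0.1], [0.2, 0.8]])            # target policy to be evaluated

# State-action pairs i = (s, u), indexed by s * n_a + u, as in Footnote 1.
pairs = [(s, u) for s in range(n_s) for u in range(n_a)]
n = len(pairs)
P = np.zeros((n, n))   # chain induced by the behavior policy
Q = np.zeros((n, n))   # chain induced by the target policy
for i, (s, u) in enumerate(pairs):
    for j, (s2, v) in enumerate(pairs):
        P[i, j] = trans[s, u, s2] * mu_beh[s2, v]
        Q[i, j] = trans[s, u, s2] * mu_tar[s2, v]

# Condition (1): p_ij = 0 implies q_ij = 0 (target actions are possible under behavior).
assert np.all(Q[P == 0] == 0)

# Where p_ij > 0, the ratio q_ij / p_ij is just a ratio of action probabilities.
i, j = 0, 3                      # e.g., i = (0, 0), j = (1, 1)
s2, v = pairs[j]
if P[i, j] > 0:
    assert np.isclose(Q[i, j] / P[i, j], mu_tar[s2, v] / mu_beh[s2, v])
```

The last check is exactly the property exploited later by the algorithm: the importance ratios never require storing P and Q explicitly.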
With the temporal difference methods (Sutton, 1988) [see also the books (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998; Bertsekas, 2007; Meyn, 2007)], we approximate J* by the solution of a projected multistep Bellman equation

    J = Π T^(λ)(J),    (3)

involving a multistep Bellman operator T^(λ) parametrized by λ ∈ [0, 1], whose exact form will be given later. Here Π is a linear operator of projection onto an approximation subspace {Φr | r ∈ ℝ^{n_r}} ⊂ ℝ^n with respect to a weighted Euclidean norm, where Φ is an n × n_r matrix whose columns span the approximation subspace and whose rows are often called "features" of states/actions. In the case considered here, we take the weights in the projection norm to be the steady-state probabilities of the Markov chain induced by the behavior policy. The projected Bellman equation (3) is equivalent to a low dimensional equation on ℝ^{n_r}, and its solution r* (when it exists) is used to approximate the cost J* of the target policy.

The off-policy LSTD(λ) algorithm that we will analyze aims to construct the low-dimensional equivalent of the projected equation (3) by using observations generated under the behavior policy. The algorithm takes into account the discrepancies between the behavior and the target policies by properly weighting the observations. The technique is based on importance sampling, which is widely used in dynamic programming and reinforcement learning contexts [see e.g., (Glynn & Iglehart, 1989; Sutton & Barto, 1998; Precup et al., 2001; Ahamed et al., 2006)]. The off-policy LSTD(λ) algorithm we will analyze was first given by (Bertsekas & Yu, 2009) in the general context of approximating solutions of linear systems of equations. The form of the algorithm bears similarities to other off-policy TD(λ) algorithms, e.g., the episodic off-policy TD(λ) (Precup et al., 2001), as well as to the on-policy LSTD(λ) counterpart (Bradtke & Barto, 1996; Boyan, 1999).

The algorithm can be described as follows. Let (i_0, i_1, ...) be an infinitely long state trajectory of the Markov chain with transition matrix P, generated under the behavior policy. Let φ(i) denote the transpose of the ith row vector of the matrix Φ, and let g(i, j) be the per-stage cost of transition from state i to j. The off-policy LSTD(λ) method computes low-dimensional vector iterates Z_t, b_t and matrix iterates C_t as follows: with (z_0, b_0, C_0) being the initial condition, for t ≥ 1,

    Z_t = λα (q_{i_{t-1} i_t} / p_{i_{t-1} i_t}) Z_{t-1} + φ(i_t),    (4)
    b_t = (1 − 1/(t+1)) b_{t-1} + (1/(t+1)) Z_t g(i_t, i_{t+1}),    (5)
    C_t = (1 − 1/(t+1)) C_{t-1} + (1/(t+1)) Z_t ( α (q_{i_t i_{t+1}} / p_{i_t i_{t+1}}) φ(i_{t+1}) − φ(i_t) )'.    (6)
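For concreteness, the following minimal Python sketch implements the iterates (4)-(6) for a given trajectory. It is an illustration under assumptions, not the paper's code: the function name, the default choice z_0 = φ(i_0), and the input format (precomputed costs and importance ratios) are all ours.

```python
import numpy as np

def off_policy_lstd_lambda(traj, costs, ratios, Phi, lam, alpha, z0=None):
    """Run the off-policy LSTD(lambda) iterates (4)-(6) along one trajectory.

    traj   : states i_0, ..., i_T (row indices into Phi), simulated under the behavior policy
    costs  : costs[t] = g(i_t, i_{t+1}), for t = 0, ..., T-1
    ratios : ratios[t] = q_{i_t i_{t+1}} / p_{i_t i_{t+1}}, same indexing as costs
    Phi    : n x n_r feature matrix; phi(i) is row i of Phi
    """
    n_r = Phi.shape[1]
    Z = np.asarray(Phi[traj[0]], dtype=float) if z0 is None else np.array(z0, dtype=float)
    b = np.zeros(n_r)
    C = np.zeros((n_r, n_r))
    for t in range(1, len(traj) - 1):
        # Eq. (4): eligibility vector with importance-sampling correction.
        Z = lam * alpha * ratios[t - 1] * Z + Phi[traj[t]]
        step = 1.0 / (t + 1)
        # Eq. (5): running average forming the vector b_t.
        b = (1 - step) * b + step * Z * costs[t]
        # Eq. (6): running average forming the matrix C_t.
        temp_diff = alpha * ratios[t] * Phi[traj[t + 1]] - Phi[traj[t]]
        C = (1 - step) * C + step * np.outer(Z, temp_diff)
    # Approximate solution: solve C_t r + b_t = 0 (assuming C_t is invertible).
    r_hat = np.linalg.solve(C, -b)
    return b, C, r_hat
```

As noted above, the ratios passed in can be computed on-line from the action probabilities of the two policies, so neither P nor Q needs to be stored.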
The vector b_t and matrix C_t aim to approximate the quantities defining the projected Bellman equation (3). The solution r̂_t of the equation C_t r + b_t = 0 is used to give Φ r̂_t as an approximation of J* at time t. (Footnote 2: In this paper we do not discuss the exceptional case where C_t r + b_t = 0 does not have a solution. Our focus will be on the asymptotic properties of the sequence of equations C_t r + b_t = 0 themselves.)

The on-policy case corresponds to the special case P = Q. There, all the ratios q_{i_{t-1} i_t}/p_{i_{t-1} i_t} appearing above in Z_t and C_t become 1, and the algorithm reduces to the on-policy LSTD algorithm as first given by (Bradtke & Barto, 1996) for λ = 0 and (Boyan, 1999) for λ ∈ [0, 1]. In the off-policy case, a property of practical importance is that the ratios q_ij/p_ij are determined by the ratios between the action probabilities of the target and the behavior policies (as can be seen from Footnote 1); therefore, they need not be stored and can be calculated on-line in the algorithm.

A full convergence analysis of the off-policy LSTD(λ) algorithm does not exist in the literature, to our knowledge. The almost sure convergence of the algorithm (i.e., convergence with probability one) in special cases has been studied. A proof for the on-policy case can be found in (Nedić & Bertsekas, 2003). A proof for the off-policy case under the assumption that λα max_{i,j} q_ij/p_ij < 1 (with 0/0 treated as 0) is given in (Bertsekas & Yu, 2009); this covers the on-policy case as well as the off-policy LSTD(λ) for λ close or equal to 0, but for a general value of λ, the condition is too stringent on either the target or the behavior policy. Note that the case with a general value of λ is important in practice, because using a large value of λ not only improves the quality of the approximation from the projected Bellman equation, but also avoids potential pathologies regarding the existence of a solution of the equation (as λ approaches 1, Π T^(λ) becomes a contraction mapping, ensuring the existence of a unique solution).

In this work, we establish the almost sure convergence of the sequences {b_t}, {C_t}, as well as their convergence in the first mean, under the general conditions given at the beginning, namely, the irreducibility of P and Q ≺ P. Our results imply that the off-policy LSTD(λ) solution r̂_t converges to the solution r* of the projected Bellman equation (3) almost surely, whenever Eq. (3) has a unique solution (if (3) has multiple solutions, then any limit point of {r̂_t} is a solution of it). As will be seen later, the convergence of {b_t}, {C_t} in the first mean (Theorem 1) can be established using arguments based on finite space Markov chains, while the proof of the almost sure convergence is not so straightforward and finite space Markov chain-based arguments are no longer sufficient. In contrast to the relative simplicity of the on-policy case, the technical complexity here is partly due to the fact that the sequence {Z_t} cannot be bounded a priori. Indeed, we can show that for the off-policy case, in fairly common situations, {Z_t} is almost surely unbounded (Prop. 2). Neither does it seem likely that, without imposing extra conditions, the sequence of Z_t can have bounded variance. Nevertheless, these do not preclude the almost sure convergence of the off-policy LSTD(λ) algorithm, as we will show.

It is worth mentioning that the study of the almost sure convergence of the off-policy LSTD(λ) is not solely of theoretical interest. Various TD algorithms other than LSTD(λ) use the same approximations b_t, C_t to build approximating models [e.g., preconditioned TD(λ) (Yao & Liu, 2008)] or fixed point mappings [e.g., LSPE(λ), see (Bertsekas & Yu, 2009); and (Bertsekas, 2009)] needed in the algorithms.
Therefore in the off-policy case, the asymptotic behavior of these algorithms on a sample path depends on the mode of convergence of {b_t}, {C_t}, and so does the interpretation of the approximate solutions generated by these algorithms. For algorithms whose convergence relies on the contraction property of mappings (for instance, LSPE(λ)), the almost sure convergence of {b_t}, {C_t} on every sample path is critical. Furthermore, the mode of convergence of the off-policy LSTD(λ) is also relevant for understanding the behavior of other off-policy TD algorithms, e.g., the non-episodic off-policy TD(λ) and episodic off-policy TD(λ) with very long episodes, which, although not computing directly b_t, C_t, implicitly depend on the convergence properties of {b_t}, {C_t}.

To establish the almost sure convergence of {b_t}, {C_t}, we will study the Markov chain {(i_t, Z_t)} on the topological space I × ℝ^{n_r}. Again, the lack of a boundedness condition on Z_t makes it difficult to argue the existence of an invariant probability measure by constructing explicitly the form of Z_t for a stationary Markov chain {(i_t, Z_t)} in the limit, as can be done in the on-policy case (Tsitsiklis & Van Roy, 1997). We will use the theory of e-chains (Meyn & Tweedie, 2009), which concerns topological space Markov chains with equicontinuous transition kernels, to establish two main results: (i) the Markov chain {(i_t, Z_t)} has a unique invariant probability measure and is ergodic (Theorem 2), and (ii) the almost sure convergence of {b_t}, {C_t} (and hence the almost sure convergence of the off-policy LSTD(λ) algorithm) (Theorem 3). The first ergodicity result is indeed stronger than what is needed to show (ii); but it sheds light on the nature of the TD iterates and provides a basis for analyzing other off-policy TD(λ) algorithms in the future. Let us also mention the ODE proof approach: relevant here is the mean-ODE method [see e.g., (Kushner & Yin, 2003; Borkar, 2008)], which, however, requires the verification of conditions that in our case would be tantamount to the almost sure convergence conclusion we want to establish.

The paper is organized as follows. We specify notation and definitions in Section 2. We present our main results and outline their key proof arguments in Section 3. We then give proof details for two of the main theorems, namely, Theorems 1 and 3, in Section 4. Complete proofs and further discussion can be found in (Yu, 2010).

2. Notation and Specifications

The projected Bellman equation (3) associated with TD(λ) methods is a projected version of a multistep Bellman equation parametrized by λ ∈ [0, 1]. In particular, let T be the Bellman operator T(J) = g + αQJ for all J ∈ ℝ^n. The mapping T^(λ) in Eq. (3) is defined by

    T^(λ) = (1 − λ) Σ_{m=0}^∞ λ^m T^{m+1},  λ ∈ [0, 1);    T^(1)(J) = lim_{λ→1} T^(λ)(J),  J ∈ ℝ^n.

Let Ξ denote the diagonal matrix with the diagonal elements being the steady-state probabilities of the Markov chain with transition matrix P, induced by the behavior policy. Equation (3) is equivalent to the low dimensional equation on ℝ^{n_r},

    Φ'ΞΦ r = Φ'Ξ T^(λ)(Φr) = Φ'Ξ Σ_{m=0}^∞ λ^m (αQ)^m ( g + (1 − λ) αQ Φr ).

By rearranging terms, it can be written as

    C̄ r + b̄ = 0,    (7)

where b̄ is an n_r × 1 vector and C̄ an n_r × n_r matrix, given by

    b̄ = Φ'Ξ Σ_{m=0}^∞ λ^m (αQ)^m g,    (8)
    C̄ = Φ'Ξ Σ_{m=0}^∞ λ^m (αQ)^m (αQ − I) Φ.    (9)

The iterates b_t, C_t in the off-policy LSTD(λ) [Eqs. (5), (6)] aim to approximate b̄, C̄, respectively (which define the projected equation (7) and, equivalently, (3)).
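When the model quantities P, Q, g, Φ are available, b̄ and C̄ can be formed directly, since Σ_{m≥0} λ^m (αQ)^m = (I − λαQ)^{-1} for λα < 1. The sketch below is an illustration under these assumptions (the function name and the eigenvector-based computation of the steady-state probabilities are ours, not the paper's); it is useful mainly as a ground-truth reference against which the trajectory-based iterates b_t, C_t can be checked numerically.

```python
import numpy as np

def projected_equation(P, Q, g, Phi, lam, alpha):
    """Form b_bar, C_bar of Eqs. (8)-(9) and solve C_bar r + b_bar = 0.

    P, Q : n x n transition matrices of the behavior and target chains
    g    : length-n vector of expected one-stage costs under the target policy
    Phi  : n x n_r feature matrix
    """
    n = P.shape[0]
    # Steady-state probabilities of the behavior chain (left eigenvector of P for eigenvalue 1).
    evals, evecs = np.linalg.eig(P.T)
    xi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    xi = xi / xi.sum()
    Xi = np.diag(xi)
    # Sum_{m>=0} (lam*alpha*Q)^m = (I - lam*alpha*Q)^{-1}, valid since lam*alpha < 1.
    K = np.linalg.inv(np.eye(n) - lam * alpha * Q)
    b_bar = Phi.T @ Xi @ K @ g
    C_bar = Phi.T @ Xi @ K @ (alpha * Q - np.eye(n)) @ Phi
    r_star = np.linalg.solve(C_bar, -b_bar)   # assumes a unique solution exists
    return b_bar, C_bar, r_star
```

Comparing the output of this routine with the b_t, C_t produced along a long behavior-policy trajectory gives a direct, if informal, numerical check of the convergence results stated below.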
Their convergence to b̄, C̄, respectively, in any relevant mode, is what we want to show.

In the rest of the paper, we use i_t to denote the random state variable at time t and ī or î to denote specific states. To simplify notation, we denote β = λα and study iterates of the form

    Z_t = β (q_{i_{t-1} i_t} / p_{i_{t-1} i_t}) Z_{t-1} + φ(i_t),    (10)
    G_t = (1 − γ_t) G_{t-1} + γ_t Z_t ψ(i_t, i_{t+1})',    (11)

with β < 1, (z_0, G_0) being the initial condition, and {γ_t} being a stepsize sequence. The correspondence between the iterates G_t and the vectors b_t and matrices C_t in LSTD(λ) [cf. Eqs. (5), (6)] is as follows: with γ_t = 1/(t+1),

    G_t = b_t, if ψ(i_t, i_{t+1}) = g(i_t, i_{t+1}),
    G_t = C_t, if ψ(i_t, i_{t+1}) = α (q_{i_t i_{t+1}} / p_{i_t i_{t+1}}) φ(i_{t+1}) − φ(i_t).    (12)

We want to show that G_t converges, in any relevant mode, to the constant vector/matrix

    G* = Φ'Ξ Σ_{m=0}^∞ β^m Q^m Ψ̄,    (13)

where the vector/matrix Ψ̄ is given row-wise by Ψ̄ = [ψ̄(1)' ; ψ̄(2)' ; ... ; ψ̄(n)'], with ψ̄(i) = E_0[ψ(i_0, i_1) | i_0 = i]. Here and in what follows E_0 denotes the expectation with respect to the distribution of the Markov chain {i_t} with transition matrix P. As can be seen, corresponding to the two choices of ψ in the expression of G_t [Eq. (12)], Ψ̄ equals g or (αQ − I)Φ, and G* equals b̄ or C̄, respectively [cf. Eqs. (8)-(9)].

We make two assumptions, one on the transition matrices P and Q, as mentioned at the beginning of Section 1, and the other on the stepsize sequence.

Assumption 1. The Markov chain {i_t} with transition matrix P is irreducible, and Q ≺ P in the sense of Eq. (1).

Assumption 2. The sequence of stepsizes γ_t is deterministic and satisfies

    γ_t ∈ (0, 1],   Σ_t γ_t = ∞,   Σ_t γ_t² < ∞,   lim sup_t γ_{t-1}/γ_t < ∞.    (14)

Such sequences of γ_t include 1/t and t^{-c}, c ∈ (1/2, 1], for instance. When conclusions hold for a specific sequence {γ_t}, such as γ_t = 1/(t+1), we will state them explicitly.
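As a small consistency check of the notation just introduced (a sketch under our own assumptions, not from the paper), note that the recursion (11) with γ_t = 1/(t+1) and G_0 = 0 is just a running average, so G_t can equivalently be computed as (Σ_{k=1}^t Z_k ψ(i_k, i_{k+1})') / (t+1); the following lines verify this equivalence with random placeholder values standing in for Z_t and ψ(i_t, i_{t+1}).

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_r = 200, 3
# Hypothetical placeholders for Z_t and psi(i_t, i_{t+1}); in the algorithm these
# come from Eqs. (10) and (12). Here they only serve to check the recursion (11).
Z = rng.normal(size=(T + 1, n_r))
psi = rng.normal(size=(T + 1, n_r))

G = np.zeros((n_r, n_r))                 # G_0 = 0
running_sum = np.zeros((n_r, n_r))
for t in range(1, T + 1):
    gamma = 1.0 / (t + 1)
    G = (1 - gamma) * G + gamma * np.outer(Z[t], psi[t])      # Eq. (11)
    running_sum += np.outer(Z[t], psi[t])
    assert np.allclose(G, running_sum / (t + 1))              # running-average form
```

This running-average form of G_t is also the form used later in the proof of Theorem 3.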
3. Main Results

We pursue separately two lines of analysis, one based on properties of the finite space Markov chain {i_t} and the other based on properties of the topological space Markov chain {(i_t, Z_t)}. In this section we overview our main results and outline key proof arguments. Throughout the paper, let ‖·‖ denote the norm ‖V‖ = max_{i,j} |V_{ij}| for a matrix V, and the infinity norm ‖V‖ = max_i |V_i| for a vector V; in particular, ‖V‖ = |V| for a scalar V. Let "a.s." stand for almost sure convergence.

3.1. Analysis Based on Finite Space Markov Chains

First, it is not difficult to show that G_t converges in mean. This implies immediately that the LSTD(λ) solution r̂_t converges in probability to the solution r* of Eq. (7) when the latter exists and is unique.

Theorem 1. Under Assumption 1, for each initial condition z_0, sup_t E‖Z_t‖ ≤ c/(1 − β), where c = max{‖z_0‖, max_i ‖φ(i)‖}. Under Assumptions 1 and 2, for each initial condition (z_0, G_0), lim_{t→∞} E‖G_t − G*‖ = 0.

Next, based essentially on a zero-one law for tail events of Markov chains [see (Breiman, 1992), Theorem 7.43], we can show the following result. (An event is called a tail event of a process {X_t} if it is determined by X_t, t ≥ k, for any k [see e.g., (Breiman, 1992), Def. 3.10].)

Proposition 1. Under Assumptions 1 and 2, for each initial condition (z_0, G_0) and any E of the following events, either P(E) = 0 or P(E) = 1:
(i) E = {lim_t G_t exists, and sup_t ‖Z_t‖ < ∞};
(ii) E = {sup_t ‖Z_t‖ < ∞};
(iii) E = {lim_t γ_t Z_t = 0};
(iv) E = {lim_t G_t exists}.

Theorem 1 and Prop. 1(iv) together have the following implication on the convergence of G_t. According to Prop. 1(iv), for the event E = {lim_t G_t exists}, we have P(E) = 1 or 0. Suppose P(E) = 1. Then G_t → G̃ a.s. for some random variable G̃. Since Theorem 1 implies G_t → G* in probability, which further implies the almost sure convergence of a subsequence G_{t_k} → G*, we must have G̃ = G* a.s.; therefore G_t → G* a.s. Suppose now P(E) = 0. Then we only have the convergence of G_t to G* in probability implied by Theorem 1, and with probability 1, on every sample path G_t does not converge. In Section 3.2, we will rule out the second case for the stepsize sequence γ_t = 1/(t+1), using the line of analysis based on the Markov chain {(i_t, Z_t)}.

We discuss other implications of Prop. 1, contrasting the off-policy case with the standard, on-policy case where P = Q. In the latter case, events (i) and (ii) in Prop. 1 both have probability one; event (ii), the boundedness of Z_t, is true by the definition of Z_t. By contrast, in the off-policy case, under seemingly fairly common situations (as we show below), Z_t is almost surely unbounded, and consequently, events (i) and (ii) have probability zero. While the unboundedness of Z_t may sound disquieting, note that it is γ_t Z_t → 0 a.s., the event shown in (iii), and not the boundedness of Z_t, that is necessary for the almost sure convergence of G_t. In other words, {lim_t G_t exists} ⊂ {lim_t γ_t Z_t = 0}. (This can be seen from the fact that G_t − G_{t-1} = −γ_t G_{t-1} + γ_t Z_t ψ(i_t, i_{t+1})', and γ_t → 0 as t → ∞.) For practical implementation, however, the unboundedness of Z_t can be unwieldy. This suggests that in practice, instead of iterating Z_t directly, we equivalently iterate γ_t Z_t via

    γ_t Z_t = β (q_{i_{t-1} i_t} / p_{i_{t-1} i_t}) (γ_t / γ_{t-1}) · (γ_{t-1} Z_{t-1}) + γ_t φ(i_t),    (15)

whenever the magnitude of Z_t becomes intolerably large. That γ_t Z_t → 0 a.s. when γ_t = 1/(t+1) will be implied by the almost sure convergence of G_t that we later establish.
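The rescaled recursion (15) is straightforward to implement. The sketch below is illustrative only (inputs as in the earlier hypothetical helpers, with γ_t = 1/(t+1) and z_0 = φ(i_0)); it propagates S_t = γ_t Z_t in place of Z_t, so the quantity actually stored stays on the scale of the averages b_t, C_t even when ‖Z_t‖ grows.

```python
import numpy as np

def lstd_lambda_rescaled(traj, costs, ratios, Phi, lam, alpha):
    """Variant of Eqs. (4)-(6) that propagates S_t = gamma_t * Z_t via Eq. (15),
    with gamma_t = 1/(t+1), to avoid storing the possibly unbounded Z_t."""
    beta = lam * alpha
    n_r = Phi.shape[1]
    S = Phi[traj[0]].astype(float)   # gamma_0 * z_0 with gamma_0 = 1 and z_0 = phi(i_0)
    b = np.zeros(n_r)
    C = np.zeros((n_r, n_r))
    gamma_prev = 1.0
    for t in range(1, len(traj) - 1):
        gamma = 1.0 / (t + 1)
        # Eq. (15): gamma_t Z_t = beta (q/p) (gamma_t/gamma_{t-1}) (gamma_{t-1} Z_{t-1}) + gamma_t phi(i_t)
        S = beta * ratios[t - 1] * (gamma / gamma_prev) * S + gamma * Phi[traj[t]]
        # Eqs. (5)-(6) rewritten in terms of S = gamma_t Z_t (note: no extra factor gamma_t here).
        b = (1 - gamma) * b + S * costs[t]
        temp_diff = alpha * ratios[t] * Phi[traj[t + 1]] - Phi[traj[t]]
        C = (1 - gamma) * C + np.outer(S, temp_diff)
        gamma_prev = gamma
    return b, C
```

Run alongside the direct recursion, this produces the same b_t, C_t up to floating-point error, since (15) is only an algebraic rescaling of (4).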
We now demonstrate by construction that in seemingly fairly common situations, Z_t is almost surely unbounded. Our construction is based on a consequence of the extended Borel-Cantelli lemma [(Breiman, 1992), Problem 5.9, p. 97], which says that for any process {X_t, t ≥ 0} with X_t taking values in a space S, and any measurable subsets A, B of S, if for all t, P(X_s ∈ B for some s > t | X_t, X_{t-1}, ..., X_0) is bounded away from zero on {X_t ∈ A}, then {X_t ∈ A i.o.} ⊂ {X_t ∈ B i.o.} a.s. Here, "i.o." stands for "infinitely often," and "a.s." means that the set-inclusion relation holds after excluding a set of zero probability. In our context, this result together with the zero-one probability statement for the event {sup_t ‖Z_t‖ < ∞} in Prop. 1(ii) has the following implications.

Denote by Z_{t,j} and φ_j(i_t) the jth element of the vector Z_t and φ(i_t), respectively. Consider a cycle configuration of states (ī_1, ī_2, ..., ī_m, ī_1) with the following three properties:

(a) it occurs with positive probability:
    p_{ī_1 ī_2} p_{ī_2 ī_3} ··· p_{ī_m ī_1} > 0;    (16)

(b) it has an amplifying effect in the sense that
    β^m (q_{ī_1 ī_2} / p_{ī_1 ī_2}) (q_{ī_2 ī_3} / p_{ī_2 ī_3}) ··· (q_{ī_m ī_1} / p_{ī_m ī_1}) > 1;    (17)

(c) for some j̄, the j̄th elements of φ(ī_1), ..., φ(ī_m) have the same sign and their sum is non-zero: i.e., either
    for all k = 1, ..., m, φ_j̄(ī_k) ≥ 0, with φ_j̄(ī_k) > 0 for some k;    (18)
or
    for all k = 1, ..., m, φ_j̄(ī_k) ≤ 0, with φ_j̄(ī_k) < 0 for some k.    (19)

Proposition 2. Suppose there exists a cycle configuration of states (ī_1, ī_2, ..., ī_m, ī_1) possessing properties (a)-(c) above, and j̄ is as in (c). Then there exists a constant δ, which depends on the cycle and is negative (respectively, positive) if Eq. (18) (respectively, Eq. (19)) holds in (c), such that if for some neighborhood O(δ) of δ, P(i_t = ī_1, Z_{t,j̄} ∉ O(δ) i.o.) = 1, then P(sup_t ‖Z_t‖ = ∞) = 1.

We remark that the extra technical condition P(i_t = ī_1, Z_{t,j̄} ∉ O(δ) i.o.) = 1 in Prop. 2 is nonrestrictive. The opposite case, that on a set with non-negligible probability, Z_{t,j̄} eventually always lies arbitrarily close to δ whenever i_t = ī_1, seems unlikely to occur except in highly contrived examples.

3.2. Analysis Based on Topological Space Markov Chains

To establish the almost sure convergence of G_t to G*, we consider the Markov chain {(i_t, Z_t), t ≥ 0} on the topological space S = I × ℝ^{n_r} with the product topology (discrete topology on I and the usual topology on ℝ^{n_r}). We show that {(i_t, Z_t)} can be related to a type of Markov chains, called e-chains, whose transition kernel functions possess a certain equicontinuity property (Meyn & Tweedie, 2009). Central to our proof is the analysis of the differences in the processes {Z_t} for different initial conditions z_0 and the same sample path of {i_t}. As can already be seen from Eq. (10), for two such processes {Z_t}, {Ẑ_t} with initial conditions z_0, ẑ_0, respectively, their differences satisfy the simple recursion

    Ẑ_t − Z_t = β (q_{i_{t-1} i_t} / p_{i_{t-1} i_t}) (Ẑ_{t-1} − Z_{t-1}),    (20)

which implies that the difference sequence converges almost surely to zero (Lemma 1). Using more careful characterizations of such difference sequences together with the first part of Theorem 1, we can establish the various properties needed for applying the law of large numbers (LLN) for e-chains (Meyn & Tweedie, 2009) and show that the chain {(i_t, Z_t)} is ergodic. Our conclusions are summarized in the following two theorems. (Definitions of related terminologies and detailed analysis can be found in (Yu, 2010).)

Theorem 2. Under Assumption 1, the Markov chain {(i_t, Z_t)} is an e-chain with a unique invariant probability measure π, and almost surely, for each initial condition, the sequence of occupation measures {µ_t} on S converges weakly to π, where µ_t is defined by

    µ_t(A) = (1/t) Σ_{k=1}^t 1_A(i_k, Z_k)

for all Borel-measurable subsets A of S, and 1_A denotes the indicator function for the set A.

Let E_π denote expectation with respect to the stationary distribution P_π of the Markov chain {(i_t, Z_t)} with initial distribution π.

Theorem 3. Under Assumption 1, G* = E_π[Z_0 ψ(i_0, i_1)'], and with stepsize γ_t = 1/(t+1), for each initial condition (z_0, G_0), G_t → G* a.s.

Theorem 3 implies that for each initial condition, the sequence {r̂_t} computed by the off-policy LSTD(λ) algorithm converges almost surely to the solution r* of the projected Bellman equation (3) when the latter exists and is unique.
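The contrast between the possible unboundedness of Z_t (Prop. 2) and the almost sure convergence asserted by Theorem 3 can be observed numerically. The sketch below is illustrative only; it assumes a trajectory, importance ratios, and costs simulated under the behavior policy (in the same format as the earlier hypothetical helpers), and it tracks the running maximum of ‖Z_t‖ alongside the LSTD(λ) solution r̂_t.

```python
import numpy as np

def track_iterates(traj, costs, ratios, Phi, lam, alpha, every=10_000):
    """Record (t, max_k<=t ||Z_k||_inf, r_hat_t) at regular intervals along one trajectory.
    Inputs as in the earlier sketch of Eqs. (4)-(6); purely illustrative."""
    n_r = Phi.shape[1]
    Z = Phi[traj[0]].astype(float)
    b, C = np.zeros(n_r), np.zeros((n_r, n_r))
    z_max, snapshots = 0.0, []
    for t in range(1, len(traj) - 1):
        Z = lam * alpha * ratios[t - 1] * Z + Phi[traj[t]]
        step = 1.0 / (t + 1)
        b = (1 - step) * b + step * Z * costs[t]
        C = (1 - step) * C + step * np.outer(Z, alpha * ratios[t] * Phi[traj[t + 1]] - Phi[traj[t]])
        z_max = max(z_max, np.linalg.norm(Z, ord=np.inf))
        if t % every == 0:
            # In an amplifying off-policy setup, z_max typically keeps growing, while
            # r_hat_t (and b_t, C_t) settle down, in line with Theorems 1-3.
            # (Assumes C_t is invertible at the recorded times.)
            snapshots.append((t, z_max, np.linalg.solve(C, -b)))
    return snapshots
```

Such a plot of z_max versus the stabilizing r̂_t is one way to visualize why the analysis cannot rely on boundedness of Z_t.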
4. Details of Analysis

Due to space limit, we give the proof for Theorem 1 on the convergence of the LSTD(λ) iterates b_t, C_t in the first mean, and a partial proof for Theorem 3 on the almost sure convergence of LSTD(λ). Complete proofs for all theorems in Section 3 can be found in (Yu, 2010).

We denote by L^t_s the product of ratios of transition probabilities along a segment (i_s, i_{s+1}, ..., i_t) of the state trajectory:

    L^t_s = (q_{i_s i_{s+1}} / p_{i_s i_{s+1}}) · (q_{i_{s+1} i_{s+2}} / p_{i_{s+1} i_{s+2}}) ··· (q_{i_{t-1} i_t} / p_{i_{t-1} i_t}),    (21)

and define L^s_s = 1. Note that for s ≤ t, E[L^t_s | i_s] = 1.

4.1. Proof of Theorem 1

To show the first part of Theorem 1, we have by Eqs. (10) and (21),

    Z_t = β^t L^t_0 z_0 + Σ_{m=0}^{t-1} β^m L^t_{t-m} φ(i_{t-m}),

so, with c = max{‖z_0‖, max_i ‖φ(i)‖},

    E‖Z_t‖ ≤ c E[β^t L^t_0] + c Σ_{m=0}^{t-1} β^m E[L^t_{t-m}] = c Σ_{m=0}^{t} β^m ≤ c / (1 − β).

To prove the second part of the theorem on the convergence of G_t to G* in the first mean, we first consider another process (Z̃_{t,T}, G_{t,T}) on the same probability space, and apply the LLN for a finite space irreducible Markov chain to G_{t,T}. We then relate (Z̃_{t,T}, G_{t,T}) to (Z_t, G_t).

In particular, for a positive integer T, define Z̃_{t,T} = Z_t for t ≤ T, and define for t > T,

    Z̃_{t,T} = φ(i_t) + β L^t_{t-1} φ(i_{t-1}) + ··· + β^T L^t_{t-T} φ(i_{t-T});    (22)
    G_{t,T} = (1 − γ_t) G_{t-1,T} + γ_t Z̃_{t,T} ψ(i_t, i_{t+1})',  t ≥ 1;   G_{0,T} = G_0.    (23)

Note that for t ≤ T, G_{t,T} = G_t because Z̃_{t,T} and Z_t coincide. It is straightforward to show that {G_{t,T}} converges almost surely to a constant G*_T related to G*. By construction {Z̃_{t,T}} is bounded. Furthermore, if we consider the finite space Markov chain {X_t} with X_t = (i_{t-T}, i_{t-T+1}, ..., i_t, i_{t+1}), then for t > T, Z̃_{t,T} ψ(i_t, i_{t+1})' is a function of X_t. Denote this function by f. Since G_{T,T} takes values in a finite set (whose size depends on T), an application of the LLN and stochastic approximation theory [see e.g., (Borkar, 2008), Chap. 6, Theorem 7 and Cor. 8] shows that under the stepsize condition in Assumption 2, G_{t,T} converges a.s. to E_0[f(X_{T+1})], the expectation of f(X_{T+1}) under the stationary distribution of the Markov chain {X_t} (equivalently, that of the chain {i_t}):

    G_{t,T} → G*_T = E_0[ Z̃_{T+1,T} ψ(i_{T+1}, i_{T+2})' ] = Φ'Ξ Σ_{m=0}^{T} β^m Q^m Ψ̄   a.s.    (24)

We now relate (Z_t, G_t) to (Z̃_{t,T}, G_{t,T}). First we bound E‖Z_t − Z̃_{t,T}‖. By definition Z_t − Z̃_{t,T} = 0 for t ≤ T. For t ≥ T + 1, similarly to bounding E‖Z_t‖, we have with c = max{‖z_0‖, max_i ‖φ(i)‖},

    E‖Z_t − Z̃_{t,T}‖ = E‖ β^t L^t_0 z_0 + Σ_{m=T+1}^{t-1} β^m L^t_{t-m} φ(i_{t-m}) ‖ ≤ c E[β^t L^t_0] + c Σ_{m=T+1}^{t-1} β^m E[L^t_{t-m}] ≤ c β^{T+1} / (1 − β).    (25)

Next we bound E‖G_t − G_{t,T}‖.
By the definition of G_t and G_{t,T},

    G_t − G_{t,T} = (1 − γ_t)(G_{t-1} − G_{t-1,T}) + γ_t (Z_t − Z̃_{t,T}) ψ(i_t, i_{t+1})'.

Consequently, with c̃ = max_{i,j} ‖ψ(i, j)‖,

    E‖G_t − G_{t,T}‖ ≤ (1 − γ_t) E‖G_{t-1} − G_{t-1,T}‖ + γ_t c̃ E‖Z_t − Z̃_{t,T}‖ ≤ (1 − γ_t) E‖G_{t-1} − G_{t-1,T}‖ + γ_t δ_T,    (26)

where the last inequality follows from Eq. (25), and for some constant c,

    δ_T = c β^{T+1} / (1 − β) → 0, as T → ∞.    (27)

Since γ_t ∈ (0, 1] and G_t − G_{t,T} = 0 for t ≤ T, Eq. (26) implies

    sup_t E‖G_t − G_{t,T}‖ ≤ δ_T.    (28)

We now bound E‖G_{t,T} − G*_T‖. By Eq. (24), ‖G_{t,T} − G*_T‖ → 0 a.s. By the construction of G_{t,T} and the fact γ_t ∈ (0, 1], for some deterministic constant c_T depending on T, ‖G_{t,T}‖ ≤ c_T for all t. Therefore, by the Lebesgue bounded convergence theorem,

    lim_{t→∞} E‖G_{t,T} − G*_T‖ = 0.    (29)

Combining Eqs. (28) and (29), we have

    lim sup_t E‖G_t − G*‖ ≤ lim sup_t E‖G_t − G_{t,T}‖ + lim_t E‖G_{t,T} − G*_T‖ + ‖G*_T − G*‖ ≤ δ_T + 0 + δ̃_T,    (30)

where δ̃_T = ‖G*_T − G*‖, and δ̃_T → 0 as T → ∞, as can be seen from the definitions of G* and G*_T, Eqs. (13) and (24). Letting T go to ∞ in the right-hand side of (30) and using also Eq. (27), it follows that lim sup_t E‖G_t − G*‖ = 0. This completes the proof.

4.2. Proof of Theorem 3

We need the following lemma, which also plays a key role in establishing Theorem 2.

Lemma 1. Let Y_t = β L^t_{t-1} Y_{t-1} with Y_0 = y_0 ∈ ℝ^m and β < 1. Then the sequence of nonnegative scalar random variables β^t L^t_0 → 0 a.s., and Y_t = β^t L^t_0 y_0 → 0 a.s.

Proof. From the definition of Y_t and L^t_s [cf. Eq. (21)], Y_t = β^t L^t_{t-1} L^{t-1}_{t-2} ··· L^1_0 y_0 = β^t L^t_0 y_0. Consider the nonnegative sequence X_t = β^t L^t_0 with X_0 = 1. We have

    X_t = β L^t_{t-1} X_{t-1},   E[X_t | F_{t-1}] = β X_{t-1} ≤ X_{t-1},

where F_{t-1} = σ(i_s, s ≤ t − 1) is the σ-field generated by i_s, s ≤ t − 1. This implies that {(X_t, F_t)} is a nonnegative supermartingale. Since E X_0 = 1 < ∞, by a martingale convergence theorem [see (Breiman, 1992), Theorem 5.14 and its proof], X_t → X a.s., a nonnegative random variable with E X ≤ lim inf_t E X_t. Since β < 1, E X_t = β^t → 0 as t → ∞. Therefore X = 0 a.s., implying X_t → 0 a.s. and Y_t → 0 a.s.

By Theorem 2 the Markov chain {(i_t, Z_t)} has a unique invariant probability measure π. It can be shown that under the stationary distribution P_π of {(i_t, Z_t)} with initial distribution π, E_π‖Z_0 ψ(i_0, i_1)'‖ < ∞ [(Yu, 2010), Prop. 5.3]. We can now prove Theorem 3, which states that with γ_t = 1/(t+1), for each initial condition (z_0, G_0), G_t → G* = E_π[Z_0 ψ(i_0, i_1)'] a.s.

Fix G_0, and consider an initial condition (z_0, G_0) for any z_0. Consider the sequence {G_t} corresponding to γ_t = 1/(t+1), and a related sequence {Ḡ_t} given below, with Z_0 = z_0:

    G_t = (1/(t+1)) Σ_{k=1}^t Z_k ψ(i_k, i_{k+1})' + (1/(t+1)) G_0,    Ḡ_t = (1/(t+1)) Σ_{k=0}^t Z_k ψ(i_k, i_{k+1})'.

Since G_0/(t+1) → 0 and Z_0 ψ(i_0, i_1)'/(t+1) → 0 as t → ∞, the convergence of {G_t} on a sample path is equivalent to that of {Ḡ_t}, which does not depend on G_0. Since E_π‖Z_0 ψ(i_0, i_1)'‖ < ∞, applying the LLN [see (Meyn & Tweedie, 2009), Theorem 17.1.2; (Doob, 1953), Theorem 2.1] to the stationary Markov process {(i_t, Z_t, i_{t+1})} under P_π, and using also the second part of Theorem 1, it can be shown that for each initial condition x = (ī, z̄) from a measurable set F with π(F) = 1, Ḡ_t → Ḡ a.s., and Ḡ = E_π[Z_0 ψ(i_0, i_1)'] = G*. Hence G_t → G* a.s. for each initial condition x ∈ F.

We now show that for any initial condition x̂ ∉ F, the corresponding Ḡ_t also converges almost surely to G*. Let x̂ = (ī, ẑ). Since {i_t} is irreducible, π({ī} × ℝ^{n_r}) > 0. We also have π(F) = 1, so there exists x̄ = (ī, z̄) ∈ F for some z̄ ∈ ℝ^{n_r}. Let δ = ẑ − z̄. Consider the two processes (Z_t, Ḡ_t) and (Ẑ_t, Ĝ_t) corresponding to the two initial conditions x̄ ∈ F and x̂ ∉ F, respectively, and for the same path of {i_t}. By Lemma 1, we have

    Ẑ_t − Z_t = β^t L^t_0 δ,    β^t L^t_0 → 0 a.s.
The second relation implies also (1/(t+1)) Σ_{k=1}^t β^k L^k_0 → 0 a.s. Therefore, with c̃ = max_{i,j} ‖ψ(i, j)‖, we have

    ‖Ĝ_t − Ḡ_t‖ = ‖ (1/(t+1)) Σ_{k=1}^t (Ẑ_k − Z_k) ψ(i_k, i_{k+1})' ‖ ≤ c̃ ‖δ‖ (1/(t+1)) Σ_{k=1}^t β^k L^k_0 → 0 a.s.

We also have Ḡ_t → G* a.s. (because its initial condition x̄ ∈ F); therefore Ĝ_t → G* a.s.

Thus for any initial condition (ī, z̄) and G_0, G_t → G* a.s. Since the space of i_0 is finite, this implies that for any initial distribution of i_0 and any initial (z̄, G_0), G_t → G* a.s. The proof is complete.

Acknowledgments

I thank Prof. Dimitri Bertsekas, Dr. Dario Gasbarra, and the anonymous reviewers for their helpful feedback. This work is supported in part by Academy of Finland Grant 118653 (ALGODAN) and by PASCAL IST-2002-506778.

References

Ahamed, T. P., Borkar, V. S., and Juneja, S. Adaptive importance sampling technique for Markov chains using stochastic approximation. Operations Research, 54:489-504, 2006.

Bertsekas, D. P. Dynamic Programming and Optimal Control, volume II. Athena Scientific, Belmont, MA, third edition, 2007.

Bertsekas, D. P. Projected equations, variational inequalities, and temporal difference methods. IEEE Trans. Automat. Contr., 2009. To appear.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.

Bertsekas, D. P. and Yu, H. Projected equation methods for approximate solution of large linear systems. J. Computational and Applied Mathematics, 227(1):27-50, 2009.

Borkar, V. S. Stochastic Approximation: A Dynamic Viewpoint. Hindustan Book Agency, New Delhi, 2008.

Boyan, J. A. Least-squares temporal difference learning. In Proc. of the 16th ICML, pp. 49-56, 1999.

Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(2):33-57, 1996.

Breiman, L. Probability. SIAM, Philadelphia, PA, 1992.

Doob, J. L. Stochastic Processes. John Wiley & Sons, New York, 1953.

Glynn, P. W. and Iglehart, D. L. Importance sampling for stochastic simulations. Management Science, 35:1367-1392, 1989.

Kushner, H. J. and Yin, G. G. Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, New York, 2nd edition, 2003.

Meyn, S. Control Techniques for Complex Networks. Cambridge University Press, Cambridge, UK, 2007.

Meyn, S. and Tweedie, R. L. Markov Chains and Stochastic Stability. Cambridge University Press, Cambridge, UK, 2nd edition, 2009.

Nedić, A. and Bertsekas, D. P. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dyn. Syst., 13:79-110, 2003.

Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal-difference learning with function approximation. In Proc. of the 18th ICML, pp. 417-424, 2001.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.

Sutton, R. S. and Barto, A. G. Reinforcement Learning. MIT Press, Cambridge, MA, 1998.

Tsitsiklis, J. N. and Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Contr., 42(5):674-690, 1997.

Yao, H. S. and Liu, Z. Q. Preconditioned temporal difference learning. In Proc. of the 25th ICML, pp. 1208-1215, 2008.

Yu, H. Convergence of least squares temporal difference methods under general conditions. Tech. Report C-2010-1, Dept. of Computer Science, University of Helsinki, 2010.