A Worst-Case Comparison between Temporal Difference and Residual Gradient with Linear Function Approximation

Lihong Li (lihong@cs.rutgers.edu)
Department of Computer Science, Rutgers University, 110 Frelinghuysen Road, Piscataway, NJ, USA 08854

Abstract

Residual gradient (RG) was proposed as an alternative to TD(0) for policy evaluation when function approximation is used, but there exists little formal analysis comparing them except in very limited cases. This paper employs techniques from online learning of linear functions and provides a worst-case (non-probabilistic) analysis to compare these two types of algorithms when linear function approximation is used. No statistical assumptions are made on the sequence of observations, so the analysis applies to non-Markovian and even adversarial domains as well. In particular, our results suggest that RG may result in smaller temporal differences, while TD(0) is more likely to yield smaller prediction errors. These phenomena can be observed even in two simple Markov chain examples that are non-adversarial.

1. Introduction

Reinforcement learning (RL) is a learning paradigm for optimal sequential decision making (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998) and has been successfully applied to a number of challenging problems.
In the RL framework, the agent interacts with the environment in discrete timesteps by repeatedly observing its current state, taking an action, receiving a real-valued reward, and transitioning to a next state. A policy is a function that maps states to actions; semantically, it specifies what action to take given the current state. The goal of an agent is to optimize its policy in order to maximize the expected long-term return, namely, the discounted sum of rewards it receives by following the policy.

An important step in this optimization process is policy evaluation: the problem of evaluating the expected returns of a fixed policy. This problem is often the most challenging step in approximate policy-iteration algorithms (Bertsekas & Tsitsiklis, 1996; Lagoudakis & Parr, 2003). Temporal difference (TD) is a family of algorithms for policy evaluation (Sutton, 1988) and has received a lot of attention from the community. Unfortunately, it has been observed (e.g., Baird (1995)) that TD methods may diverge when they are combined with function approximation. An alternative algorithm known as residual gradient (RG) was proposed by Baird (1995) and enjoys guaranteed convergence to a local optimum. Since RG is similar to TD(0), a particular instance of the TD family, we will focus on RG, TD(0), and a variant of TD(0) in this paper.

Despite the convergence issues, little is known about how RG and TD(0) compare. Building on previous work on online learning of linear functions (Cesa-Bianchi et al., 1996) and a similar analysis by Schapire and Warmuth (1996), we provide a worst-case (non-probabilistic) analysis of these algorithms and focus on two evaluation metrics: (i) total squared prediction error, and (ii) total squared temporal difference. The former measures accuracy of the predictions, while the latter measures consistency and is closely related to the Bellman error (Sutton & Barto, 1998). Either metric may be preferred over the other in different situations. For instance, Lagoudakis and Parr (2003) argue that TD solutions tend to preserve the shape of the value function and are more suitable for approximate policy iteration, while there is evidence that minimizing squared Bellman errors is more robust in general (Munos, 2003). Our analysis suggests that TD can make more accurate predictions, while RG can result in smaller temporal differences. All terms will be made precise in the next section. Although our theory focuses on worst-case upper bounds, we also provide numerical evidence and expect the resulting insights to give useful guidance to RL practitioners in deciding which algorithm best suits their purposes.

2. Preliminaries

Fully observable environments in RL are often modelled as Markov decision processes (Puterman, 1994), which are equivalent to induced Markov chains when controlled by a fixed policy. Here, however, we consider a different model that is suitable for worst-case analysis, as introduced in the next subsection. This model makes no statistical assumption about the observations, and thus our results apply to much more general situations, including partially observable or adversarial environments that subsume Markov chains.

Some notation is in order. We use bold-face, lower-case letters to denote real-valued column vectors such as $v$; their components are denoted by the corresponding letter with subscripts, such as $v_t$. We use $\|\cdot\|$ to denote the Euclidean, or $\ell_2$, norm: $\|v\| = \sqrt{v^\top v}$, where $v^\top$ is the transpose of $v$. For a square matrix $M$, the set of eigenvalues of $M$, known as the spectrum of $M$, is denoted $\sigma(M)$. If $M$ is symmetric, its eigenvalues must be real, and its largest eigenvalue is denoted $\lambda(M)$.

2.1. The Sequential Online Learning Model

Our learning model is adopted from Schapire and Warmuth (1996) and is an extension of the online-learning model to sequential prediction problems. Let $k$ be the dimension of input vectors. The agent maintains a weight vector of the same dimension and uses it to make predictions. In RL, input vectors are often feature vectors of states or state-action pairs, and are used to approximate value functions (Sutton & Barto, 1998). Learning proceeds in discrete timesteps and terminates after $T$ steps. The agent starts with an initial input vector $x_1 \in \mathbb{R}^k$ and an initial weight vector $w_1 \in \mathbb{R}^k$. At timestep $t \in \{1, 2, 3, \ldots, T\}$:

· The agent makes a prediction $\hat{y}_t = w_t^\top x_t \in \mathbb{R}$, where $w_t$ is the weight vector at time $t$. Throughout the paper, assume $\|x_t\| \le X$ for some known constant $X > 0$.
· The agent then observes an immediate reward $r_t \in \mathbb{R}$ and the next input vector $x_{t+1}$. Based on this information, it updates its weight vector, whose new value is denoted $w_{t+1}$. The change in weight is $\Delta w_t = w_{t+1} - w_t$.

By convention, $r_t = 0$ and $x_t = 0$ for $t > T$. Define the return at time $t$ by $y_t = \sum_{\tau = t}^{\infty} \gamma^{\tau - t} r_\tau$, where $\gamma \in [0, 1)$ is the discount factor. Since $\gamma < 1$, it effectively diminishes future rewards exponentially fast. A quick observation is that $y_t = r_t + \gamma y_{t+1}$, which is analogous to the Bellman equation for Markov chains (Sutton & Barto, 1998). The agent attempts to mimic $y_t$ by its prediction $\hat{y}_t$, and the prediction error is $e_t = y_t - \hat{y}_t$. Our first evaluation metric is the total squared prediction error: $\ell_P = \sum_{t=1}^{T} e_t^2 = \|e\|^2$.

Another useful metric in RL is the temporal difference (also known as the TD error), which measures how consistent the predictions are. In particular, the temporal difference at time $t$ is $d_t = r_t + \gamma w_t^\top x_{t+1} - w_t^\top x_t$, and the total squared temporal difference is $\ell_{TD} = \sum_{t=1}^{T} d_t^2 = \|d\|^2$.
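To make the protocol and the two loss functions concrete, the following minimal sketch (ours, not part of the paper; the helper names `returns` and `losses` are hypothetical) computes $y_t$, $e_t$, $d_t$, $\ell_P$, and $\ell_{TD}$ for an arbitrary observation sequence $(x_t, r_t)$ and an arbitrary sequence of weight vectors.

```python
import numpy as np

def returns(rewards, gamma):
    """Discounted returns y_t = sum_{tau >= t} gamma^(tau - t) r_tau for a finite sequence."""
    y = np.zeros(len(rewards) + 1)            # y_{T+1} = 0, by the convention r_t = 0, x_t = 0 for t > T
    for t in reversed(range(len(rewards))):
        y[t] = rewards[t] + gamma * y[t + 1]  # y_t = r_t + gamma * y_{t+1}
    return y[:-1]

def losses(xs, rewards, weights, gamma):
    """Total squared prediction error l_P and total squared temporal difference l_TD
    for a given sequence of weight vectors w_1, ..., w_T (one per timestep)."""
    T = len(rewards)
    xs = np.vstack([xs, np.zeros(xs.shape[1])])       # append x_{T+1} = 0
    y = returns(rewards, gamma)
    e = np.array([y[t] - weights[t] @ xs[t] for t in range(T)])                      # e_t = y_t - w_t^T x_t
    d = np.array([rewards[t] + gamma * weights[t] @ xs[t + 1] - weights[t] @ xs[t]   # TD error d_t
                  for t in range(T)])
    return float(e @ e), float(d @ d)                 # (l_P, l_TD)
```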
2.2. Previous Work

Previous convergence results for TD and RG often rely heavily on certain stochastic assumptions about the environment, such as the assumption that the sequence of observations, $[(x_t, r_t)]_{t \in \mathbb{N}}$, is generated by an irreducible and aperiodic Markov chain. Tsitsiklis and Van Roy (1997) first proved convergence of TD with linear function approximation, while they also pointed out the potential divergence risk when nonlinear approximation is used. To resolve the instability issue of TD(0), Baird (1995) proposed the RG algorithm, but also noted that RG may converge more slowly than TD(0) in some problems. Such an observation was later proved by Schoknecht and Merke (2003), who used spectral analysis to compare the asymptotic convergence rates of the two algorithms. Although their results are interesting, they only apply to quite limited cases where, for example, a certain matrix associated with TD updates has real eigenvalues only (which does not hold in general). More importantly, they study synchronous updates, while TD and RG are often applied asynchronously in practice. Furthermore, their results assume that the value function is represented by a lookup table, but the initial motivation for studying RG was to develop a provably convergent algorithm when function approximation is used.

Schapire and Warmuth (1996) were also concerned with similar worst-case behavior of TD-like algorithms within the model described in Subsection 2.1. They defined a new class of algorithms called TD*(λ), which is very similar to the TD(λ) algorithms of Sutton (1988). They developed worst-case bounds for the total squared prediction error of TD*(λ), but not the total squared temporal difference.

2.3. Algorithms

The algorithms we consider all update the weight vector incrementally and differ only in the update rules. TD(0) uses the following rule:
$$\Delta w_t = \alpha\, d_t\, x_t, \qquad (1)$$
where $\alpha \in (0, 1)$ is the step-size parameter controlling aggressiveness of the update. Although TD(0) is widely used in practice, analysis turns out to be easier with a close relative of it, TD*(0). This algorithm differs from TD(0) in that it adapts the step-size based on the input vectors (Schapire and Warmuth (1996) defined TD*(0) in a different, but equivalent, form):
$$\Delta w_t = \frac{\alpha\, d_t\, x_t}{1 - \alpha\gamma\, x_t^\top x_{t+1}}. \qquad (2)$$
Due to space limitations, we only provide results for TD*(0), but similar results hold for TD(0). It is expected, and also supported by the numerical evidence in Section 4, that TD(0) and TD*(0) have similar behavior and performance in practice. For this reason, we refer to both algorithms as TD in the rest of the paper if there is no risk of confusion. In contrast, RG uses the following update rule:
$$\Delta w_t = \alpha\, d_t\, (x_t - \gamma x_{t+1}). \qquad (3)$$
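For reference, the three update rules (Eqns 1-3) can be transcribed directly as online updates. This is a sketch in our own notation (the function names are ours), assuming $x_{t+1} = 0$ at $t = T$ as in the model above.

```python
import numpy as np

def td0_update(w, x, x_next, r, alpha, gamma):
    """TD(0): Delta w_t = alpha * d_t * x_t  (Eqn 1)."""
    d = r + gamma * w @ x_next - w @ x
    return w + alpha * d * x

def td_star0_update(w, x, x_next, r, alpha, gamma):
    """TD*(0): Delta w_t = alpha * d_t * x_t / (1 - alpha * gamma * x_t^T x_{t+1})  (Eqn 2)."""
    d = r + gamma * w @ x_next - w @ x
    return w + alpha * d * x / (1.0 - alpha * gamma * x @ x_next)

def rg_update(w, x, x_next, r, alpha, gamma):
    """Residual gradient: Delta w_t = alpha * d_t * (x_t - gamma * x_{t+1})  (Eqn 3)."""
    d = r + gamma * w @ x_next - w @ x
    return w + alpha * d * (x - gamma * x_next)
```

All three updates move the weights in a direction proportional to the current TD error; they differ only in the direction vector and in how the effective step-size depends on the inputs.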
3. Main Results

This section contains the main theoretical results. We will first describe how to evaluate an algorithm in the worst-case scenario. For completeness, we also summarize the squared prediction error bounds for TD*(0) due to Schapire and Warmuth (1996). Then, we analyze total squared temporal difference bounds and RG. Our analysis makes a few uses of matrix theory (see, e.g., Horn and Johnson (1986)), and several technical lemmas are found in the appendix. Two basic facts about $\lambda(M)$ will be used repeatedly: (i) if $M$ is negative definite, then $\lambda(M) < 0$; and (ii) the Rayleigh-Ritz theorem (Horn & Johnson, 1986, Theorem 4.2.2) states that $\lambda(M) = \max_{v \ne 0} \frac{v^\top M v}{v^\top v}$.

3.1. Evaluation Criterion

Analogous to other online-learning analyses, we treat $\ell_P$ and $\ell_{TD}$ as total losses, and compare the total loss of an algorithm to that of an arbitrary weight vector, $u$. We wish to prove that this difference is small for all $u$, including the optimal (in any well-defined sense) but unknown vector $u^*$. The prediction using vector $u$ at time $t$ is $y_t^u = u^\top x_t$. Accordingly, the prediction error and temporal difference at time $t$ are $e_t^u = y_t - y_t^u$ and $d_t^u = r_t + \gamma u^\top x_{t+1} - u^\top x_t$, respectively. The total squared prediction error and total squared temporal difference of $u$ are
$$\ell_P^u = \|e^u\|^2 = \sum_{t=1}^{T} \left(y_t - u^\top x_t\right)^2 \qquad\text{and}\qquad \ell_{TD}^u = \|d^u\|^2 = \sum_{t=1}^{T} \left(r_t + \gamma u^\top x_{t+1} - u^\top x_t\right)^2,$$
respectively.

3.2. Squared Prediction Errors of TD*(0)

Using step-size $\alpha = \frac{1}{X^2 + 1}$, Schapire and Warmuth (1996) showed a worst-case upper bound:
$$\ell_P \le \frac{\left(1 + X^2\right)\left(\ell_P^u + \|w_1 - u\|^2\right)}{1 - \gamma^2}.$$
Furthermore, if $E$ and $W$ are known beforehand such that $\ell_P^u \le E$ and $\|w_1 - u\| \le W$, then the step-size can be optimized by $\alpha = \frac{W}{X\left(\sqrt{E} + XW\right)}$ to yield an asymptotically better bound:
$$\ell_P \le \frac{\ell_P^u + 2WX\sqrt{E} + X^2 W^2}{1 - \gamma^2}. \qquad (4)$$

3.3. Squared Temporal Differences of TD*(0)

We will extend the analysis of Schapire and Warmuth (1996) to the new loss function $\ell_{TD}$ by examining how the potential function, $\|w_t - u\|^2$, evolves when a single update is made at time $t$. It can be shown (Schapire & Warmuth, 1996, Eqn 8) that
$$-\|w_1 - u\|^2 \le \alpha^2 X^2\, e^\top D^\top D\, e + 2\alpha\, e^\top D^\top \left(e^u - e\right),$$
where
$$D = \begin{pmatrix} 1 & -\gamma & 0 & \cdots & 0 & 0 \\ 0 & 1 & -\gamma & \cdots & 0 & 0 \\ & & & \ddots & & \\ 0 & 0 & 0 & \cdots & 1 & -\gamma \\ 0 & 0 & 0 & \cdots & 0 & 1 \end{pmatrix}. \qquad (5)$$
Define $f = De$. According to Lemma A.1(1), $d^u = D e^u$, and hence the inequality above is rewritten as:
$$-\|w_1 - u\|^2 \le \alpha^2 X^2\, f^\top f - 2\alpha\, f^\top D^{-1} f + 2\alpha\, f^\top D^{-1} d^u.$$
Using the fact that $2 p^\top q \le \|p\|^2 + \|q\|^2$ for $p = \frac{\alpha}{\sqrt{b}} D^{-\top} f$, $q = \sqrt{b}\, d^u$, and arbitrary $b > 0$, the inequality becomes
$$-\|w_1 - u\|^2 \le \alpha\, f^\top M_1 f + b\, \ell_{TD}^u,$$
where
$$M_1 = \alpha X^2 I + \frac{\alpha}{b} D^{-1} D^{-\top} - \left(D^{-1} + D^{-\top}\right) \qquad (6)$$
is a symmetric matrix. Since $\lambda(M_1)$ is the largest eigenvalue of $M_1$, we have $f^\top M_1 f \le \lambda(M_1)\,\|f\|^2$, and hence
$$-\|w_1 - u\|^2 \le \alpha\, \lambda(M_1)\, \|f\|^2 + b\, \ell_{TD}^u.$$
Combining this with Lemma A.4, we have that $\|f\|^2$ is at most
$$(1+\gamma)^2 \left(X^2 + \frac{1}{b(1-\gamma)^2}\right)\left(b\, \ell_{TD}^u + \|w_1 - u\|^2\right)$$
when the step-size is
$$\alpha = \frac{1}{(1+\gamma)\left(X^2 + \frac{1}{b(1-\gamma)^2}\right)}. \qquad (7)$$
Due to Lemma A.1(2), we have
$$d_t^2 = \left(1 - \alpha\gamma\, x_t^\top x_{t+1}\right)^2 f_t^2 \le \left(1 + \alpha\gamma X^2\right)^2 f_t^2 \le \frac{(1+2\gamma)^2}{(1+\gamma)^2}\, f_t^2.$$
Therefore, $\ell_{TD}$ is at most
$$(1+2\gamma)^2 \left(X^2 + \frac{1}{b(1-\gamma)^2}\right)\left(b\, \ell_{TD}^u + \|w_1 - u\|^2\right).$$
Using $b = 1$, we have thus proved the first main result.

Theorem 3.1. Let $\alpha$ be given by Eqn 7 using $b = 1$; then the following holds for TD*(0):
$$\ell_{TD} \le (1+2\gamma)^2 \left(X^2 + \frac{1}{(1-\gamma)^2}\right)\left(\ell_{TD}^u + \|w_1 - u\|^2\right).$$

Theorem 3.2. If $E$ and $W$ are known beforehand such that $\ell_{TD}^u \le E$ and $\|w_1 - u\| \le W$, then $\alpha$ can be optimized in TD*(0) so that
$$\ell_{TD} \le (1+2\gamma)^2 \left(\frac{\ell_{TD}^u}{(1-\gamma)^2} + \frac{2XW\sqrt{E}}{1-\gamma} + X^2 \|w_1 - u\|^2\right). \qquad (8)$$

Proof. The previous analysis for Theorem 3.1 yields
$$\ell_{TD} \le (1+2\gamma)^2 \left(X^2 + \frac{1}{b(1-\gamma)^2}\right)\left(b\, \ell_{TD}^u + \|w_1 - u\|^2\right) \le (1+2\gamma)^2 \left(\frac{\ell_{TD}^u}{(1-\gamma)^2} + X^2 \|w_1 - u\|^2 + b X^2 E + \frac{W^2}{b(1-\gamma)^2}\right)$$
for any $b > 0$. We may simply choose $b = \frac{W}{X(1-\gamma)\sqrt{E}}$, and the step-size in Eqn 7 becomes
$$\alpha = \frac{1}{(1+\gamma)\left(X^2 + \frac{X\sqrt{E}}{W(1-\gamma)}\right)}.$$
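The two identities from Lemma A.1 used above ($d^u = D e^u$ and the TD*(0) relation between TD errors and prediction errors) are easy to confirm numerically. The following sketch is ours (not the paper's) and uses standard normal features purely as an illustrative assumption.

```python
import numpy as np

def check_identities(T=6, k=3, gamma=0.9, alpha=0.05, seed=0):
    """Sanity-check d^u = D e^u (Lemma A.1(1)) and the TD*(0) relation
    d_t = (1 - alpha*gamma*x_t^T x_{t+1}) * (e_t - gamma*e_{t+1}) (Lemma A.1(2))."""
    rng = np.random.default_rng(seed)
    xs = np.vstack([rng.normal(size=(T, k)), np.zeros(k)])   # x_1..x_T and x_{T+1} = 0
    rs = rng.normal(size=T)
    u = rng.normal(size=k)

    y = np.zeros(T + 1)                                      # returns, with y_{T+1} = 0
    for t in reversed(range(T)):
        y[t] = rs[t] + gamma * y[t + 1]

    e_u = np.array([y[t] - u @ xs[t] for t in range(T)])
    d_u = np.array([rs[t] + gamma * u @ xs[t + 1] - u @ xs[t] for t in range(T)])
    D = np.eye(T) - gamma * np.eye(T, k=1)                   # the matrix of Eqn 5
    assert np.allclose(d_u, D @ e_u)                         # Lemma A.1(1)

    w = rng.normal(size=k)                                   # run TD*(0) and check Lemma A.1(2)
    for t in range(T):
        d_t = rs[t] + gamma * w @ xs[t + 1] - w @ xs[t]
        c = 1.0 - alpha * gamma * xs[t] @ xs[t + 1]
        w_next = w + alpha * d_t * xs[t] / c                 # TD*(0) update (Eqn 2)
        e_t = y[t] - w @ xs[t]
        e_next = y[t + 1] - w_next @ xs[t + 1]               # e_{t+1} is measured after the update
        assert np.isclose(d_t, c * (e_t - gamma * e_next))   # Lemma A.1(2)
        w = w_next

check_identities()
```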
3.4. Squared Prediction Errors of RG

Similar to the previous section, we use the potential function $\|w_t - u\|^2$ to measure the progress of learning. By the update rule in Eqn 3 and simple algebra,
$$\Delta w_t^\top (w_t - u) = \alpha\, d_t\, (x_t - \gamma x_{t+1})^\top (w_t - u) = \alpha\, d_t \left[\left(w_t^\top x_t - \gamma w_t^\top x_{t+1} - r_t\right) - \left(u^\top x_t - \gamma u^\top x_{t+1} - r_t\right)\right] = \alpha\, d_t\, (d_t^u - d_t),$$
and
$$\|\Delta w_t\|^2 = \alpha^2 d_t^2\, \|x_t - \gamma x_{t+1}\|^2 \le \alpha^2 d_t^2\, X^2 (1+\gamma)^2.$$
Therefore,
$$-\|w_1 - u\|^2 \le \sum_{t=1}^{T}\left(\|w_{t+1} - u\|^2 - \|w_t - u\|^2\right) = \sum_{t=1}^{T}\left(2\Delta w_t^\top (w_t - u) + \|\Delta w_t\|^2\right) \le 2\alpha\, d^\top d^u + \left(\alpha^2 X^2 (1+\gamma)^2 - 2\alpha\right)\|d\|^2.$$
According to Lemma A.1(1) and using the fact that $2 p^\top q \le \|p\|^2 + \|q\|^2$ for $p = \frac{\alpha}{\sqrt{b}} D^\top d$, $q = \sqrt{b}\, e^u$, and arbitrary $b > 0$, the inequality above is written as:
$$-\|w_1 - u\|^2 \le b\, \|e^u\|^2 + \frac{\alpha^2}{b}\, d^\top D D^\top d + \left(\alpha^2 X^2 (1+\gamma)^2 - 2\alpha\right)\|d\|^2.$$
Due to Lemma A.1(3), $d = \Lambda D e$, where
$$\Lambda = \mathrm{diag}\left(\frac{1}{1 + \alpha\gamma\, (x_1 - \gamma x_2)^\top x_2},\ \frac{1}{1 + \alpha\gamma\, (x_2 - \gamma x_3)^\top x_3},\ \cdots,\ \frac{1}{1 + \alpha\gamma\, (x_{T-1} - \gamma x_T)^\top x_T},\ 1\right).$$
Then, the inequality above becomes:
$$-\|w_1 - u\|^2 \le b\, \|e^u\|^2 + e^\top M_2 e, \qquad (9)$$
where
$$M_2 = D^\top \Lambda \left(\frac{\alpha^2}{b} D D^\top + \left(\alpha^2 X^2 (1+\gamma)^2 - 2\alpha\right) I\right) \Lambda D. \qquad (10)$$
Since $e^\top M_2 e \le \lambda(M_2)\,\|e\|^2$, Lemma A.5 implies the following theorems when the step-size is
$$\alpha = \frac{1}{(1+\gamma)^2\left(X^2 + \frac{1}{b}\right)}. \qquad (11)$$

Theorem 3.3. Let $\alpha$ be given by Eqn 11 using $b = 1$; then the following holds for RG:
$$\ell_P \le \frac{\left(X^2 + 1\right)(1+2\gamma)^2}{(1-\gamma)^2}\left(\ell_P^u + \|w_1 - u\|^2\right).$$

Theorem 3.4. If $E$ and $W$ are known beforehand such that $\ell_P^u \le E$ and $\|w_1 - u\| \le W$, then $\alpha$ can be optimized in RG so that
$$\ell_P \le \frac{(1+2\gamma)^2}{(1-\gamma)^2}\left(\ell_P^u + 2XW\sqrt{E} + X^2 \|w_1 - u\|^2\right). \qquad (12)$$

Proof. The previous analysis in this subsection yields
$$\ell_P \le \frac{(1+2\gamma)^2}{(1-\gamma)^2}\left(X^2 + \frac{1}{b}\right)\left(b\, \ell_P^u + \|w_1 - u\|^2\right) \le \frac{(1+2\gamma)^2}{(1-\gamma)^2}\left(\ell_P^u + X^2 \|w_1 - u\|^2 + b X^2 E + \frac{W^2}{b}\right)$$
for any $b > 0$. We simply choose $b = \frac{W}{X\sqrt{E}}$, and accordingly the step-size in Eqn 11 becomes
$$\alpha = \frac{1}{(1+\gamma)^2\left(X^2 + \frac{X\sqrt{E}}{W}\right)}.$$

3.5. Squared Temporal Differences of RG

It is most convenient to turn this problem into one of analyzing the total squared prediction error in the original online-learning-of-linear-functions framework (Cesa-Bianchi et al., 1996). In particular, define $z_t = x_t - \gamma x_{t+1}$, and thus $\|z_t\| \le (1+\gamma)X$. Now, RG can be viewed as a gradient-descent algorithm operating over the sequence of data $[(z_t, r_t)]_{t \in \{1, 2, \cdots, T\}}$. Due to Theorem IV.1 of Cesa-Bianchi et al. (1996), we immediately have
$$\ell_{TD} \le 2.25\left(\ell_{TD}^u + X^2 (1+\gamma)^2 \|u - w_1\|^2\right)$$
for any $u$ when the step-size is $\alpha = \frac{2}{3 X^2 (1+\gamma)^2}$. If $E$ and $W$ are known beforehand so that $\ell_{TD}^u \le E$ and $\|u - w_1\| \le W$, then $\alpha$ can be optimized (Theorem IV.3 of Cesa-Bianchi et al. (1996)) by $\alpha = \frac{W}{X(1+\gamma)\left(WX(1+\gamma) + \sqrt{E}\right)}$ to obtain the following improved bound:
$$\ell_{TD} \le \ell_{TD}^u + 2WX(1+\gamma)\sqrt{E} + (1+\gamma)^2 W^2 X^2. \qquad (13)$$
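The reduction used in Section 3.5 is easy to verify directly: on the transformed instance $z_t = x_t - \gamma x_{t+1}$ with target $r_t$, one step of ordinary gradient descent on the squared loss coincides with the RG update of Eqn 3. A small sketch of this check (ours, with hypothetical function names):

```python
import numpy as np

def rg_update(w, x, x_next, r, alpha, gamma):
    d = r + gamma * w @ x_next - w @ x              # TD error
    return w + alpha * d * (x - gamma * x_next)     # Eqn 3

def gd_update(w, z, r, alpha):
    """One gradient-descent step on the squared loss (r - w^T z)^2 / 2."""
    return w + alpha * (r - w @ z) * z

rng = np.random.default_rng(1)
w, x, x_next = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
r, alpha, gamma = 0.7, 0.05, 0.9
z = x - gamma * x_next
assert np.allclose(rg_update(w, x, x_next, r, alpha, gamma), gd_update(w, z, r, alpha))
```

Since $r_t - w^\top z_t$ is exactly the TD error $d_t$, the equality holds step for step, which is what allows the worst-case bounds of Cesa-Bianchi et al. (1996) to be imported verbatim with instance-norm bound $(1+\gamma)X$.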
3.6. Discussions

Based on Eqns 4, 8, 12, and 13, Table 1 summarizes the asymptotic upper bounds (when $T \to \infty$) assuming $E$ and $W$ are known beforehand to optimize $\alpha$.¹

Table 1. Asymptotic upper bounds for the total squared prediction error and the total squared temporal difference of TD*(0) and RG.

             $\ell_P / \ell_P^u$                     $\ell_{TD} / \ell_{TD}^u$
  TD*(0)     $\frac{1}{1-\gamma^2} + o(1)$           $\frac{(1+2\gamma)^2}{(1-\gamma)^2} + o(1)$
  RG         $\frac{(1+2\gamma)^2}{(1-\gamma)^2} + o(1)$   $1 + o(1)$

Although our bounds are all upper bounds, the results in the table suggest that, in worst cases, TD*(0) (and also TD(0)) tends to make smaller prediction errors, while RG tends to make smaller temporal differences. The gaps between corresponding bounds increase as $\gamma \to 1$. On the other extreme, where $\gamma = 0$, all these asymptotic bounds coincide, which is not surprising as TD(0), TD*(0), and RG are all identical when $\gamma = 0$. Since it is unknown whether the leading constants in Table 1 are optimal, the next section will provide numerical evidence to support our claims about the relative strengths of these algorithms.

It is worth mentioning that in sequential prediction or decision problems, the factor $\frac{1}{1-\gamma}$ often plays a role similar to the decision horizon (Puterman, 1994). Therefore, in some sense, our bounds also characterize how prediction errors and temporal differences may scale with the decision horizon, in the worst-case sense. When $\ell_P^u$ or $\ell_{TD}^u$ are relatively small, the asymptotic bounds in Table 1 are less useful, as the $\|w_1 - u\|^2$ terms in the bounds dominate $\ell_P^u$ or $\ell_{TD}^u$. However, we still get similar qualitative results by comparing the constant factors of the term $\|w_1 - u\|^2$ in the bounds. Since our setting is quite different from that of Schoknecht and Merke (2003), our results are not comparable to theirs.

¹ Strictly speaking, the validity of these asymptotic results relies on the assumptions that (i) $E = o(\ell_P^u)$ and (ii) $W$ and $X$ remain constant as $T \to \infty$. Both assumptions are reasonable in practice.

4. Experiments

This section presents empirical evidence in two Markov chains that supports our claims in Section 3.6.

[Figure 1. The two Markov chains we used: (a) Ring and (d) PuddleWorld (Boyan & Moore, 1995), together with the average per-step squared prediction errors and per-step squared temporal differences of TD(0), TD*(0), and RG as functions of the discount factor. All results are averaged over 500 runs, with 99% confidence intervals plotted. Ring results are in panels (b, c) and PuddleWorld results in panels (e, f).]

The first problem is the Ring Markov chain (Figure 1 (a)), a variant of the Hall problem introduced by Baird (1995), in which RG was observed to converge to the optimal weights more slowly than TD(0). The state space is a ring consisting of 10 states numbered from 0 through 9. Each state is associated with a randomly selected feature vector of dimension $k = 5$: $x^{(0)}, \cdots, x^{(9)} \in \mathbb{R}^k$. Transitions are deterministic and are indicated by the arrows in Figure 1 (a). The reward in every state is stochastic and is distributed uniformly in $[-0.1, 0.1]$. As in Hall, the value of every state is exactly 0.
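A data generator for the Ring chain is straightforward to sketch. The code below is ours; the paper does not specify the distribution of the random feature vectors, so standard normal features are an assumption, and the function names are hypothetical.

```python
import numpy as np

def make_ring(n_states=10, k=5, rng=None):
    """Ring chain as described above: deterministic transitions around a ring,
    random feature vectors, rewards uniform in [-0.1, 0.1], true values all zero.
    (Standard normal features are our assumption, not from the paper.)"""
    if rng is None:
        rng = np.random.default_rng(0)
    features = rng.normal(size=(n_states, k))

    def step(s):
        r = rng.uniform(-0.1, 0.1)
        return (s + 1) % n_states, r

    return features, step

def rollout(features, step, T=500, s0=0):
    """Generate the observation sequence [(x_t, r_t)] fed to the learners."""
    xs, rs, s = [], [], s0
    for _ in range(T):
        xs.append(features[s])
        s, r = step(s)
        rs.append(r)
    return np.array(xs), np.array(rs)
```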
The second problem is a benchmark problem known as PuddleWorld (Boyan & Moore, 1995). The state space is the unit square (Figure 1 (d)), and the start state of an episode is selected randomly in $[0, 0.2] \times [0, 0.2]$. The agent adopts a fixed policy that goes north or east with probability 0.5 each. Every episode takes about 40 steps to terminate. The reward is $-1$ unless the agent steps into the puddles, in which case it receives an additional penalty; the smallest possible reward is $-41$. We used 16 RBF features of width 0.3, whose centers were evenly distributed in the state space. We also tried degree-two polynomial features: for a state $s = (s_1, s_2)$, the feature vector had six components, $x_s = \left(1, s_1, s_2, s_1 s_2, s_1^2, s_2^2\right)^\top$. Since the results are similar to those for RBF features, they are not included.

We ran three algorithms in the experiments: TD(0), TD*(0), and RG. For a fair comparison, all algorithms started with the all-one weight vector and were given the same sequence of $(x_t, r_t)$ for learning. The procedure was repeated 500 times. For Ring, each run used a different realization of the features $x^{(s)}$ and $T = 500$; for PuddleWorld, each run consisted of 50 episodes (yielding slightly less than 2000 steps in total). A wide range of step-sizes was tried, and the best choices for each discount-factor/algorithm combination were used to evaluate $\ell_P$ and $\ell_{TD}$, respectively.

Figure 1 (b, c, e, f) gives the average per-step squared prediction errors and squared temporal differences for these two problems, with 99% confidence intervals plotted. These results are consistent with our analysis: TD(0) and TD*(0) tended to make more accurate predictions, while RG did a better job at minimizing temporal differences; the differences between these algorithms became even larger as the discount factor approached 1.² Finally, as a side effect, it is verified that TD(0) and TD*(0) had essentially identical performance, although their best learning rates might differ.

² This effect was less obvious when $\gamma$ got too close to 1. This was because the trajectories in our experiments were not long enough for such $\gamma$ to have full impact.

5. Conclusion

We have carried out a worst-case analysis to compare two policy-evaluation algorithms, TD and RG, when linear function approximation is used. Together with previously known results due to Schapire and Warmuth (1996) and Cesa-Bianchi et al. (1996), our results suggest that, although the TD algorithms may make more accurate predictions, RG may be a better choice when small temporal differences are desired. This claim is supported by empirical evidence in two simple Markov chains. Although the analysis is purely mathematical, we expect the implications to deepen the understanding of these two types of algorithms and to provide useful insights to RL practitioners.

There has been relatively little attention to this sort of online-learning analysis within the RL community. Our analysis shows that this kind of analysis may be helpful and provide useful insights. A few directions are worth pursuing. First, we have focused on worst-case upper bounds, but it remains open whether matching lower bounds can be found. More extensive empirical studies are also necessary to see if such worst-case behavior can be observed in realistic problems. Second, we wish to generalize the analysis of total squared temporal difference from TD(0) and TD*(0) to TD(λ) and TD*(λ), respectively. Finally, we would like to mention that, in their original forms, both TD and RG use additive updates. Another class of updates, known as multiplicative updates (Kivinen & Warmuth, 1997), has been useful when the number of features (i.e., the $k$ in Subsection 2.1) is large but only a few of them are relevant for making predictions. Such learning rules have potential uses in RL (Precup & Sutton, 1997), but it remains open whether these algorithms converge or whether worst-case error bounds similar to the ones given in this paper can be obtained.
A. Lemmas and Proofs

Lemma A.1. This lemma collects a few basic facts useful in our analysis ($D$ is given in Eqn 5):
1. In all three algorithms, $d^u = D e^u$.
2. In TD*(0), $d_t = \left(1 - \alpha\gamma\, x_t^\top x_{t+1}\right)\left(e_t - \gamma e_{t+1}\right)$.
3. In RG, $d_t = \dfrac{e_t - \gamma e_{t+1}}{1 + \alpha\gamma\, (x_t - \gamma x_{t+1})^\top x_{t+1}}$.

Proof.
1. Since $y_t = r_t + \gamma y_{t+1}$, we have
$$d_t^u = r_t + \gamma u^\top x_{t+1} - u^\top x_t = \left(y_t - \gamma y_{t+1}\right) + \gamma u^\top x_{t+1} - u^\top x_t = \left(y_t - u^\top x_t\right) - \gamma\left(y_{t+1} - u^\top x_{t+1}\right) = e_t^u - \gamma e_{t+1}^u.$$
In matrix form, this is $d^u = D e^u$.
2. Since $\Delta w_t = w_{t+1} - w_t$ and $y_t = r_t + \gamma y_{t+1}$,
$$d_t = r_t + \gamma w_t^\top x_{t+1} - w_t^\top x_t = \left(y_t - w_t^\top x_t\right) - \gamma\left(y_{t+1} - w_{t+1}^\top x_{t+1}\right) - \gamma \Delta w_t^\top x_{t+1} = e_t - \gamma e_{t+1} - \frac{\alpha\gamma\, d_t\, x_t^\top x_{t+1}}{1 - \alpha\gamma\, x_t^\top x_{t+1}}.$$
Solving for $d_t$ completes the proof of this part.
3. Similar to the proof of part (2), except that $\Delta w_t$ is computed by Eqn 3.

Two technical lemmas are useful to prove Lemma A.4. It should be noted that the bounds they give are tight.

Lemma A.2. For $D$ given in Eqn 5, let $A$ be $D^\top D$ or $D D^\top$, and $B$ be $D^{-1} D^{-\top}$ or $D^{-\top} D^{-1}$. Then, $\sigma(A) \subseteq \left[(1-\gamma)^2, (1+\gamma)^2\right]$ and $\sigma(B) \subseteq \left[(1+\gamma)^{-2}, (1-\gamma)^{-2}\right]$.

Proof. It can be verified that $D^\top D$ equals
$$\begin{pmatrix} 1 & -\gamma & 0 & \cdots & 0 \\ -\gamma & 1+\gamma^2 & -\gamma & \cdots & 0 \\ & & \ddots & & \\ 0 & \cdots & -\gamma & 1+\gamma^2 & -\gamma \\ 0 & \cdots & 0 & -\gamma & 1+\gamma^2 \end{pmatrix}.$$
Since $D^\top D$ is symmetric, $\sigma(D^\top D) \subset \mathbb{R}$. It follows from Geršgorin's theorem (Horn & Johnson, 1986, Theorem 6.1.1) that $\sigma(D^\top D) \subseteq \left[(1-\gamma)^2, (1+\gamma)^2\right]$. The same holds for $D D^\top$. The second part follows immediately by observing that $D^{-1} D^{-\top} = \left(D^\top D\right)^{-1}$ and $D^{-\top} D^{-1} = \left(D D^\top\right)^{-1}$.

Lemma A.3. Let $D$ be given by Eqn 5; then
$$\sigma\!\left(D^{-1} + D^{-\top}\right) \subseteq \left[\frac{2}{1+\gamma},\ \frac{2}{1-\gamma}\right].$$

Proof. It can be verified that $D^{-1} + D^{-\top}$ equals the matrix $G$ with entries $G_{ii} = 2$ and $G_{ij} = \gamma^{|i-j|}$ for $i \ne j$, and that $(G - I)^{-1}$ equals
$$\frac{1}{1-\gamma^2}\begin{pmatrix} 1 & -\gamma & 0 & \cdots & 0 \\ -\gamma & 1+\gamma^2 & -\gamma & \cdots & 0 \\ & & \ddots & & \\ 0 & \cdots & -\gamma & 1+\gamma^2 & -\gamma \\ 0 & \cdots & 0 & -\gamma & 1 \end{pmatrix}.$$
Clearly, $(G - I)^{-1}$ is symmetric, and it follows from Geršgorin's theorem that
$$\sigma\!\left((G - I)^{-1}\right) \subseteq \left[\frac{1-\gamma}{1+\gamma},\ \frac{1+\gamma}{1-\gamma}\right].$$
Therefore,
$$\sigma(G - I) \subseteq \left[\frac{1-\gamma}{1+\gamma},\ \frac{1+\gamma}{1-\gamma}\right].$$
Consequently,
$$\sigma(G) \subseteq \left[1 + \frac{1-\gamma}{1+\gamma},\ 1 + \frac{1+\gamma}{1-\gamma}\right] = \left[\frac{2}{1+\gamma},\ \frac{2}{1-\gamma}\right].$$

We are now ready to prove the following lemma.

Lemma A.4. $\lambda(M_1) \le -\dfrac{2}{1+\gamma} + \alpha X^2 + \dfrac{\alpha}{b(1-\gamma)^2}$, where $M_1$ is given in Eqn 6.

Proof. By Weyl's theorem (Horn & Johnson, 1986, Theorem 4.3.1),
$$\lambda(M_1) \le \lambda\!\left(\alpha X^2 I\right) + \lambda\!\left(\frac{\alpha}{b} D^{-1} D^{-\top}\right) + \lambda\!\left(-\left(D^{-1} + D^{-\top}\right)\right).$$
The lemma then follows immediately from Lemmas A.2 and A.3.
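The spectral inclusions of Lemmas A.2 and A.3 are also easy to confirm numerically. The following sketch is ours; it builds the matrix $D$ of Eqn 5 for a few values of $\gamma$ and checks the stated intervals.

```python
import numpy as np

def check_spectra(T=50, gammas=(0.5, 0.9, 0.99)):
    for g in gammas:
        D = np.eye(T) - g * np.eye(T, k=1)              # Eqn 5
        DtD = D.T @ D
        S = np.linalg.inv(D) + np.linalg.inv(D).T       # D^{-1} + D^{-T}
        ev_A = np.linalg.eigvalsh(DtD)
        ev_S = np.linalg.eigvalsh(S)
        # Lemma A.2: sigma(D^T D) inside [(1-g)^2, (1+g)^2]
        assert ev_A.min() >= (1 - g) ** 2 - 1e-9 and ev_A.max() <= (1 + g) ** 2 + 1e-9
        # Lemma A.3: sigma(D^{-1} + D^{-T}) inside [2/(1+g), 2/(1-g)]
        assert ev_S.min() >= 2 / (1 + g) - 1e-9 and ev_S.max() <= 2 / (1 - g) + 1e-9

check_spectra()
```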
Lemma A.5. Let $M_2$ be defined by Eqn 10 and suppose the step-size is given by Eqn 11; then
$$\lambda(M_2) \le -\frac{(1-\gamma)^2}{(1+2\gamma)^2\left(X^2 + \frac{1}{b}\right)}.$$

Proof. Let $\beta = \frac{\alpha^2}{b}$ and $\mu = \alpha^2 X^2 (1+\gamma)^2 - 2\alpha$; then
$$M_2 = D^\top \Lambda \left(\beta D D^\top + \mu I\right) \Lambda D.$$
It is known that
$$\lambda(M_2) = \max_{v_1 \ne 0} \frac{v_1^\top M_2 v_1}{v_1^\top v_1}.$$
Define $v_2 = D v_1$ and we have:
$$\lambda(M_2) = \max_{v_2 \ne 0} \frac{v_2^\top \Lambda \left(\beta D D^\top + \mu I\right) \Lambda v_2}{v_2^\top D^{-\top} D^{-1} v_2} \le (1-\gamma)^2 \max_{v_2 \ne 0} \frac{v_2^\top \Lambda \left(\beta D D^\top + \mu I\right) \Lambda v_2}{v_2^\top v_2},$$
where the last step is due to Lemma A.2 and the fact that $M_2$ is negative definite for the step-size in Eqn 11. Similarly, we define $v_3 = \Lambda v_2$ and use the fact that
$$v_2^\top v_2 = v_3^\top \Lambda^{-2} v_3 \le \left(1 + \alpha\gamma(1+\gamma)X^2\right)^2 v_3^\top v_3$$
to obtain:
$$\lambda(M_2) \le (1-\gamma)^2 \max_{v_3 \ne 0} \frac{v_3^\top \left(\beta D D^\top + \mu I\right) v_3}{\left(1 + \alpha\gamma(1+\gamma)X^2\right)^2 v_3^\top v_3} \le \frac{(1-\gamma)^2\, \lambda\!\left(\beta D D^\top + \mu I\right)}{\left(1 + \alpha\gamma(1+\gamma)X^2\right)^2} \le \frac{(1-\gamma)^2\left(\beta(1+\gamma)^2 + \mu\right)}{\left(1 + \alpha\gamma(1+\gamma)X^2\right)^2}.$$
If we choose $\alpha$ as in Eqn 11, then $\beta(1+\gamma)^2 + \mu = \alpha\left(\alpha(1+\gamma)^2\left(X^2 + \frac{1}{b}\right) - 2\right) = -\alpha$, and the lemma follows immediately from the fact that
$$1 + \alpha\gamma(1+\gamma)X^2 \le 1 + \frac{\gamma(1+\gamma)X^2}{(1+\gamma)^2\left(X^2 + \frac{1}{b}\right)} \le 1 + \frac{\gamma}{1+\gamma} = \frac{1+2\gamma}{1+\gamma}.$$

Acknowledgment

We thank Michael Littman, Hengshuai Yao, and the anonymous reviewers for helpful comments that improved the presentation of the paper. The author is supported by NSF under grant IIS-0325281.

References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. Proceedings of the Twelfth International Conference on Machine Learning (ICML-95) (pp. 30-37).
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific.
Boyan, J. A., & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. Advances in Neural Information Processing Systems 7 (NIPS-94) (pp. 369-376).
Cesa-Bianchi, N., Long, P. M., & Warmuth, M. K. (1996). Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7, 604-619.
Horn, R. A., & Johnson, C. R. (1986). Matrix analysis. Cambridge University Press.
Kivinen, J., & Warmuth, M. K. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132, 1-63.
Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107-1149.
Munos, R. (2003). Error bounds for approximate policy iteration. Proceedings of the Twentieth International Conference on Machine Learning (ICML-03) (pp. 560-567).
Precup, D., & Sutton, R. S. (1997). Exponentiated gradient methods for reinforcement learning. Proceedings of the Fourteenth International Conference on Machine Learning (ICML-97) (pp. 272-277).
Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley-Interscience.
Schapire, R. E., & Warmuth, M. K. (1996). On the worst-case analysis of temporal-difference learning algorithms. Machine Learning, 22, 95-122.
Schoknecht, R., & Merke, A. (2003). TD(0) converges provably faster than the residual gradient algorithm. Proceedings of the Twentieth International Conference on Machine Learning (ICML-03) (pp. 680-687).
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42, 674-690.