Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view

Bruno Scherrer
LORIA - INRIA Lorraine - Campus Scientifique - BP 239, 54506 Vandœuvre-lès-Nancy CEDEX, FRANCE
scherrer@loria.fr

Abstract

We investigate projection methods for evaluating a linear approximation of the value function of a policy in a Markov Decision Process context. We consider two popular approaches, the one-step Temporal Difference fix-point computation (TD(0)) and the Bellman Residual (BR) minimization. We describe examples where each method outperforms the other. We highlight a simple relation between the objective functions they minimize, and show that while BR enjoys a performance guarantee, TD(0) does not in general. We then propose a unified view in terms of oblique projections of the Bellman equation, which substantially simplifies and extends the characterization of Schoknecht (2002) and the recent analysis of Yu & Bertsekas (2008). Finally, we describe some simulations suggesting that while the TD(0) solution is usually slightly better than the BR solution, its inherent numerical instability makes it very bad in some cases, and thus worse on average.

Introduction

We consider linear approximations of the value function of a policy in the framework of Markov Decision Processes (MDP). We focus on two popular methods: the computation of the projected Temporal Difference fixed point (TD(0), TD for short), which Antos et al. (2008), Farahmand et al. (2008) and Sutton et al. (2009) have recently presented as the minimization of the mean-square projected Bellman Equation, and the minimization of the mean-square Bellman Residual (BR). In this article, we present some new analytical and empirical data that shed some light on both approaches.

The paper is organized as follows. Section 1 describes the MDP linear approximation framework and the two projection methods. Section 2 presents small MDP examples where each method outperforms the other. Section 3 highlights a simple relation between the quantities TD and BR optimize, and shows that while BR enjoys a performance guarantee, TD does not in general. Section 4 contains the main contribution of this paper: we describe a unified view in terms of oblique projections of the Bellman equation, which simplifies and extends the characterization of Schoknecht (2002) and the recent analysis of Yu & Bertsekas (2008). Finally, Section 5 presents some simulations that address the following practical questions: which of the methods gives the best approximation? and how useful is our analysis for selecting it a priori?

1. Framework and Notations

The model. We consider an MDP with a fixed policy, that is, an uncontrolled discrete-time dynamic system with instantaneous rewards. We assume that there is a state space $X$ of finite size $N$. When at state $i \in \{1, \dots, N\}$, there is a transition probability $p_{ij}$ of getting to the next state $j$. Let $i_k$ be the state of the system at time $k$. At each time step, the system is given a reward $\gamma^k r(i_k)$, where $r$ is the instantaneous reward function and $0 < \gamma < 1$ is a discount factor. The value at state $i$ is defined as the total expected return:
$$v(i) := \lim_{N \to \infty} E\left[ \sum_{k=0}^{N-1} \gamma^k r(i_k) \,\Big|\, i_0 = i \right].$$
We write $P$ for the $N \times N$ stochastic matrix whose elements are the $p_{ij}$; $v$ can be seen as a vector of $\mathbb{R}^N$. It is known to be the unique fixed point of the Bellman operator $T v := r + \gamma P v$; that is, $v$ solves the Bellman Equation $v = T v$ and is equal to $L^{-1} r$, where $L = I - \gamma P$.
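To make the model concrete, here is a minimal numerical sketch (ours, not from the paper; Python with NumPy, all names hypothetical) that builds a small random MDP with a fixed policy and computes its value function by solving the Bellman Equation, $v = L^{-1} r = (I - \gamma P)^{-1} r$:

```python
import numpy as np

def exact_value(P, r, gamma):
    """Solve the Bellman Equation v = r + gamma * P v, i.e. v = (I - gamma*P)^{-1} r."""
    L = np.eye(P.shape[0]) - gamma * P
    return np.linalg.solve(L, r)

rng = np.random.default_rng(0)
N, gamma = 5, 0.9
P = rng.random((N, N))
P /= P.sum(axis=1, keepdims=True)   # normalize rows: P is a stochastic matrix
r = rng.standard_normal(N)          # instantaneous reward vector

v = exact_value(P, r, gamma)
assert np.allclose(v, r + gamma * P @ v)  # v is the fixed point of T
```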
Approximation Scheme. When the size $N$ of the state space is large, one usually comes down to solving the Bellman Equation approximately. One possibility is to look for an approximate solution $\hat v$ in some specific small space. The simplest and best understood choice is a linear parameterization: for all $i$, $\hat v(i) = \sum_{j=1}^{m} w_j \phi_j(i)$, where $m \ll N$, the $\phi_j$ are some feature functions that should capture the general shape of $v$, and the $w_j$ are the weights that characterize the approximate value $\hat v$. For all $i$ and $j$, write $\phi_j$ for the $N$-dimensional vector corresponding to the $j$-th feature function and $\phi(i)$ for the $m$-dimensional vector giving the features of state $i$. For any vector or matrix $X$, denote $X'$ its transpose. The following $N \times m$ feature matrix
$$\Phi = (\phi_1 \cdots \phi_m) = (\phi(i_1) \cdots \phi(i_N))'$$
leads to writing the parameterization of $\hat v$ in condensed matrix form: $\hat v = \Phi w$, where $w = (w_1, \dots, w_m)'$ is the $m$-dimensional weight vector. From now on we denote $\mathrm{span}(\Phi)$ this subspace of $\mathbb{R}^N$ and assume that the vectors $\phi_1, \dots, \phi_m$ form a linearly independent set.

Some approximation $\hat v$ of $v$ can be obtained by minimizing $\hat v \mapsto \|\hat v - v\|$ for some norm $\|\cdot\|$, that is, equivalently, by projecting $v$ onto $\mathrm{span}(\Phi)$ orthogonally with respect to $\|\cdot\|$. In a very general way, any symmetric positive definite matrix $Q$ induces a quadratic norm $\|\cdot\|_Q$ on $\mathbb{R}^N$ as follows: $\|v\|_Q = \sqrt{v' Q v}$. It is well known that the orthogonal projection with respect to such a norm, which we will denote $\Pi_{\|\cdot\|_Q}$, has the following closed form: $\Pi_{\|\cdot\|_Q} = \Phi\, \pi_{\|\cdot\|_Q}$, where $\pi_{\|\cdot\|_Q} = (\Phi' Q \Phi)^{-1} \Phi' Q$ is the linear application from $\mathbb{R}^N$ to $\mathbb{R}^m$ that returns the coordinates of the projection of a point in the basis $(\phi_1, \dots, \phi_m)$. With these notations, the following relations hold: $\pi_{\|\cdot\|_Q} \Phi = I$ and $\pi_{\|\cdot\|_Q} \Pi_{\|\cdot\|_Q} = \pi_{\|\cdot\|_Q}$.

In an MDP approximation context, where one is modeling a stochastic system, one usually considers a specific kind of norm/projection. Let $\mu = (\mu_i)$ be some distribution on $X$ such that $\mu > 0$ (it assigns a positive probability to all states). Let $D$ be the diagonal matrix with the elements of $\mu$ on the diagonal. Consider the orthogonal projection of $\mathbb{R}^N$ onto the feature space $\mathrm{span}(\Phi)$ with respect to the $\mu$-weighted quadratic norm $\|v\|_\mu^2 = v' D v = \sum_{i=1}^{N} \mu_i v_i^2$. For clarity of exposition, we will denote this specific projection $\Pi := \Pi_{\|\cdot\|_\mu} = \Phi \pi$, where $\pi := \pi_{\|\cdot\|_\mu} = (\Phi' D \Phi)^{-1} \Phi' D$.

Ideally, one would like to compute the "best" approximation $\hat v_{best} = \Phi w_{best}$ with $w_{best} = \pi v = \pi L^{-1} r$. This can be done with algorithms like TD(1) / LSTD(1) (Bertsekas & Tsitsiklis, 1996; Boyan, 2002), but they require simulating infinitely long trajectories and usually suffer from a high variance. The projection methods, which we focus on in this paper, are alternatives that only consider one-step samples.

TD(0) fix point method. The principle of the TD(0) method (TD for short) is to look for a fixed point of $\Pi T$, that is, one looks for $\hat v_{TD}$ in the space $\mathrm{span}(\Phi)$ satisfying $\hat v_{TD} = \Pi T \hat v_{TD}$. Assuming that the matrix inverse below exists (this is not necessarily the case, as the forthcoming Example 1 of Section 2 shows), it can be proved (Section 4 will generalize this derivation) that $\hat v_{TD} = \Phi w_{TD}$ with
$$w_{TD} = (\Phi' D L \Phi)^{-1} \Phi' D r. \qquad (1)$$
As pointed out by Antos et al. (2008), Farahmand et al. (2008) and Sutton et al. (2009), when the inverse exists, the above computation is equivalent to minimizing over $\hat v \in \mathrm{span}(\Phi)$ the TD error $E_{TD}(\hat v) := \|\hat v - \Pi T \hat v\|_\mu$ down to 0. This remark remains true if we replace $\|\cdot\|_\mu$ by any equivalent norm; this observation led Sutton et al. (2009) to propose original off-policy gradient algorithms for computing the TD solution.
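Equation (1) translates directly into code. A hedged sketch (our own helper names, reusing the MDP generated in the previous snippet) that solves the $m \times m$ system $(\Phi' D L \Phi)\, w_{TD} = \Phi' D r$ rather than forming the inverse explicitly:

```python
def td0_weights(Phi, P, r, gamma, mu):
    """TD(0) fixed-point weights of Equation (1): w = (Phi' D L Phi)^{-1} Phi' D r.
    The m x m matrix Phi' D L Phi is assumed invertible, which is not always
    the case (cf. Example 1 in Section 2)."""
    D = np.diag(mu)
    L = np.eye(P.shape[0]) - gamma * P
    return np.linalg.solve(Phi.T @ D @ L @ Phi, Phi.T @ D @ r)

Phi = rng.standard_normal((N, 2))   # two arbitrary features
mu = np.full(N, 1.0 / N)            # a uniform weighting distribution (our choice)
w_td = td0_weights(Phi, P, r, gamma, mu)

# Sanity check: Phi @ w_td is a fixed point of the projected operator Pi T,
# with Pi = Phi (Phi' D Phi)^{-1} Phi' D the mu-weighted orthogonal projection.
D = np.diag(mu)
Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)
v_td = Phi @ w_td
assert np.allclose(v_td, Pi @ (r + gamma * P @ v_td))
```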
BR minimization method. The principle of the Bellman Residual (BR) method is to look for $\hat v \in \mathrm{span}(\Phi)$ that minimizes the norm of the Bellman Residual, that is, the quantity $E_{BR}(\hat v) := \|\hat v - T \hat v\|_\mu$. Since $\hat v$ is of the form $\Phi w$, it can be seen that $E_{BR}(\hat v) = \|\Phi w - \gamma P \Phi w - r\|_\mu = \|\Psi w - r\|_\mu$, using the notation $\Psi = L \Phi$. Using standard linear least squares arguments, one can see that the minimal BR is obtained for $\hat v_{BR} = \Phi w_{BR}$ with
$$w_{BR} = (\Psi' D \Psi)^{-1} \Psi' D r. \qquad (2)$$
Note that in this case, the above inverse always exists (Schoknecht, 2002).

2. Two simple examples

Example 1. Consider the 2-state MDP with
$$P = \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}.$$
Denote the rewards $r_1$ and $r_2$. One thus has $v(1) = r_1 + \frac{\gamma r_2}{1-\gamma}$ and $v(2) = \frac{r_2}{1-\gamma}$. Consider the one-feature linear approximation with $\Phi = (1\ \ 2)'$ and the uniform distribution $\mu = (.5\ \ .5)'$. Then $\Phi' D \Phi = \frac{5}{2}$, therefore $\pi = (\frac{1}{5}\ \ \frac{2}{5})$, and the weight of the best approximation is $w_{best} = \pi v = \frac{1}{5} r_1 + \frac{2+\gamma}{5(1-\gamma)} r_2$. This example was proposed by Bertsekas & Tsitsiklis (1996) in order to show that fitted Value Iteration can diverge if the samples are not generated by the stationary distribution of the policy. In (Bertsekas & Tsitsiklis, 1996), the authors only consider the case $r_1 = r_2 = 0$, so that this divergence result held even though the exact value function $v(1) = v(2) = 0$ did belong to the feature space. In the case $r_1 = r_2 = 0$, the TD and BR methods do compute the exact solution (we will see later that this is indeed a general fact when the exact value function belongs to the feature space). We thus extend this model by taking $(r_1, r_2) \neq (0, 0)$. As a scaling of the reward is translated exactly in the approximation, we consider the general form $(r_1, r_2) = (\cos\theta, \sin\theta)$.

Consider the TD solution: one has $\Phi' D = \frac{1}{2}(1\ \ 2)$ and $(I - \gamma P)\Phi = (1 - 2\gamma\ \ \ 2 - 2\gamma)'$, thus $\Phi' D L \Phi = \frac{5 - 6\gamma}{2}$ and $\Phi' D r = \frac{r_1 + 2 r_2}{2}$. Eventually the weight of the TD approximation is $w_{TD} = \frac{r_1 + 2 r_2}{5 - 6\gamma}$. One notices here that the value $\gamma = 5/6$ is singular. Now consider the BR solution. One can see that $(\Psi' D \Psi)^{-1} = \frac{2}{(1-2\gamma)^2 + (2-2\gamma)^2}$ and $\Psi' D r = \frac{(1-2\gamma) r_1 + (2-2\gamma) r_2}{2}$. Thus, the weight of the BR approximation is
$$w_{BR} = \frac{(1-2\gamma)\, r_1 + (2-2\gamma)\, r_2}{(1-2\gamma)^2 + (2-2\gamma)^2}.$$

For all these approximations, one can compute the squared error $e$ with respect to the optimal solution $v$: for any weight $w \in \{w_{best}, w_{TD}, w_{BR}\}$,
$$e(w) = \|v - \Phi w\|_\mu^2 = \tfrac{1}{2}\,(v(1) - w)^2 + \tfrac{1}{2}\,(v(2) - 2w)^2.$$
In Figure 1, we plot the squared error ratios $e(w_{TD})/e(w_{best})$ and $e(w_{BR})/e(w_{best})$ on a log scale (they are by definition greater than 1) with respect to $\gamma$ and $\theta$. It turns out that these ratios do not depend on $\theta$ (instead of showing this through painful arithmetic manipulations, we will come back to this point and prove it later on); the Figure therefore also displays the graph with respect to $\gamma$ only.

Figure 1. Error ratio (in log scale) between the TD/BR projection methods and the best approximation for Example 1, with respect to the discount factor $\gamma$ and the parameter $\theta$ of the reward (Left). It turns out that these surfaces do not depend on $\theta$, so we also draw the graph with respect to $\gamma$ only (Right).

We can observe that for any choice of reward function and discount factor, the BR method returns a better value than the TD method. Also, when $\gamma$ is in the neighborhood of $5/6$, the TD error ratio tends to $\infty$ while BR's stays bounded. This Example shows that there exist MDPs where BR is consistently better than TD, which can give an unbounded error. One should however not conclude too quickly that BR is always better than TD. The literature contains several arguments in favor of TD, one of which is considered in the following Example.
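Before turning to it, the closed forms of Example 1 are easy to check numerically; a small sketch (ours) that evaluates $w_{TD}$, $w_{BR}$ and $w_{best}$ and exhibits the singularity at $\gamma = 5/6$:

```python
import numpy as np

P = np.array([[0.0, 1.0], [0.0, 1.0]])
Phi = np.array([[1.0], [2.0]])
D = np.diag([0.5, 0.5])                 # uniform distribution mu

def example1_weights(gamma, theta):
    r = np.array([np.cos(theta), np.sin(theta)])
    L = np.eye(2) - gamma * P
    Psi = L @ Phi
    w_td = np.linalg.solve(Phi.T @ D @ L @ Phi, Phi.T @ D @ r).item()
    w_br = np.linalg.solve(Psi.T @ D @ Psi, Psi.T @ D @ r).item()
    v = np.linalg.solve(L, r)           # exact value function
    w_best = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ v).item()
    return w_td, w_br, w_best

for gamma in (0.80, 0.83, 0.8330):      # approaching the singular value 5/6
    print(gamma, example1_weights(gamma, theta=1.0))
# w_td = (r1 + 2 r2) / (5 - 6 gamma) blows up near gamma = 5/6,
# while w_br and w_best remain bounded.
```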
Example 2. Sutton et al. (2009) recently described a 3-state MDP example where the TD method computes the best projection while BR does not. The idea behind this 3-state example can be described in a quite general way (the rest of this section is strongly inspired by a personal communication with Yu): Suppose we have a $(k+l)$-state MDP of which the Bellman Equation has a block triangular structure:
$$v_1 = \gamma P_1 v_1 + r_1, \qquad v_2 = \gamma P_{21} v_1 + \gamma P_{22} v_2 + r_2,$$
where $v_1 \in \mathbb{R}^k$ and $v_2 \in \mathbb{R}^l$ (the concatenation of the vectors $v_1$ and $v_2$ forms the value function). Suppose also that the approximation subspace $\mathrm{span}(\Phi)$ is $\mathbb{R}^k \times S_2$, where $S_2$ is a subspace of $\mathbb{R}^l$. For the first component $v_1$, the approximation space is the entire space $\mathbb{R}^k$. With TD, we obtain the exact value for the $k$ first components of the value, while with Bellman residual minimization we do not: satisfying the first equation exactly is traded for decreasing the error in satisfying the second one (which also involves $v_1$). In an optimal control context, the example above can have quite dramatic implications, as $v_1$ can be related to the costs at some future states accessible from those states associated with $v_2$, and the future costs are all that matters when making decisions.

Overall, the two methods generate different types of biases and distribute error in different manners. In order to gain some more insight, we now turn to some analytical facts about them.

3. A Relation and Stability Issues

Though several works have compared and considered both methods (Schoknecht, 2002; Lagoudakis & Parr, 2003; Munos, 2003; Yu & Bertsekas, 2008), the following simple fact has, to our knowledge, never been emphasized per se:

Proposition 1. The BR is an upper bound of the TD error; more precisely, for all $\hat v \in \mathrm{span}(\Phi)$,
$$E_{BR}(\hat v)^2 = E_{TD}(\hat v)^2 + \|T \hat v - \Pi T \hat v\|_\mu^2.$$

Proof. This simply follows from the Pythagorean theorem, as $T \hat v - \Pi T \hat v$ is orthogonal to $\mathrm{span}(\Phi)$ and $\hat v - \Pi T \hat v$ belongs to $\mathrm{span}(\Phi)$.

This implies that if one can make the BR small, then the TD error will also be small. In the limit case where one can make the BR equal to 0, the TD error is also 0.
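The identity of Proposition 1 is easy to check numerically; a sketch (ours, with all quantities generated at random) that verifies the Pythagorean decomposition for an arbitrary point of $\mathrm{span}(\Phi)$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, gamma = 6, 3, 0.9
P = rng.random((N, N)); P /= P.sum(axis=1, keepdims=True)
r = rng.standard_normal(N)
Phi = rng.standard_normal((N, m))
mu = rng.random(N); mu /= mu.sum()
D = np.diag(mu)

Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)  # mu-orthogonal projection

def norm2(x):
    return x @ D @ x                    # squared mu-weighted norm

v_hat = Phi @ rng.standard_normal(m)    # an arbitrary element of span(Phi)
T_v = r + gamma * P @ v_hat             # Bellman operator applied to v_hat

E_br2 = norm2(v_hat - T_v)              # squared Bellman Residual
E_td2 = norm2(v_hat - Pi @ T_v)         # squared TD error
adequacy2 = norm2(T_v - Pi @ T_v)       # squared "adequacy" term
assert np.isclose(E_br2, E_td2 + adequacy2)   # Proposition 1
```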
One of the motivations for minimizing the BR is historically related to a well-known result of Williams & Baird (1993): $\forall \hat v,\ \|v - \hat v\|_\infty \le \frac{1}{1-\gamma} \|T \hat v - \hat v\|_\infty$. Since one considers the weighted quadratic norm in practice (mainly because it is computationally easier than doing a max-norm minimization; see however (Guestrin et al., 2001) for an attempt at max-norm projection), the related result that really makes sense here is: $\forall \hat v,\ \|v - \hat v\|_\mu \le \frac{\sqrt{C(\mu)}}{1-\gamma} \|T \hat v - \hat v\|_\mu$, where $C(\mu) := \max_{i,j} \frac{p_{ij}}{\mu_j}$ is a "concentration coefficient" that can be seen as some measure of the stochasticity of the MDP. (The proof is a consequence of Jensen's inequality, and the arguments are very close to the ones in (Munos, 2003). If $\mu$ is the uniform law, then there always exists such a $C(\mu) \in (1, N)$, where one recalls that $N$ is the size of the state space; in such a case, $C(\mu)$ is minimal if all next states are chosen with the uniform law, and maximal as soon as there exists a deterministic transition. See (Munos, 2003) for more discussion of this coefficient.) This result shows that it is sound to minimize the BR, since it controls (through a constant) the approximation error $\|v - \hat v_{BR}\|_\mu$.

On the TD side, there does not exist any similar result. Actually, the fact that one can build examples (like Example 1) where the TD projection is numerically unstable implies that one cannot prove such a result. Proposition 1 allows us to understand the TD method better: by minimizing the TD error, one only minimizes one part of the BR; equivalently, this means that one does not care about the term $\|T \hat v - \Pi T \hat v\|_\mu^2$, which may be interpreted as a measure of adequacy of the projection with the Bellman operator $T$. In Example 1, the approximation error of the TD projection goes to infinity because this adequacy term diverges. In (Munos & Szepesvári, 2008), the authors use an algorithm based on the TD error and make an assumption on this adequacy term (there called the inherent Bellman error of the approximation space), so that their algorithm can be proved convergent.

A complementary view on the potential instability of TD has been referred to as a norm incompatibility issue (Bertsekas & Tsitsiklis, 1996; Guestrin et al., 2001), and can be revisited through the notion of concentration coefficient. Stochastic matrices $P$ satisfy $\|P\|_\infty = 1$, which makes the Bellman operator $T$ contracting, and thus its fixed point is well-defined. The orthogonal projection with respect to $\|\cdot\|_\mu$ is such that $\|\Pi\|_\mu = 1$. Thus $P$ and $\Pi$ are of norm 1, but for different norms. Unfortunately, a general (tight) bound for linear projections is $\|\Pi\|_\infty \le \sqrt{C(\mu)}$ (Thompson, 1996), and it can be shown that $\|P\|_\mu \le \sqrt{C(\mu)}$ (which can thus also be of the order of $\sqrt{N}$): one can prove that for all $x$, $\|P x\|_\mu^2 \le \sum_i \mu_i \sum_j p_{ij} x_j^2 \le C(\mu) \|x\|_\mu^2$, where the first inequality involves Jensen's inequality and is again close to what is done in (Munos, 2003). Consequently, $\|\Pi P\|_\infty$ and $\|\Pi P\|_\mu$ may be greater than 1, and thus the fixed point of the projected Bellman equation may not be well-defined.

A known exception where the composition $\Pi P$ has norm 1 is when one can prove that $\|P\|_\mu = 1$ (as for instance when $\mu$ is the stationary distribution of $P$), and in this case we know from Bertsekas & Tsitsiklis (1996); Tsitsiklis & Van Roy (1997) that
$$\|v - \hat v_{TD}\|_\mu \le \frac{1}{\sqrt{1-\gamma^2}}\, \|v - \hat v_{best}\|_\mu. \qquad (3)$$
Another notable such exception is when $\|\Pi\|_\infty = 1$, as in the so-called "averager" approximation (Gordon, 1995). However, in general, the stability of TD is difficult to guarantee.

4. The unified oblique projection view

In the TD approach, we consider finding the fixed point of the composition of an orthogonal projection and the Bellman operator $T$. Suppose now that we consider using a (not necessarily orthogonal) projection onto $\mathrm{span}(\Phi)$, that is, any linear operator that satisfies $\Pi^2 = \Pi$ and whose range is $\mathrm{span}(\Phi)$. In their most general form, such operators are called oblique projections and can be written $\Pi_X = \Phi \pi_X$ with $\pi_X = (X' \Phi)^{-1} X'$. The parameter $X$ specifies the projection direction: precisely, $\Pi_X$ is the projection onto $\mathrm{span}(\Phi)$ orthogonally to $\mathrm{span}(X)$. As for the orthogonal projections, the relations $\pi_X \Phi = I$ and $\pi_X \Pi_X = \pi_X$ hold. Recall that $L = I - \gamma P$. We are ready to state the main result of this paper:

Proposition 2. Write $X_{TD} = D \Phi$ and $X_{BR} = D L \Phi$. (1) The TD fix point computation and the BR minimization are solutions (respectively with $X = X_{TD}$ and $X = X_{BR}$) of the projected equation $\hat v_X = \Pi_X T \hat v_X$. (2) When it exists, the solution of this projected equation is the projection of $v$ onto $\mathrm{span}(\Phi)$ orthogonally to $\mathrm{span}(L' X)$, i.e. formally $\hat v_X = \Pi_{L' X}\, v$.
Proof. We begin by showing part (2). Writing $\hat v_X = \Phi w_X$, the fixed point equation is $\Phi w_X = \Pi_X (r + \gamma P \Phi w_X)$. Multiplying both sides by $\pi_X$, one obtains $w_X = \pi_X (r + \gamma P \Phi w_X)$ and therefore $w_X = (I - \gamma \pi_X P \Phi)^{-1} \pi_X r$. Using the definition of $\pi_X$, one obtains:
$$w_X = \left(I - \gamma (X'\Phi)^{-1} X' P \Phi\right)^{-1} (X'\Phi)^{-1} X' r = \left( (X'\Phi)\left(I - \gamma (X'\Phi)^{-1} X' P \Phi\right) \right)^{-1} X' r = \left(X'(I - \gamma P)\Phi\right)^{-1} X' r,$$
that is,
$$w_X = (X' L \Phi)^{-1} X' r = (X' L \Phi)^{-1} X' L v = \pi_{L'X}\, v, \qquad (4)$$
hence $\hat v_X = \Phi w_X = \Pi_{L'X}\, v$, where we eventually used $r = L v$. The proof of part (1) now follows. The fact that TD is a special case with $X = X_{TD} = D\Phi$ is trivial by construction, since then $\Pi_X$ is the orthogonal projection with respect to $\|\cdot\|_\mu$. When $X = X_{BR} = D L \Phi$, one simply needs to observe from Equations 2 and 4 and the definition of $\Psi = L\Phi$ that $w_X = w_{BR}$.

Beyond its nice and simple geometric flavour, a direct consequence of Proposition 2 is that it allows us to derive tight error bounds for TD, BR, and any other method corresponding to a general $X$.
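Both parts of Proposition 2 can be checked numerically; a sketch (ours, on randomly generated quantities as before) that builds $\Pi_{L'X}\, v$ and recovers the TD and BR solutions of Equations (1) and (2):

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, gamma = 7, 3, 0.9
P = rng.random((N, N)); P /= P.sum(axis=1, keepdims=True)
r = rng.standard_normal(N)
Phi = rng.standard_normal((N, m))
mu = rng.random(N); mu /= mu.sum()
D = np.diag(mu)
L = np.eye(N) - gamma * P
v = np.linalg.solve(L, r)               # exact value function

def oblique_solution(X):
    """Proposition 2, part (2): v_X = Pi_{L'X} v = Phi (X' L Phi)^{-1} X' L v."""
    return Phi @ np.linalg.solve(X.T @ L @ Phi, X.T @ L @ v)

# X = X_TD = D Phi recovers the TD fix point of Equation (1).
w_td = np.linalg.solve(Phi.T @ D @ L @ Phi, Phi.T @ D @ r)
assert np.allclose(oblique_solution(D @ Phi), Phi @ w_td)

# X = X_BR = D L Phi recovers the BR minimizer of Equation (2).
Psi = L @ Phi
w_br = np.linalg.solve(Psi.T @ D @ Psi, Psi.T @ D @ r)
assert np.allclose(oblique_solution(D @ Psi), Phi @ w_br)
```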
For any square matrix $M$, write $\rho(M)$ for its spectral radius.

Proposition 3. For any choice of $X$, the approximation error satisfies:
$$\|v - \hat v_X\|_\mu \le \|\Pi_{L'X}\|_\mu\, \|v - \hat v_{best}\|_\mu = \sqrt{\rho(A B C B')}\, \|v - \hat v_{best}\|_\mu, \qquad (5)$$
where $A = \Phi' D \Phi$, $B = (X' L \Phi)^{-1}$ and $C = X' L D^{-1} L' X$ are matrices of size $m \times m$.

Thus, for any $X$, the amplification of the smallest error $\|v - \hat v_{best}\|_\mu$ depends on the norm of the associated oblique projection, which can be estimated as the spectral radius of a product of small matrices. A simple corollary of this Proposition is the following: if the real value $v$ belongs to the feature space $\mathrm{span}(\Phi)$ (in such a case $v = \hat v_{best}$), then all oblique projection methods find it ($\hat v_X = v$).

Proof of Proposition 3. Proposition 2 implies that $v - \hat v_X = (I - \Pi_{L'X}) v = (I - \Pi_{L'X})(I - \Pi) v$, where we used the fact that $\Pi_{L'X} \Pi = \Pi$, since $\Pi_{L'X}$ and $\Pi$ are projections onto $\mathrm{span}(\Phi)$. Taking the norm, one obtains $\|v - \hat v_X\|_\mu \le \|I - \Pi_{L'X}\|_\mu\, \|v - \Pi v\|_\mu = \|\Pi_{L'X}\|_\mu\, \|v - \hat v_{best}\|_\mu$, where we used the definition of $\hat v_{best}$ and the fact that $\|I - \Pi_{L'X}\|_\mu = \|\Pi_{L'X}\|_\mu$ since $\Pi_{L'X}$ is a (non-trivial) projection (see e.g. (Szyld, 2006)). Thus the inequality of Equation 5 holds. In order to evaluate the norm in terms of small-size matrices, one uses the following Lemma on the projection matrix $\Pi_{L'X} = \Phi \pi_{L'X}$:

Lemma 1 (Yu & Bertsekas (2008)). Let $Y$ be an $N \times m$ matrix and $Z$ an $m \times N$ matrix; then $\|Y Z\|_\mu^2 = \rho\big( (Y' D Y)(Z D^{-1} Z') \big)$.

Thus,
$$\|\Pi_{L'X}\|_\mu^2 = \|\Phi \pi_{L'X}\|_\mu^2 = \rho\big[ (\Phi' D \Phi)\, (X' L \Phi)^{-1} X' L D^{-1} L' X\, (\Phi' L' X)^{-1} \big] = \rho[A B C B'].$$

Proposition 2 is closely related to the work of (Schoknecht, 2002), in which the author derived the following characterization of the TD and BR solutions:

Proposition 4 (Schoknecht (2002)). The TD fix point computation and the BR minimization are orthogonal projections of the value $v$, respectively induced by the seminorm $\|\cdot\|_{Q_{TD}}$ with $Q_{TD} = L' D \Phi \Phi' D L$ and by the norm $\|\cdot\|_{Q_{BR}}$ with $Q_{BR} = L' D L$. ($\|\cdot\|_{Q_{TD}}$ is a seminorm because the matrix $Q_{TD}$ is only semidefinite, since it has rank smaller than $m < N$; the corresponding projection can still be well defined, i.e. each point has exactly one projection, provided that $\mathrm{span}(\Phi) \cap \{x;\ \|x\|_{Q_{TD}} = 0\} = \{0\}$.)

This "orthogonal projection" characterization and our "oblique projection" characterization are in fact equivalent. On the one hand, for BR, it is immediate to notice that $\pi_{\|\cdot\|_{Q_{BR}}} = \pi_{L' X_{BR}}$. On the other hand, for TD, writing $Y = L' X_{TD}$, one simply needs to notice that
$$\pi_{L' X_{TD}} = \pi_Y = (Y' \Phi)^{-1} Y' = (Y' \Phi)^{-1} \big( (Y' \Phi)' \big)^{-1} (Y' \Phi)'\, Y' = (\Phi' Y Y' \Phi)^{-1} \Phi' Y Y' = \pi_{\|\cdot\|_{Q_{TD}}}.$$

The work of Schoknecht (2002) suggests that TD and BR are optimal for different criteria, since both look for some $\hat v \in \mathrm{span}(\Phi)$ that minimizes $\|\hat v - v\|$ for some (semi)norm $\|\cdot\|$. Curiously, our result suggests that neither is optimal, since neither uses the best projection direction $X_* := (L')^{-1} D \Phi$, for which $\hat v_{X_*} = \Pi_{L' X_*}\, v = \Pi v = \hat v_{best}$; this supports the empirical evidence that there is no clear "winner" between TD and BR.

Our main results, stated in Propositions 2 and 3, constitute a revisit of the work of Yu & Bertsekas (2008), where the authors similarly derived error bounds for TD and BR. Our approach mimicks theirs: 1) we derive a linear relation between the projection $\hat v_X$, the real value $v$ and the best projection $\hat v_{best}$, then 2) we analyze the norm of the matrices involved in this relation in terms of spectral radii of small matrices (through Lemma 1, which is taken from (Yu & Bertsekas, 2008)). From a purely quantitative point of view, our bounds are identical to the ones derived there. Two immediate consequences of this quantitative equivalence are that, as in (Yu & Bertsekas, 2008), (1) our bound is tight in the sense that there exists a worst choice of reward for which it holds with equality, and (2) it is always better than that of Equation 3 from Bertsekas & Tsitsiklis (1996); Tsitsiklis & Van Roy (1997). However, our work is qualitatively different: by highlighting the oblique projection relation between $\hat v_X$ and $v$, not only do we provide a clear geometric intuition for both methods, but we also greatly simplify the form of the results and their proofs (see (Yu & Bertsekas, 2008) for details). Last but not least, there is globally a significant difference between our work and the two works we have just mentioned: the analysis we propose is unified for TD and BR (and even extends to potential new methods through other choices of the parameter $X$), while the results in (Schoknecht, 2002) and (Yu & Bertsekas, 2008) are proved independently for each method. We hope that our unified approach will help in understanding better the pros and cons of TD, BR, and related alternative approaches.
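The bound of Proposition 3 only involves $m \times m$ matrices and is cheap to evaluate; a sketch (ours, reusing the quantities of the previous snippet) of the computation of $\|\Pi_{L'X}\|_\mu = \sqrt{\rho(ABCB')}$:

```python
import numpy as np

def prop3_norm(X, Phi, L, D):
    """||Pi_{L'X}||_mu = sqrt(rho(A B C B')) with A = Phi' D Phi,
    B = (X' L Phi)^{-1}, C = X' L D^{-1} L' X (Proposition 3)."""
    A = Phi.T @ D @ Phi
    B = np.linalg.inv(X.T @ L @ Phi)
    C = X.T @ L @ np.linalg.inv(D) @ L.T @ X
    return np.sqrt(np.max(np.abs(np.linalg.eigvals(A @ B @ C @ B.T))))

# Error bound of each method: b = prop3_norm(X, ...) * ||v - v_best||_mu, with
# b_td = prop3_norm(D @ Phi, Phi, L, D)      # TD (X = X_TD)
# b_br = prop3_norm(D @ L @ Phi, Phi, L, D)  # BR (X = X_BR)
```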
5. An Empirical Comparison

In order to further compare the TD and the BR projections, we have made some empirical comparisons, which we describe now. We consider spaces of dimension $n = 2, 3, \dots, 30$. For each $n$, we consider projections of dimension $k = 1, 2, \dots, n$. For each $(n, k)$ couple, we generate 20 random projections (through random feature matrices of size $(n, k)$, each entry being a random uniform number between $-1$ and $1$, and random weight vectors $\mu$) and 20 random (uncontrolled) chain-like MDPs: from each state $i$, there is a probability $p_i$ (chosen randomly uniformly on $(0, 1)$) of getting to state $i + 1$ and a probability $1 - p_i$ of staying in $i$ (the last state is absorbing); the reward is a random vector. For the $20 \times 20$ resulting combinations, we compute the real value $v$, its exact projection $\hat v_{best}$, the TD fix point $\hat v_{TD}$, and the BR projection $\hat v_{BR}$. We then deduce the best error $e = \|v - \hat v_{best}\|_\mu$, the TD error $e_{TD} = \|v - \hat v_{TD}\|_\mu$ and the BR error $e_{BR} = \|v - \hat v_{BR}\|_\mu$. We also compute the bounds of Proposition 3 for both methods: $b_{TD}$ and $b_{BR}$. Each such experiment is done for 4 different values of the discount factor $\gamma$: 0.9, 0.95, 0.99, 0.999.

Using this raw data on $20 \times 20$ problems, we compute for each $(n, k)$ couple some statistics, which we describe now. All the graphs that we display show the dimension $n$ of the space and the dimension $k$ of the projected space on the x and y axes. The z axis corresponds to the different statistics of interest.

Figure 2. TD win ratio.

Figure 2 shows the proportion of sampled problems where the TD method returns a better approximation than BR (i.e. the expectation of the indicator function of $e_{TD} < e_{BR}$). It turns out that this ratio is consistently greater than $\frac{1}{2}$, which means that the TD method is usually better than the BR method.
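A minimal sketch (ours; smaller sample counts than the paper's $20 \times 20$ design, and a uniform weight vector where the paper draws it at random) of this experimental protocol on random chain-like MDPs:

```python
import numpy as np

rng = np.random.default_rng(3)

def random_chain(n):
    """From state i, move to i+1 w.p. p_i, stay w.p. 1 - p_i; last state absorbing."""
    P = np.zeros((n, n))
    for i in range(n - 1):
        p = rng.random()
        P[i, i + 1], P[i, i] = p, 1.0 - p
    P[n - 1, n - 1] = 1.0
    return P

def td_win_ratio(n=10, k=3, gamma=0.9, trials=100):
    D = np.diag(np.full(n, 1.0 / n))    # uniform weights (our simplification)
    wins = 0
    for _ in range(trials):
        P, r = random_chain(n), rng.standard_normal(n)
        Phi = rng.uniform(-1.0, 1.0, (n, k))
        L = np.eye(n) - gamma * P
        v = np.linalg.solve(L, r)
        Psi = L @ Phi
        e_td = v - Phi @ np.linalg.solve(Phi.T @ D @ L @ Phi, Phi.T @ D @ r)
        e_br = v - Phi @ np.linalg.solve(Psi.T @ D @ Psi, Psi.T @ D @ r)
        wins += (e_td @ D @ e_td) < (e_br @ D @ e_br)
    return wins / trials

print(td_win_ratio())  # proportion of problems where TD beats BR (cf. Figure 2)
```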
Figure 3 presents the proportion of times the bounds we have presented in Proposition 3 correctly guess which method is the best (i.e. the expectation of the indicator function of $[e_{TD} < e_{BR}] = [b_{TD} < b_{BR}]$). Unless the feature space dimension is close to the state space dimension, the bounds do not appear very useful for such a decision.

Figure 3. Prediction of the best method through Prop. 3.

Figure 4 displays the expectation of $e_{TD}/e_{BR}$. One can observe that, on average, this expectation is bigger than 1, that is, the BR error tends to be smaller, on average, than the TD error. This may look contradictory with our interpretation of Figure 2, but the explanation is the following: when the BR method is better than the TD method, it is by a larger gap than when it is the other way round. We believe this corresponds to the situations where the TD method is unstable.

Figure 4. Expectation of $e_{TD}/e_{BR}$.

Figure 5 allows us to confirm this point: it shows the expectation of the relative approximation errors with respect to the best possible error, that is, the expectation of $e_{TD}/e$ and $e_{BR}/e$. One observes on all charts that this average relative quality of the TD fix point has lots of spikes (corresponding to numerical instabilities), while that of the BR method is smooth.

Figure 5. (Left) Expectation of $e_{TD}/e$ and (Right) of $e_{BR}/e$.

6. Conclusion and Future Work

We have presented the TD fix point and the BR minimization methods for approximating the value of some MDP fixed policy. We have described two original examples: in the former, the BR method is consistently better than the TD method, while the latter (which generalizes the spirit of the example of Sutton et al. (2009)) is best treated by TD. Proposition 1 highlights the close relation between the objective criteria that correspond to both methods: it shows that minimizing the BR implies minimizing the TD error and some extra "adequacy" term, which happens to be crucial for numerical stability. Our main contribution, stated in Proposition 2, provides a new viewpoint for comparing the two projection methods, and potential ideas for alternatives. Both TD and BR can be characterized as solving a projected fixed point equation, and this is to our knowledge new for BR. Also, the solutions of both methods are some oblique projection of the value $v$, and this is to our knowledge new for TD and BR. Finally, this simple geometric characterization allows us to derive some tight error bounds (Proposition 3). We have discussed the close relations of our results with those of Schoknecht (2002) and Yu & Bertsekas (2008), and argued that our work simplifies and extends them.

Though apparently new to the Reinforcement Learning community, the very idea of oblique projections of fixed point equations has been studied in the Numerical Analysis community (see e.g. Saad (2003)). In the future, we plan to study this literature more carefully, and particularly to investigate whether it may further contribute to the MDP context.

Concerning the practical question of choosing among the two methods TD and BR, the situation can be summarized as follows: the BR method is sounder than the TD method, since the former has a performance guarantee while the latter will never have one in general. Extensive simulations (on random chain-like problems of size up to 30 states, and for many projections of all the possible space sizes) further suggest the following facts: (a) the TD solution is more often better than the BR solution; (b) however, TD sometimes fails dramatically; (c) overall, this makes BR better on average. Equivalently, one may say that TD is more risky than BR.

Even if TD is more risky, there remain several reasons why one may want to use it in practice, which our study did not focus on. In large scale problems, one usually estimates the $m \times m$ linear systems through sampling. Sampling-based methods for BR are more constraining, since they generally require double sampling. Independently, the fact, highlighted by Proposition 1, that the BR is an upper bound of the TD error suggests two things. First, we believe that the variance of the BR problem is higher than that of the TD problem; thus, given a fixed amount of samples, the TD solution might be less affected by the corresponding stochastic noise than the BR one.
More generally, the BR problem may be harder to solve than the TD problem, and from a numerical viewpoint, the latter may provide better solutions. Finally, we only discussed the TD(0) fix point method, that is, the specific variant of TD($\lambda$) (Bertsekas & Tsitsiklis, 1996; Boyan, 2002) where $\lambda = 0$. Values of $\lambda > 0$ solve some of the weaknesses of TD(0): it can be shown that the stability issues disappear for values of $\lambda$ close to 1, and the optimal projection $\hat v_{best}$ is obtained when $\lambda = 1$. Further analytical and empirical comparisons of TD($\lambda$) with the algorithms we have considered here (and with some "BR($\lambda$)" algorithm) constitute future research.

Lastly, a somewhat disappointing observation of our study is that the bounds of Proposition 3, which are the tightest possible bounds independent of the reward function, did not prove useful for deciding a priori which of the two methods one should trust more (recall the results shown in Figure 3). Extending them in a way that would take the reward into account, as well as trying to exploit our original unified vision of the bounds (Propositions 2 and 3), are potential tracks for improvement.

Acknowledgments

The author would like to thank Janey Yu for helpful discussions, and the anonymous reviewers for providing comments that helped to improve the presentation of the paper.

References

Antos, A., Szepesvári, C., and Munos, R. Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89-129, 2008.

Bertsekas, D.P. and Tsitsiklis, J.N. Neuro-Dynamic Programming. Athena Scientific, 1996.

Boyan, J. A. Technical update: Least-squares temporal difference learning. Machine Learning, 49:233-246, 2002.

Farahmand, A.M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S. Regularized policy iteration. In NIPS, 2008.

Gordon, G. Stable function approximation in dynamic programming. In ICML, 1995.

Guestrin, C., Koller, D., and Parr, R. Max-norm projections for factored mdps. In IJCAI, 2001.

Lagoudakis, M. G. and Parr, R. Least-squares policy iteration. JMLR, 4:1107-1149, 2003.

Munos, R. Error bounds for approximate policy iteration. In ICML, 2003.

Munos, Rémi and Szepesvári, Csaba. Finite-time bounds for fitted value iteration. JMLR, 9:815-857, 2008. ISSN 1532-4435.

Saad, Y. Iterative Methods for Sparse Linear Systems, 2nd edition. SIAM, Philadelphia, PA, 2003.

Schoknecht, R. Optimality of reinforcement learning algorithms with linear function approximation. In NIPS, pp. 1555-1562, 2002.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In ICML, 2009.

Szyld, D.B. The many proofs of an identity on the norm of oblique projections. Numerical Algorithms, 42:309-323, 2006.

Thompson, A.C. Minkowski Geometry. Cambridge University Press, 1996.

Tsitsiklis, J.N. and Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

Williams, R. J. and Baird, L. C. Tight performance bounds on greedy policies based on imperfect value functions. Technical report, College of Computer Science, Northeastern University, 1993.

Yu, H. and Bertsekas, D.P. New error bounds for approximations from projected linear equations. Technical Report C-2008-43, Dept. Computer Science, Univ. of Helsinki, July 2008.