A Semiparametric Statistical Approach to Model-Free Policy Evaluation

Tsuyoshi Ueno (tsuyos-u@sys.i.kyoto-u.ac.jp)
Motoaki Kawanabe (motoaki.kawanabe@first.fraunhofer.de)
Takeshi Mori (tak-mori@sys.i.kyoto-u.ac.jp)
Shin-ichi Maeda (ichi@sys.i.kyoto-u.ac.jp)
Shin Ishii (ishii@i.kyoto-u.ac.jp)

Graduate School of Informatics, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
Fraunhofer FIRST, IDA, Kekuléstr. 7, 12489 Berlin, Germany

Abstract

Reinforcement learning (RL) methods based on least-squares temporal difference (LSTD) have been developed recently and have shown good practical performance. However, the quality of their estimation has not been well elucidated. In this article, we discuss LSTD-based policy evaluation from the new viewpoint of semiparametric statistical inference. In fact, the estimator can be obtained from a particular estimating function which guarantees its convergence to the true value asymptotically, without specifying a model of the environment. Based on these observations, we 1) analyze the asymptotic variance of an LSTD-based estimator, 2) derive the optimal estimating function with the minimum asymptotic estimation variance, and 3) derive a suboptimal estimator to reduce the computational burden in obtaining the optimal estimating function.

1. Introduction

Reinforcement learning (RL) is a machine learning framework based on reward-related interactions with environments (Sutton & Barto, 1998). In many RL methods, policy evaluation, in which a value function is estimated from sample trajectories, is an important step for improving a current policy. Since RL problems often involve high-dimensional state spaces, the value functions are often approximated by low-dimensional parametric models. Linear function approximation has mostly been used because of its simplicity and computational convenience. To estimate the value function with a linear model, an online procedure called temporal difference (TD) learning (Sutton & Barto, 1998) and a batch procedure called least-squares temporal difference (LSTD) learning (Bradtke & Barto, 1996) are widely used. LSTD can achieve fast learning because it uses entire sample trajectories simultaneously. Recently, efficient procedures for policy improvement combined with policy evaluation by LSTD have been developed and have shown good performance in realistic problems. For example, the least-squares policy iteration (LSPI) method maximizes the Q-function estimated by LSTD (Lagoudakis & Parr, 2003), and the natural actor-critic (NAC) algorithm uses the natural policy gradient obtained by LSTD (Peters et al., 2005). Although variance reduction techniques have been proposed for other RL algorithms (Greensmith et al., 2004; Mannor et al., 2007), the important issue of how to evaluate and reduce the estimation variance of LSTD learning remains unresolved.

In this article, we discuss LSTD-based policy evaluation in the framework of semiparametric statistical inference, which is new to the RL field. Estimation of linearly-represented value functions can be formulated as a semiparametric inference problem, where the statistical model includes not only the parameters of interest but also additional nuisance parameters with innumerable degrees of freedom (Godambe, 1991; Amari & Kawanabe, 1997; Bickel et al., 1998). We approach this problem by using estimating functions, which provide a well-established method for semiparametric estimation (Godambe, 1991). We then show that the instrumental variable method, a technique used in LSTD learning, can be constructed from an estimating function, which guarantees its consistency (asymptotic lack of bias) by definition. As the main results, we show the asymptotic estimation variance of a general instrumental variable method (Lemma 2) and the optimal estimating function that yields the minimum asymptotic variance of the estimation (Theorem 1). We also derive a suboptimal instrumental variable, based on the idea of the c-estimator (Amari & Kawanabe, 1997), to reduce the computational difficulty of estimating the optimal instrumental variable (Theorem 2). As a proof of concept, we compare the mean squared error (MSE) of our new estimators with that of LSTD on a simple example of a Markov decision process (MDP).
2. Background

2.1. MDPs and Policy Evaluation

RL is an approach to finding an optimal policy for sequential decision-making in an unknown environment. We consider a finite MDP, which is defined as a quadruple $(\mathcal{S}, \mathcal{A}, p, r)$: $\mathcal{S}$ is a finite set of states; $\mathcal{A}$ is a finite set of actions; $p(s_{t+1}|s_t, a_t)$ is the transition probability to a next state $s_{t+1}$ when taking an action $a_t$ at state $s_t$; and $r(s_t, a_t, s_{t+1})$ is a reward received with the state transition. Let $\pi(s_t, a_t) = p(a_t|s_t)$ be a stochastic policy that the agent follows. We introduce the following assumption concerning the MDP.

Assumption 1. An MDP has a stationary state distribution $d^{\pi}(s) = p(s)$ under the policy $\pi(s_t, a_t)$.

There are two major choices in the definition of the state value function: discounted reward accumulation and average reward (Bertsekas & Tsitsiklis, 1996). With the former choice, the value function is defined as
$$V^{\pi}(s) := \sum_{t=0}^{\infty} E^{\pi}\big[\gamma^{t} r_{t+1} \,\big|\, s_0 = s\big], \qquad (1)$$
where $E^{\pi}[\cdot \,|\, s_0 = s]$ is the expectation with respect to the sample trajectory conditioned on $s_0 = s$, $r_{t+1} := r(s_t, a_t, s_{t+1})$, and $\gamma \in [0, 1)$ is the discount factor. With the latter choice, on the other hand, the value function is defined as
$$V^{\pi}(s) := \sum_{t=0}^{\infty} E^{\pi}\big[r_{t+1} - \bar{r} \,\big|\, s_0 = s\big], \qquad (2)$$
where
$$\bar{r} := \sum_{s \in \mathcal{S}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} d^{\pi}(s)\, \pi(s, a)\, p(s'|s, a)\, r(s, a, s')$$
denotes the average reward over the stationary distribution. According to the Bellman equation, eq. (2) can be rewritten as
$$V^{\pi}(s_t) = \sum_{s_{t+1} \in \mathcal{S}} p^{\pi}(s_{t+1}|s_t)\, \bar{r}(s_t, s_{t+1}) - \bar{r} + \sum_{s_{t+1} \in \mathcal{S}} p^{\pi}(s_{t+1}|s_t)\, V^{\pi}(s_{t+1}), \qquad (3)$$
where $p^{\pi}(s_{t+1}|s_t) := \sum_{a_t \in \mathcal{A}} \pi(s_t, a_t)\, p(s_{t+1}|s_t, a_t)$ and
$$\bar{r}(s_t, s_{t+1}) := \frac{\sum_{a_t \in \mathcal{A}} \pi(s_t, a_t)\, p(s_{t+1}|s_t, a_t)\, r(s_t, a_t, s_{t+1})}{p^{\pi}(s_{t+1}|s_t)}.$$

Throughout this article, we assume that the linear function approximation is faithful, and discuss only the asymptotic estimation variance. (In general cases, the bias becomes non-negligible and the selection of basis functions is more important.)

Assumption 2. The value function can be represented as a linear function of some features:
$$V^{\pi}(s_t) = \phi(s_t)^{\top} \theta^* = \phi_t^{\top} \theta^*, \qquad (4)$$
where $\phi(s): \mathcal{S} \rightarrow \mathbb{R}^m$ is a feature vector and $\theta^* \in \mathbb{R}^m$ is a parameter vector. Here, the symbol $\top$ denotes a transpose and the dimensionality $m$ of the feature vector is smaller than the number of states $|\mathcal{S}|$.

Substituting eq. (4) into eq. (3), we obtain the following equation:
$$\Big(\phi_t - \sum_{s_{t+1} \in \mathcal{S}} p^{\pi}(s_{t+1}|s_t)\, \phi_{t+1}\Big)^{\top} \theta^* = \sum_{s_{t+1} \in \mathcal{S}} p^{\pi}(s_{t+1}|s_t)\, \bar{r}(s_t, s_{t+1}) - \bar{r}. \qquad (5)$$

When the matrix
$$E\Big[\Big(\phi_t - \sum_{s_{t+1} \in \mathcal{S}} p^{\pi}(s_{t+1}|s_t)\, \phi_{t+1}\Big)\Big(\phi_t - \sum_{s_{t+1} \in \mathcal{S}} p^{\pi}(s_{t+1}|s_t)\, \phi_{t+1}\Big)^{\top}\Big]$$
is non-singular and $p^{\pi}(s_{t+1}|s_t)$ is known, we can easily obtain the parameter $\theta^*$. However, since $p^{\pi}(s_{t+1}|s_t)$ is unknown in normal RL settings, we have to estimate this parameter from the sample trajectory $\{s_0, a_0, r_1, \cdots, s_{N-1}, a_{N-1}, r_N\}$ alone, instead of using it directly.
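To make the role of eq. (5) concrete: when $p^{\pi}$ is known, eq. (5) is simply a linear system in $\theta^*$. The following sketch is an illustration only (not from the paper); the small random MDP, the random policy, and the trick of placing the true differential value function into the feature matrix so that Assumption 2 holds exactly are all assumptions made for the example.

```python
# A minimal sketch (not from the paper): solving the Bellman system (5) for theta*
# when the transition model p^pi is known, in the average-reward setting.
# Feature column 0 is set to the true differential value function so that
# Assumption 2 holds exactly for this made-up MDP.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, m = 5, 2, 3

P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # p(s'|s,a)
R = rng.normal(size=(nS, nA, nS))                  # r(s,a,s')
pi = rng.dirichlet(np.ones(nA), size=nS)           # pi(s,a)

P_pi = np.einsum('sa,sax->sx', pi, P)              # p^pi(s'|s)
r_pi = np.einsum('sa,sax,sax->s', pi, P, R)        # expected one-step reward

# Stationary distribution d^pi and average reward r_bar
evals, evecs = np.linalg.eig(P_pi.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
d = d / d.sum()
r_bar = d @ r_pi

# True differential value function: solve the Poisson equation
# (I - P^pi) v = r_pi - r_bar with the normalization d @ v = 0.
A_poisson = np.vstack([np.eye(nS) - P_pi, d])
v_true = np.linalg.lstsq(A_poisson, np.append(r_pi - r_bar, 0.0), rcond=None)[0]

# Features satisfying Assumption 2 (v_true lies in their span)
Phi = np.column_stack([v_true, rng.normal(size=(nS, m - 1))])

# Stack eq. (5) over states: (Phi - P^pi Phi) theta = r_pi - r_bar
A = Phi - P_pi @ Phi
b = r_pi - r_bar
theta_star = np.linalg.lstsq(A, b, rcond=None)[0]

print(np.allclose(Phi @ theta_star, v_true))       # value function recovered
```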
Eq. (5) can be rewritten as
$$y_t = x_t^{\top} \theta^* + \epsilon_t, \qquad (6)$$
where $y_t$, $x_t$ and $\epsilon_t$ are defined as
$$y_t := r_{t+1} - \bar{r}, \qquad x_t := \phi_t - \phi_{t+1},$$
$$\epsilon_t := \Big(\phi_{t+1} - \sum_{s_{t+1} \in \mathcal{S}} p^{\pi}(s_{t+1}|s_t)\, \phi_{t+1}\Big)^{\top} \theta^* + r_{t+1} - \sum_{s_{t+1} \in \mathcal{S}} p^{\pi}(s_{t+1}|s_t)\, \bar{r}(s_t, s_{t+1}). \qquad (7)$$
When we use the discounted reward accumulation for the value function, eq. (6) also holds with
$$y_t := r_{t+1}, \qquad x_t := \phi_t - \gamma \phi_{t+1},$$
$$\epsilon_t := \gamma \Big(\phi_{t+1} - \sum_{s_{t+1} \in \mathcal{S}} p^{\pi}(s_{t+1}|s_t)\, \phi_{t+1}\Big)^{\top} \theta^* + r_{t+1} - \sum_{s_{t+1} \in \mathcal{S}} p^{\pi}(s_{t+1}|s_t)\, \bar{r}(s_t, s_{t+1}). \qquad (8)$$

Because $E[\epsilon_t] = 0$, eq. (6) can be seen as a linear regression problem, where $x$, $y$ and $\epsilon$ are an input, an output and observation noise, respectively (Bradtke & Barto, 1996). Note that
$$E[\epsilon_t\, g(s_t, s_{t-1}, \cdots, s_0)] = 0 \qquad (9)$$
holds for any function $g(s_t, s_{t-1}, \cdots, s_0)$ because of the Markov property.

The regression problem (6) has an undesirable property, however, which is known as an "error-in-variable problem" (Young, 1984): the input $x_t$ and the observation noise $\epsilon_t$ are mutually dependent. It is not easy to solve such an error-in-variable problem in a rigorous manner; in particular, the simple least-squares method lacks consistency. Therefore, LSTD learning has used the instrumental variable method (Bradtke & Barto, 1996), a standard method for the error-in-variable problem that employs an "instrumental variable" to remove the effects of the correlation between the input and the observation noise. When $X = [x_0, x_1, \cdots, x_{N-1}]^{\top}$ and $y = [y_0, y_1, \cdots, y_{N-1}]^{\top}$, the estimator of the instrumental variable method is given by
$$\hat{\theta} = [Z^{\top} X]^{-1} [Z^{\top} y], \qquad (10)$$
where $Z = [z_0, z_1, \cdots, z_{N-1}]^{\top}$, and $z_t$ is an instrumental variable that is assumed to be correlated with the input $x_t$ but uncorrelated with the observation noise $\epsilon_t$.
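Eq. (10) is a one-line computation once the triplets $(z_t, x_t, y_t)$ are formed from a trajectory. The sketch below is a minimal illustration (not the authors' code): it uses the feature vector itself as the instrumental variable, which is the LSTD choice discussed in Section 2.2, and, for the average-reward case, replaces the unknown $\bar{r}$ by the empirical mean reward; both choices are assumptions of this sketch.

```python
# A minimal sketch (not from the paper) of the instrumental-variable estimator (10).
# `phi` is an (N+1) x m array of features phi(s_0), ..., phi(s_N) and `rewards`
# holds r_1, ..., r_N for one trajectory; both are assumed given.
import numpy as np

def iv_estimator(Z, X, y):
    """theta_hat = (Z^T X)^{-1} Z^T y, eq. (10)."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

def lstd_average_reward(phi, rewards):
    y = rewards - rewards.mean()          # y_t = r_{t+1} - r_bar (empirical r_bar)
    X = phi[:-1] - phi[1:]                # x_t = phi_t - phi_{t+1}
    Z = phi[:-1]                          # LSTD choice: z_t = phi(s_t)
    return iv_estimator(Z, X, y)

def lstd_discounted(phi, rewards, gamma):
    y = rewards                           # y_t = r_{t+1}
    X = phi[:-1] - gamma * phi[1:]        # x_t = phi_t - gamma * phi_{t+1}
    Z = phi[:-1]
    return iv_estimator(Z, X, y)

# Usage with made-up data: 200 steps, 3-dimensional features.
rng = np.random.default_rng(1)
phi = rng.normal(size=(201, 3))
rewards = rng.normal(size=200)
print(lstd_discounted(phi, rewards, gamma=0.9))
```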
2.2. Semiparametric Model and Estimating Functions

In the error-in-variable problem, if it is possible to assume a reasonable model with a small number of parameters for the joint input-output probability $p(x, y)$, a proper estimator with consistency can be obtained by the maximum likelihood method. Since the transition probability $p^{\pi}(s_{t+1}|s_t)$ is unknown and usually difficult to estimate, it is practically impossible to construct such a parametric model. Let $k_x$ and $k_{\epsilon}$ be parameters which characterize the input distribution $p(x)$ and the conditional distribution $p(y|x)$ of the output $y$ given $x$, respectively. Then, the joint distribution becomes
$$p(x, y; \theta, k_x, k_{\epsilon}) = p(y|x; \theta, k_{\epsilon})\, p(x; k_x). \qquad (11)$$
We would like to estimate the parameter $\theta$ representing the value function in the presence of the extra unknowns $k_x$ and $k_{\epsilon}$, which can have innumerable degrees of freedom. Statistical models which contain such (possibly infinite-dimensional) nuisance parameters in addition to the parameters of interest are called semiparametric (Bickel et al., 1998). In semiparametric inference, one established way of estimating the parameters is to employ an estimating function (Godambe, 1991), which can give a consistent estimator of $\theta$ without estimation of the nuisance parameters $k_x$ and $k_{\epsilon}$. We begin with a short overview of estimating functions in the simple i.i.d. case, and then discuss the Markov chain case.

We consider a general semiparametric model $p(x; \theta, \xi)$, where $\theta$ is an $m$-dimensional parameter and $\xi$ is a nuisance parameter. An $m$-dimensional vector function $f(x; \theta)$ is called an estimating function when it satisfies the following conditions for any $\theta$ and $\xi$:
$$E[f(x; \theta) \,|\, \theta, \xi] = 0, \qquad (12)$$
$$\det \Big| E\Big[\tfrac{\partial}{\partial \theta^{\top}} f(x; \theta) \,\Big|\, \theta, \xi\Big] \Big| \neq 0, \qquad (13)$$
$$E[\,\|f(x; \theta)\|^2 \,|\, \theta, \xi\,] < \infty, \qquad (14)$$
where $E[\cdot \,|\, \theta, \xi]$ denotes the expectation with respect to $x$, which obeys the distribution $p(x; \theta, \xi)$. The notations $\det|\cdot|$ and $\|\cdot\|$ denote the determinant and the Euclidean norm, respectively.

Suppose that i.i.d. samples $\{x_0, x_1, \cdots, x_{N-1}\}$ are obtained from the true model $p(x; \theta = \theta^*, \xi = \xi^*)$. If there is an estimating function $f$, by solving the estimating equation
$$\sum_{t=0}^{N-1} f(x_t; \hat{\theta}) = 0, \qquad (15)$$
we can obtain an estimator $\hat{\theta}$ with good asymptotic properties. A solution of eq. (15) is called an "M-estimator" in statistics; the M-estimator is consistent, i.e., it converges to the true parameter $\theta^*$ regardless of the true nuisance parameter $\xi^*$ as the sample size $N$ goes to infinity. In addition, the asymptotic variance $\mathrm{AV}[\hat{\theta}]$ is given by
$$\mathrm{AV}[\hat{\theta}] = E[(\hat{\theta} - \theta^*)(\hat{\theta} - \theta^*)^{\top}] = \frac{1}{N} A^{-1} M A^{-\top}, \qquad (16)$$
where $A = E\big[\tfrac{\partial}{\partial \theta^{\top}} f(x; \theta^*) \,\big|\, \theta^*, \xi^*\big]$ and $M = E[f(x; \theta^*) f(x; \theta^*)^{\top} \,|\, \theta^*, \xi^*]$. The symbol $-\top$ denotes the transpose of the inverse matrix. We omit the time index $t$ unless it is necessary for clarity. Note that the asymptotic variance AV depends on the true parameters $\theta^*$ and $\xi^*$, not on the samples $\{x_0, x_1, \cdots, x_{N-1}\}$.

The notion of estimating functions can be extended to cases in which samples are given by a certain stochastic process (Godambe, 1985). In the semiparametric model for policy evaluation, under Assumption 1, there exist sufficient conditions for estimating functions which are almost the same as eqs. (12)-(14). The instrumental variable method is a type of estimating function method for semiparametric problems where the unknown distribution is given by eq. (11).

Lemma 1. Suppose $\{x_t, y_t\}$ is given by eq. (7) or (8), and $z_t$ is a function of $\{s_t, \cdots, s_{t-T}\}$. If $E[z_t x_t^{\top}]$ is nonsingular and $E[\,\|z_t (x_t^{\top}\theta - y_t)\|^2\,]$ is finite, then
$$z_t (x_t^{\top}\theta - y_t) \qquad (17)$$
is an estimating function for the parameter $\theta$. Therefore, the estimating equation is given by
$$\sum_{t=0}^{N-1} z_t (x_t^{\top}\theta - y_t) = 0. \qquad (18)$$

Proof. For all $t$, the conditions corresponding to (13) and (14) are satisfied by the assumptions, and the condition (12) is satisfied as $E[z_t (x_t^{\top}\theta^* - y_t)] = -E[z_t \epsilon_t] = 0$ from the property in eq. (9). (Q.E.D.)

LSTD is exactly an instrumental variable method in which the feature vector $z_t = \phi(s_t) = \phi_t$ is used as the instrumental variable:
$$f_{\mathrm{LSTD}} = \phi_t (x_t^{\top}\theta - y_t). \qquad (19)$$

The solution of the estimating equation is an M-estimator, and its asymptotic variance is given as follows.

Lemma 2. Let $z_t$ be a function of $\{s_t, \cdots, s_{t-T}\}$ satisfying the two conditions in Lemma 1, and let $\epsilon^*_t := y_t - x_t^{\top}\theta^*$ be the residual for the true parameter $\theta^*$. Then, the solution $\hat{\theta}$ of the estimating equation (18) has the asymptotic variance
$$\mathrm{AV}[\hat{\theta}] = \frac{1}{N} A_{\mathrm{IV}}^{-1} M_{\mathrm{IV}} A_{\mathrm{IV}}^{-\top}, \qquad (20)$$
where $A_{\mathrm{IV}} = E_d[z_t x_t^{\top}]$ and $M_{\mathrm{IV}} = E_d[(\epsilon^*_t)^2 z_t z_t^{\top}]$. $E_d[\cdot]$ denotes the expectation when the sample trajectory starts from the stationary distribution $d^{\pi}(s_0)$. (We remark that the definitions of $A_{\mathrm{IV}}$ and $M_{\mathrm{IV}}$ do not depend on $t$; nevertheless, we keep the time index $t$ for clarity.)

Proof. The estimating equation (18) can be expressed as
$$Z^{\top} y = Z^{\top} X \hat{\theta}, \qquad (21)$$
where $Z = [z_0, \cdots, z_{N-1}]^{\top}$, $X = [x_0, \cdots, x_{N-1}]^{\top}$ and $y = [y_0, \cdots, y_{N-1}]^{\top}$. On the other hand, from eq. (6), the left-hand side of eq. (21) is equal to $Z^{\top} X \theta^* + Z^{\top} \epsilon$, where $\epsilon = [\epsilon^*_0, \cdots, \epsilon^*_{N-1}]^{\top}$. Thus the asymptotic variance of the estimator $\hat{\theta}$ is obtained as
$$E[(\hat{\theta} - \theta^*)(\hat{\theta} - \theta^*)^{\top}] = E[(Z^{\top}X)^{-1} Z^{\top} \epsilon\, \epsilon^{\top} Z (X^{\top}Z)^{-1}] \rightarrow \frac{1}{N^2}\, A_{\mathrm{IV}}^{-1}\, E[Z^{\top} \epsilon\, \epsilon^{\top} Z]\, A_{\mathrm{IV}}^{-\top},$$
where we used the fact that the matrix $Z^{\top}X$ has the limit
$$\frac{1}{N} (Z^{\top} X) = \frac{1}{N} \sum_{t=0}^{N-1} z_t x_t^{\top} \rightarrow A_{\mathrm{IV}}.$$
Also, the matrix $E[Z^{\top} \epsilon\, \epsilon^{\top} Z]$ has the following limit:
$$\frac{1}{N} E[Z^{\top} \epsilon\, \epsilon^{\top} Z] = \frac{1}{N} \sum_{t=0}^{N-1} E[(\epsilon^*_t)^2 z_t z_t^{\top}] \rightarrow M_{\mathrm{IV}},$$
where we used the property in eq. (9), by which the cross terms between different time steps vanish. Therefore, we have
$$E[(\hat{\theta} - \theta^*)(\hat{\theta} - \theta^*)^{\top}] \rightarrow \frac{1}{N} A_{\mathrm{IV}}^{-1} M_{\mathrm{IV}} A_{\mathrm{IV}}^{-\top}. \quad \text{(Q.E.D.)}$$
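The sandwich form in eqs. (16) and (20) can be estimated from data by plugging sample averages and empirical residuals into $A_{\mathrm{IV}}$ and $M_{\mathrm{IV}}$. The following sketch shows one such plug-in estimate; the paper does not prescribe this estimator, so the particular plug-in choices are assumptions of the example.

```python
# A minimal sketch (not from the paper): plug-in estimate of the asymptotic
# covariance (20) of an instrumental-variable estimator. Z and X are N x m
# arrays of z_t and x_t, y has length N; residuals are taken at theta_hat.
import numpy as np

def iv_estimator(Z, X, y):
    return np.linalg.solve(Z.T @ X, Z.T @ y)

def iv_asymptotic_cov(Z, X, y):
    N = len(y)
    theta = iv_estimator(Z, X, y)
    eps = y - X @ theta                          # empirical residuals
    A = (Z.T @ X) / N                            # estimate of A_IV = E_d[z x^T]
    M = (Z * eps[:, None] ** 2).T @ Z / N        # estimate of M_IV = E_d[eps^2 z z^T]
    A_inv = np.linalg.inv(A)
    return A_inv @ M @ A_inv.T / N               # eq. (20)

# Usage with made-up data.
rng = np.random.default_rng(2)
N, m = 500, 3
X = rng.normal(size=(N, m))
Z = X + 0.1 * rng.normal(size=(N, m))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=N)
print(np.diag(iv_asymptotic_cov(Z, X, y)))       # estimated variances of theta_hat
```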
To summarize, if we have an instrumental variable which satisfies the assumptions in Lemmas 1 and 2, we can obtain an M-estimator from the estimating equation (18) with the asymptotic variance given by eq. (20). When more than one instrumental variable exists, it is appropriate to choose the one whose estimator has the minimum asymptotic variance.

3. Main Results

In this section, we show that estimating functions for the semiparametric model of policy evaluation are limited to the type of equation used in the instrumental variable method. Furthermore, we derive the optimal instrumental variable, which has the minimum asymptotic variance of the estimation.

We first remark on an invariance property of the instrumental variable method.

Lemma 3. The value function estimate $\hat{V}(s_t) = \phi_t^{\top}\hat{\theta} = \phi_t^{\top}[Z^{\top}X]^{-1}[Z^{\top}y]$ is invariant with respect to the application of any regular linear transformation to either the instrumental variable $z_t$ or the basis functions $\phi_t$.

Proof. Assume that the instrumental variable and the basis functions are transformed by any regular matrices $W_z$ and $W_{\phi}$ as $\tilde{z}_t = W_z z_t$ and $\tilde{\phi}_t = W_{\phi} \phi_t$. Noting that the linear transformation of $\phi_t$ yields the linear transformation of the input, $\tilde{x}_t = W_{\phi} x_t$, the estimator of the instrumental variable method given by eq. (10) becomes $\hat{\tilde{\theta}} = [\tilde{Z}^{\top}\tilde{X}]^{-1}[\tilde{Z}^{\top}y] = W_{\phi}^{-\top}\hat{\theta}$. This means that the estimated value function is invariant: $\tilde{\phi}_t^{\top}\hat{\tilde{\theta}} = \phi_t^{\top}\hat{\theta}$. (Q.E.D.)

When the basis functions span the whole space of functions of the state, any set of basis functions can be represented by applying a linear transformation to another set of basis functions. This observation leads to the following corollary.

Corollary 1. When the basis functions $\phi_t$ span the whole space of functions of the state, the value function estimate is invariant with respect to the choice of basis functions and of the instrumental variable.

An instrumental variable may depend not only on the current state $s_t$ but also on the previous states $\{s_{t-1}, \cdots, s_{t-T}\}$, because such an instrumental variable does not violate the condition $\mathrm{cov}[z_t, \epsilon_t] = 0$. However, we do not need to consider such instrumental variables, as the following lemma shows.

Lemma 4. Let $z_t(s_t, \cdots, s_{t-T})$ be any instrumental variable depending on the current and previous states which satisfies the conditions in Lemmas 1 and 2. Then, there is an instrumental variable depending only on the current state whose corresponding estimator has an equal or smaller asymptotic variance.

Proof. We show that the conditional expectation $\tilde{z}_t = E[z_t|s_t]$, which depends only on the current state $s_t$, gives an equally good or better estimator. The matrices in the asymptotic variance, eq. (20), can be calculated as
$$A_z = E_d[z_t x_t^{\top}] = A_{\tilde{z}} + E_d[(z_t - \tilde{z}_t) x_t^{\top}] = A_{\tilde{z}},$$
$$M_z = E_d[(\epsilon^*_t)^2 (\tilde{z}_t + z_t - \tilde{z}_t)(\tilde{z}_t + z_t - \tilde{z}_t)^{\top}] = M_{\tilde{z}} + E_d[(\epsilon^*_t)^2 (z_t - \tilde{z}_t)(z_t - \tilde{z}_t)^{\top}],$$
where we have used the Markov property (cf. eq. (9)) to eliminate the cross terms. This implies that
$$\mathrm{AV}[\hat{\theta}_z] = \frac{1}{N} A_z^{-1} M_z A_z^{-\top} \succeq \frac{1}{N} A_{\tilde{z}}^{-1} M_{\tilde{z}} A_{\tilde{z}}^{-\top} = \mathrm{AV}[\hat{\theta}_{\tilde{z}}]. \quad \text{(Q.E.D.)}$$
Here, the inequality $\succeq$ denotes positive semidefiniteness of the difference.
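Before moving on, the invariance property of Lemma 3 is easy to verify numerically. The sketch below uses made-up data (the random features, inputs, targets, and transformation matrices are all assumptions) and checks that the fitted values $\phi_t^{\top}\hat{\theta}$ are unchanged under regular linear transformations of the basis functions and the instrumental variable.

```python
# A minimal sketch (not from the paper) checking Lemma 3: regular linear
# transformations W_phi, W_z of the basis functions and the instrumental
# variable leave the fitted values phi_t^T theta_hat unchanged.
import numpy as np

rng = np.random.default_rng(3)
N, m = 200, 3

Phi = rng.normal(size=(N, m))                 # phi_t (rows)
X = rng.normal(size=(N, m))                   # x_t (under eq. (7), X is built from Phi)
Z = rng.normal(size=(N, m))                   # z_t
y = rng.normal(size=N)                        # y_t

def fitted_values(Phi, Z, X, y):
    theta = np.linalg.solve(Z.T @ X, Z.T @ y)  # eq. (10)
    return Phi @ theta

W_phi = rng.normal(size=(m, m))               # regular with probability one
W_z = rng.normal(size=(m, m))

v0 = fitted_values(Phi, Z, X, y)
v1 = fitted_values(Phi @ W_phi.T, Z @ W_z.T, X @ W_phi.T, y)   # transformed problem
print(np.allclose(v0, v1))                    # True: estimated values are unchanged
```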
Now, we consider the general form of estimating functions for inference of the value function. In the following, we consider only "admissible" estimating functions; more precisely, we discard "inadmissible" estimating functions whose estimators are always inferior to those of other estimating functions in the sense of asymptotic variance. To simplify the analysis, we consider only the limited set of estimating functions which are defined on a one-step sample $\{s, a, s'\}$.

Proposition 1. For the semiparametric model of eqs. (6), (7) or eqs. (6), (8), all admissible estimating functions of only the one-step sample $\{s, a, s'\}$ must have the form $f = z(y - x^{\top}\theta)$, where $z$ is any function that does not depend on $s'$ or $a$ (i.e., a function of $s$ only) and satisfies the assumptions in Lemma 1.

Proof. Due to the space limitation, we only outline the proof. To be an estimating function, the function $f$ must satisfy
$$E_d[f] = \sum_{s \in \mathcal{S}} d^{\pi}(s) \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(s'|s, a)\, \pi(s, a)\, f(s, a, s') = 0.$$
Because we can prove that the stationary distribution $d^{\pi}(s)$ can take any probability vector, $\sum_{s \in \mathcal{S}} d^{\pi}(s)\, v(s) = 0$ implies that $v(s) = 0$ for any state $s$, where
$$v(s) := \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(s'|s, a)\, \pi(s, a)\, f(s, a, s').$$
Furthermore, the Bellman equation (6) holds whatever $p(s'|s, a)$ is. To fulfil $v = 0$, $f$ must have the form $f = z(y - x^{\top}\theta) + h$, where $z$ does not depend on $s'$ or $a$, and $h$ is any function that satisfies $\sum_{a \in \mathcal{A}} \pi(s, a)\, h(s, a, s') = 0$. However, the addition of such a function $h$ necessarily enlarges the asymptotic variance of the estimation. Therefore, the admissible estimating functions are restricted to the form $f = z(y - x^{\top}\theta)$. (Q.E.D.)

We are currently working on the conjecture that Proposition 1 can be extended to general estimating functions that depend on all previous states and actions. If this is true, then from Lemma 4 it is sufficient to consider the instrumental variable method with $z_t$ depending only on the current state $s_t$ for this semiparametric inference problem. Therefore, we next discuss the optimal instrumental variable of this type in terms of asymptotic variance, which corresponds to the optimal estimating function.

Theorem 1. The instrumental variable that gives the minimum asymptotic variance is
$$z^*_t = \begin{cases} E[(\epsilon^*_t)^2 \,|\, s_t]^{-1} \big(\phi_t - \gamma\, E[\phi_{t+1}|s_t]\big) & \text{(discounted reward accumulation)} \\ E[(\epsilon^*_t)^2 \,|\, s_t]^{-1} \big(\phi_t - E[\phi_{t+1}|s_t]\big) & \text{(average reward).} \end{cases} \qquad (22)$$

The proof is given in Appendix A. Note that the definition of the optimal instrumental variable includes both the residual $\epsilon^*_t$ and the conditional expectations $E[\phi_{t+1}|s_t]$ and $E[(\epsilon^*_t)^2|s_t]$. To make this estimator practical, we replace the residual $\epsilon^*_t$ with that of the LSTD estimator, and approximate the expectations $E[\phi_{t+1}|s_t]$ and $E[(\epsilon^*_t)^2|s_t]$ by using function approximation. We call this procedure "gLSTD learning" (see Algorithm 1 for its pseudo code).

Algorithm 1 (pseudo code of gLSTD):
  Input: sample sequence $D = \{s_0, r_1, \cdots, s_{N-1}, r_N\}$ and basis functions $\phi$.
  1. Calculate the initial parameter and its residuals:
     $\hat{\theta}_0 \leftarrow \big[\sum_{t=0}^{N-1} \phi_t x_t^{\top}\big]^{-1} \sum_{t=0}^{N-1} \phi_t y_t$, $\quad \hat{\epsilon}_t \leftarrow x_t^{\top}\hat{\theta}_0 - y_t$.
  2. Calculate the estimators $\hat{E}[(\epsilon_t)^2|s_t]$ and $\hat{E}[x_t|s_t]$ of the conditional expectations and construct the instrumental variable:
     $\hat{z}_t \leftarrow \hat{E}[(\epsilon_t)^2|s_t]^{-1}\, \hat{E}[x_t|s_t]$.
  3. Calculate the parameter:
     $\hat{\theta}_g \leftarrow \big[\sum_{t=0}^{N-1} \hat{z}_t x_t^{\top}\big]^{-1} \sum_{t=0}^{N-1} \hat{z}_t y_t$.
  Return $\hat{\theta}_g$.
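For a discrete state space, the conditional expectations in Algorithm 1 can be estimated by per-state sample averages, which is what the experiment in Section 4 does. The sketch below is an illustration of Algorithm 1 under that choice (not the authors' code); the average-reward construction of $x_t$ and $y_t$, the reward convention, and the toy chain in the usage example are assumptions of this sketch.

```python
# A minimal sketch (not the authors' code) of Algorithm 1 (gLSTD) for a discrete
# state space, estimating E[(eps_t)^2 | s_t] and E[x_t | s_t] by per-state sample
# averages. states: length N+1 integer codes; rewards: length N; Phi: |S| x m.
import numpy as np

def glstd_average_reward(states, rewards, Phi):
    phi = Phi[states]                              # phi_t for t = 0..N
    y = rewards - rewards.mean()                   # y_t = r_{t+1} - r_bar (empirical)
    X = phi[:-1] - phi[1:]                         # x_t = phi_t - phi_{t+1}
    s = states[:-1]                                # s_t for t = 0..N-1

    # Initial LSTD estimate and residuals (z_t = phi_t)
    theta0 = np.linalg.solve(phi[:-1].T @ X, phi[:-1].T @ y)
    eps = X @ theta0 - y

    # Per-state conditional expectations and the instrumental variable (22)
    Z = np.empty_like(X)
    for state in np.unique(s):
        idx = (s == state)
        var_eps = np.mean(eps[idx] ** 2)           # estimate of E[(eps_t)^2 | s_t]
        x_mean = X[idx].mean(axis=0)               # estimate of E[x_t | s_t]
        Z[idx] = x_mean / var_eps
    return np.linalg.solve(Z.T @ X, Z.T @ y)       # theta_hat_g

# Usage with a made-up 4-state random walk; codes 0..3 stand for states 1..4,
# features are [s, s^2, s^3], and rewards are given on entering a state.
rng = np.random.default_rng(4)
N = 1000
states = np.zeros(N + 1, dtype=int)
for t in range(N):
    states[t + 1] = np.clip(states[t] + rng.choice([-1, 1]), 0, 3)
rewards = np.array([0.0, 1.0, 0.5, 0.0])[states[1:]]
Phi = np.array([[s, s**2, s**3] for s in range(1, 5)], dtype=float)
print(glstd_average_reward(states, rewards, Phi))
```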
To avoid estimating the functions of the current state, $E[\phi_{t+1}|s_t]$ and $E[(\epsilon^*_t)^2|s_t]$, which appear in the optimal instrumental variable (22), we can simply replace them by constants. When $z$ is an instrumental variable, the addition of any constant vector $c$ to $z$, $z' = z + c$, leads to another valid instrumental variable; by Lemma 1, it is easily confirmed that $f_c = (z_t + c)(x_t^{\top}\theta - y_t)$ is an estimating function. Therefore, obtaining the optimal constant $c$ yields a suboptimal instrumental variable within the class of instrumental variables produced by constant shifts.

Theorem 2. The optimal shift is given by
$$c^* := -\frac{E_d[(\epsilon^*_t)^2 z_t] - E_d[(\epsilon^*_t)^2 z_t z_t^{\top}]\, E_d[x_t z_t^{\top}]^{-1} E_d[x_t]}{E_d[(\epsilon^*_t)^2] - E_d[(\epsilon^*_t)^2 z_t^{\top}]\, E_d[x_t z_t^{\top}]^{-1} E_d[x_t]}. \qquad (23)$$

The proof is given in Appendix B. In eq. (23), however, the residual $\epsilon^*_t$ is again unknown; hence, we need to approximate it, too, as in gLSTD learning. We call this procedure "LSTDc learning" (see Algorithm 2 for its pseudo code, in which $z_t = \phi_t$ as in LSTD).

Algorithm 2 (pseudo code of LSTDc):
  Input: sample sequence $D = \{s_0, r_1, \cdots, s_{N-1}, r_N\}$ and basis functions $\phi$.
  1. Calculate the initial parameter and its residuals:
     $\hat{\theta}_0 \leftarrow \big[\sum_{t=0}^{N-1} \phi_t x_t^{\top}\big]^{-1} \sum_{t=0}^{N-1} \phi_t y_t$, $\quad \hat{\epsilon}_t \leftarrow x_t^{\top}\hat{\theta}_0 - y_t$.
  2. Construct the suboptimal instrumental variable with the optimal shift:
     $$\hat{c} \leftarrow -\frac{\sum_{t=0}^{N-1} \hat{\epsilon}_t^2 \phi_t - \big[\sum_{t=0}^{N-1} \hat{\epsilon}_t^2 \phi_t \phi_t^{\top}\big]\big[\sum_{t=0}^{N-1} x_t \phi_t^{\top}\big]^{-1} \sum_{t=0}^{N-1} x_t}{\sum_{t=0}^{N-1} \hat{\epsilon}_t^2 - \big[\sum_{t=0}^{N-1} \hat{\epsilon}_t^2 \phi_t^{\top}\big]\big[\sum_{t=0}^{N-1} x_t \phi_t^{\top}\big]^{-1} \sum_{t=0}^{N-1} x_t}, \qquad \hat{z}_t = \phi_t + \hat{c}.$$
  3. Calculate the parameter:
     $\hat{\theta}_c \leftarrow \big[\sum_{t=0}^{N-1} \hat{z}_t x_t^{\top}\big]^{-1} \sum_{t=0}^{N-1} \hat{z}_t y_t$.
  Return $\hat{\theta}_c$.
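The shift $\hat{c}$ in Algorithm 2 is the empirical version of eq. (23) with $z_t = \phi_t$ and expectations replaced by sample sums. The sketch below is an illustration (not the authors' code); the average-reward construction of $x_t$ and $y_t$ and the made-up data in the usage example are assumptions of this sketch.

```python
# A minimal sketch (not the authors' code) of Algorithm 2 (LSTDc): shift the LSTD
# instrumental variable phi_t by the empirical optimal constant c of eq. (23).
import numpy as np

def lstdc_average_reward(phi, rewards):
    # phi: (N+1) x m feature array; rewards: length-N reward array.
    y = rewards - rewards.mean()                   # y_t = r_{t+1} - r_bar (empirical)
    X = phi[:-1] - phi[1:]                         # x_t = phi_t - phi_{t+1}
    Z0 = phi[:-1]                                  # base instrumental variable phi_t

    theta0 = np.linalg.solve(Z0.T @ X, Z0.T @ y)   # initial LSTD estimate
    eps = X @ theta0 - y                           # residuals
    e2 = eps ** 2

    # [sum x_t phi_t^T]^{-1} sum x_t, shared by numerator and denominator
    Sxp_inv_Sx = np.linalg.solve(X.T @ Z0, X.sum(axis=0))
    num = (Z0 * e2[:, None]).sum(axis=0) - (Z0 * e2[:, None]).T @ Z0 @ Sxp_inv_Sx
    den = e2.sum() - (Z0 * e2[:, None]).sum(axis=0) @ Sxp_inv_Sx
    c = -num / den                                 # empirical version of eq. (23)

    Z = Z0 + c                                     # shifted instrumental variable
    return np.linalg.solve(Z.T @ X, Z.T @ y)       # theta_hat_c

# Usage with made-up data.
rng = np.random.default_rng(5)
phi = rng.normal(size=(501, 3))
rewards = rng.normal(size=500)
print(lstdc_average_reward(phi, rewards))
```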
4. Simulation Experiments

So far, we have discussed the asymptotic variance under the assumption that we have an infinite number of samples. In this section, we evaluate the performance of the proposed estimators in a practical situation with a finite number of samples. We use an MDP defined on a simple Markov random walk, which was also used in a previous study (Lagoudakis & Parr, 2003). This MDP is a one-dimensional chain walk with four states (Figure 1). Two actions, "left" (L) and "right" (R), are available at every state. Rewards 1 and 0.5 are given when states 2 and 3 are visited, respectively.

[Figure 1. A four-state MDP: a chain of states 1-4 with rewards r = 0, 1, 0.5 and 0, respectively.]

We adopt the simplest direct representation of states; the state variable takes s = 1, s = 2, s = 3 or s = 4 when the corresponding state is visited. The value function was defined as the average reward, eq. (2), and was approximated by a linear function with a three-dimensional basis function: $\phi(s) = [s, s^2, s^3]^{\top}$. The policy was set at random, and at the beginning of each episode an initial state was randomly selected according to the stationary distribution of this Markov chain. Under these conditions, we performed 100 episodes, each of which consisted of 100 random-walk steps. We evaluated the "mean squared error" (MSE) of the value function, i.e., $\sum_i d^{\pi}(i)\, |\hat{V}(i) - V(i)|^2$ with $i \in \{1, 2, 3, 4\}$, where $\hat{V}$ and $V$ denote $\hat{V}(i) = \phi(s = i)^{\top}\hat{\theta}$ and $V(i) = \phi(s = i)^{\top}\theta^*$, respectively.

Figure 2 shows box-plots of the MSEs of LSTD, LSTDc, and gLSTD. For this example, the conditional expectations in gLSTD can be estimated by the sample average in each state, because there are only four discrete states. In continuous-state problems, however, estimation of such conditional expectations would become much harder. In Figure 2, the y-axis denotes the MSE of the value function. The center line and the upper and lower sides of each box denote the median of the MSEs and the upper and lower quartiles, respectively. The number above each box represents the average MSE. There is a significant difference between the MSE of LSTD and those of LSTDc and gLSTD: the estimators for LSTDc and for gLSTD both achieved a much smaller MSE than that for the ordinary LSTD.

[Figure 2. Simulation result: box-plots of the MSE of the value function for LSTD, LSTDc and gLSTD, with average MSEs of 2.1733 (LSTD), 0.5311 and 0.5333 (LSTDc and gLSTD).]
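The evaluation criterion above is a stationary-distribution-weighted squared error of the estimated value function. A small sketch of this computation follows; the particular parameter vectors and stationary distribution are made-up numbers, not the paper's.

```python
# A minimal sketch (not from the paper): the MSE criterion of Section 4,
# sum_i d^pi(i) |V_hat(i) - V(i)|^2, for the features phi(s) = [s, s^2, s^3].
import numpy as np

Phi = np.array([[s, s**2, s**3] for s in (1, 2, 3, 4)], dtype=float)

def weighted_mse(theta_hat, theta_star, d):
    v_hat, v_true = Phi @ theta_hat, Phi @ theta_star
    return d @ (v_hat - v_true) ** 2

d = np.array([0.25, 0.25, 0.25, 0.25])            # hypothetical stationary distribution
theta_star = np.array([0.6, -0.2, 0.01])          # hypothetical true parameter
theta_hat = theta_star + np.array([0.05, -0.02, 0.001])
print(weighted_mse(theta_hat, theta_star, d))
```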
5. Conclusion

In this study, we have discussed LSTD-based policy evaluation in the framework of semiparametric statistical inference. We showed that the standard LSTD algorithm is indeed an estimating function method, which guarantees its consistency regardless of the stochastic properties of the environment. Based on the optimal estimating functions in the two classes of estimating functions, we constructed two new policy evaluation methods called gLSTD and LSTDc. We also evaluated the asymptotic variance of general instrumental variable methods for MDPs. Moreover, we showed that the form of possible estimating functions for value function estimation is restricted to be the same as that used in the instrumental variable methods. We then demonstrated, through an experiment using a simple MDP problem, that the gLSTD and LSTDc estimators substantially reduce the estimation variance of the LSTD estimator.

Further work is necessary to construct procedures for policy updating based on evaluation by gLSTD and LSTDc. It should be possible to incorporate our proposed ideas into the least-squares policy iteration method (Lagoudakis & Parr, 2003) and the natural actor-critic method (Peters et al., 2005).

Acknowledgments

The authors wish to thank the anonymous reviewers for their helpful comments and suggestions.

A. Proof of Theorem 1: The Optimal Instrumental Variable

As shown in eq. (20), the asymptotic variance of the estimator $\hat{\theta}_z$ is given by
$$\mathrm{AV}[\hat{\theta}_z] = \frac{1}{N} A_z^{-1} M_z A_z^{-\top},$$
where $A_z := E_d[z_t x_t^{\top}]$ and $M_z := E_d[(\epsilon^*_t)^2 z_t z_t^{\top}]$. If we add a small change $\delta_t(s_t, \cdots, s_{t-T})$ to the instrumental variable $z_t$, the matrices become, up to first order in $\delta$,
$$A_{z+\delta} = A_z + E_d[\delta_t x_t^{\top}], \qquad M_{z+\delta} = M_z + E_d[(\epsilon^*_t)^2 (\delta_t z_t^{\top} + z_t \delta_t^{\top})].$$
Therefore, the first-order deviation of the trace of the asymptotic variance can be calculated as
$$\mathrm{Tr}\big[A_{z+\delta}^{-1} M_{z+\delta} A_{z+\delta}^{-\top}\big] - \mathrm{Tr}\big[A_z^{-1} M_z A_z^{-\top}\big] = -\mathrm{Tr}\big[A_z^{-1} E_d[\delta_t x_t^{\top}] A_z^{-1} M_z A_z^{-\top}\big] - \mathrm{Tr}\big[A_z^{-1} M_z A_z^{-\top} E_d[x_t \delta_t^{\top}] A_z^{-\top}\big] + \mathrm{Tr}\big[A_z^{-1} E_d[(\epsilon^*_t)^2 (\delta_t z_t^{\top} + z_t \delta_t^{\top})] A_z^{-\top}\big]$$
$$= 2 E_d\big[\delta_t^{\top} A_z^{-\top} A_z^{-1} E[(\epsilon^*_t)^2 z_t \,|\, s_t, \cdots, s_{t-T}]\big] - 2 E_d\big[\delta_t^{\top} A_z^{-\top} A_z^{-1} M_z A_z^{-\top} E[x_t \,|\, s_t, \cdots, s_{t-T}]\big].$$
By using the condition that this deviation becomes 0 for any small change $\delta_t(s_t, \cdots, s_{t-T})$, the optimal instrumental variable is obtained as
$$z^*_t = E[(\epsilon^*_t)^2 | s_t]^{-1}\, M_z A_z^{-\top}\, E[x_t | s_t].$$
Considering Lemma 3, the optimal instrumental variable can be restricted to
$$z^*_t = E[(\epsilon^*_t)^2 | s_t]^{-1}\, E[x_t | s_t],$$
or any transformation of it by a regular matrix.

Now, we show that eq. (22) also satisfies global optimality. Substituting $z^*_t$ into the matrices $A_z$ and $M_z$, we obtain
$$A_{z^*} = M_{z^*} = F, \qquad \text{where} \quad F := E_d\big[E[(\epsilon^*_t)^2|s_t]^{-1}\, E[x_t|s_t]\, E[x_t^{\top}|s_t]\big].$$
Furthermore, the matrices at $z^*_t + \eta_t$ become
$$A_{z^*+\eta} = F + E_d[\eta_t x_t^{\top}], \qquad M_{z^*+\eta} = F + E_d[\eta_t x_t^{\top}] + E_d[x_t \eta_t^{\top}] + E_d[(\epsilon^*_t)^2 \eta_t \eta_t^{\top}].$$
Therefore,
$$A_{z^*+\eta}^{-1} M_{z^*+\eta} A_{z^*+\eta}^{-\top} - A_{z^*}^{-1} M_{z^*} A_{z^*}^{-\top} = A_{z^*+\eta}^{-1}\big(M_{z^*+\eta} - A_{z^*+\eta} A_{z^*}^{-1} M_{z^*} A_{z^*}^{-\top} A_{z^*+\eta}^{\top}\big) A_{z^*+\eta}^{-\top}$$
$$= A_{z^*+\eta}^{-1}\big(E_d[(\epsilon^*_t)^2 \eta_t \eta_t^{\top}] - E_d[\eta_t x_t^{\top}]\, F^{-1} E_d[x_t \eta_t^{\top}]\big) A_{z^*+\eta}^{-\top}$$
$$= A_{z^*+\eta}^{-1}\, E_d\Big[E[(\epsilon^*_t)^2|s_t]\,\big(\eta_t - E[(\epsilon^*_t)^2|s_t]^{-1} E_d[\eta_t x_t^{\top}] F^{-1} E[x_t|s_t]\big)\big(\eta_t - E[(\epsilon^*_t)^2|s_t]^{-1} E_d[\eta_t x_t^{\top}] F^{-1} E[x_t|s_t]\big)^{\top}\Big] A_{z^*+\eta}^{-\top} \succeq 0.$$
The equality holds only when $\eta_t(s_t) \propto z^*_t$, i.e., when $\eta_t$ is a linear transformation of $z^*_t$. (Q.E.D.)

B. Proof of Theorem 2: The Optimal c

By differentiating the trace of eq. (20) with respect to $c$, the optimal constant $c$ must satisfy
$$E_d[(\epsilon^*_t)^2]\, c + E_d[(\epsilon^*_t)^2 z_t] = M_c A_c^{-\top} E_d[x_t],$$
where $A_c$ and $M_c$ denote the matrices of eq. (20) for the shifted instrumental variable $z_t + c$. Using the well-known matrix inversion lemma (Horn & Johnson, 1985), the solution can be obtained as eq. (23). In addition, the global optimality among the instrumental variables obtained by constant shifts can be proved by an argument similar to that in Appendix A. (Q.E.D.)

References

Amari, S., & Kawanabe, M. (1997). Information geometry of estimating functions in semi-parametric statistical models. Bernoulli, 3, 29-54.

Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.

Bickel, P., Klaassen, C., Ritov, Y., & Wellner, J. (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer.

Bradtke, S., & Barto, A. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22, 33-57.

Godambe, V. (1985). The foundations of finite sample estimation in stochastic processes. Biometrika, 72, 419-428.

Godambe, V. (Ed.). (1991). Estimating Functions. Oxford Science.

Greensmith, E., Bartlett, P., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471-1530.

Horn, R., & Johnson, C. (1985). Matrix Analysis. Cambridge University Press.

Lagoudakis, M., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107-1149.

Mannor, S., Simester, D., Sun, P., & Tsitsiklis, J. N. (2007). Bias and variance approximation in value function estimates. Management Science, 53, 308-322.

Peters, J., Vijayakumar, S., & Schaal, S. (2005). Natural actor-critic. Proceedings of the 16th European Conference on Machine Learning (pp. 280-291).

Sutton, R., & Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.

Young, P. (1984). Recursive Estimation and Time-series Analysis. Springer-Verlag.