Analysis of a Classification-based Policy Iteration Algorithm

Alessandro Lazaric  alessandro.lazaric@inria.fr
Mohammad Ghavamzadeh  mohammad.ghavamzadeh@inria.fr
Rémi Munos  remi.munos@inria.fr
SequeL Project, INRIA Lille-Nord Europe, 40 avenue Halley, 59650 Villeneuve d'Ascq, France

Abstract

We present a classification-based policy iteration algorithm, called Direct Policy Iteration, and provide its finite-sample analysis. Our results state a performance bound in terms of the number of policy improvement steps, the number of rollouts used in each iteration, the capacity of the considered policy space, and a new capacity measure which indicates how well the policy space can approximate policies that are greedy w.r.t. any of its members. The analysis reveals a tradeoff between the estimation and approximation errors in this classification-based policy iteration setting. We also study the consistency of the method when there exists a sequence of policy spaces with increasing capacity.

1. Introduction

Policy iteration (Howard, 1960) is a method of computing an optimal policy for any given Markov decision process (MDP).
It is an iterative procedure that discovers a deterministic optimal policy by generating a sequence of monotonically improving policies. Each iteration k of this algorithm consists of two phases: policy evaluation, in which the action-value function $Q^{\pi_k}$ of the current policy $\pi_k$ is computed, and policy improvement, in which the new (improved) policy $\pi_{k+1}$ is generated as the greedy policy w.r.t. $Q^{\pi_k}$, i.e., $\pi_{k+1}(x) = \arg\max_{a \in \mathcal{A}} Q^{\pi_k}(x,a)$. Unfortunately, in MDPs with large (or continuous) state and action spaces, the policy evaluation problem cannot be solved exactly and approximation techniques are required. In approximate policy iteration (API), a function approximation scheme is usually employed in the policy evaluation phase. The most common approach is to find a good approximation of the value function of $\pi_k$ in a real-valued function space (see e.g., Bradtke & Barto, 1996; Lagoudakis & Parr, 2003a). The main drawbacks of this approach are: 1) the action-value function $Q^{\pi_k}$ is not known in advance, and its high-quality samples are often very expensive to obtain, if this option is possible at all, 2) it is often difficult to find a function space rich enough to represent the action-value function accurately, and thus careful hand-tuning is needed to achieve satisfactory results, 3) for the success of policy iteration, it is not necessary to estimate $Q^{\pi_k}$ accurately at every state-action pair; what is important is to have a performance similar to that of the greedy policy, and 4) this method may not be the right choice in domains where good policies are easier to represent and learn than the corresponding value functions.

To address the above issues, mainly 3 and 4 (footnote 1), variants of API have been proposed that replace the usual value function learning step (approximating the action-value function over the entire state-action space) with a learning step in a policy space (Lagoudakis & Parr, 2003b; Fern et al., 2004). The main idea is to cast the policy improvement step as a classification problem. The training set is generated using rollout estimates of $Q^\pi$ over a finite number of states $D = \{x_i\}_{i=1}^N$, called the rollout set, and for every action $a \in \mathcal{A}$ (footnote 2). For each $x \in D$, if the estimated value $\hat{Q}^\pi(x, a^+)$ of some action $a^+$ is greater than the estimated values of all other actions with high confidence, the state-action pair $(x, a^+)$ is added to the training set with a positive label. In this case, the pairs $(x, a)$ for the remaining actions are labeled negative and added to the training set. The policy improvement step thus reduces to solving a classification problem to find a policy in a given hypothesis space that best predicts the greedy action at every state. Although it is highly debatable whether selecting a suitable policy space is any easier than selecting a value function space, we can argue that classification-based API methods can be advantageous in problems where good policies are easier to represent and learn than their value functions.

Footnote 1: The first drawback is shared by all reinforcement learning algorithms and the second one is common to all practical applications of machine learning methods.
Footnote 2: It is worth stressing that $Q^\pi$ is estimated only at the states in $D$ and not over the entire state-action space.

The classification-based API algorithms can be viewed as a type of reduction from reinforcement learning (RL) to classification, i.e., solving an MDP by generating and solving a series of classification problems. There have been other proposals for reducing RL to classification. Bagnell et al. (2003) introduced an algorithm for learning non-stationary policies in RL. For a specified horizon h, their approach learns a sequence of h policies. At each iteration, all policies are fixed except for one, which is optimized by forming a classification problem via policy rollout. Langford & Zadrozny (2005) provided a formal reduction from RL to classification, showing that $\epsilon$-accurate classification implies near-optimal RL. This approach uses an optimistic variant of sparse sampling to generate h classification problems, one for each horizon time step. The main limitation of this work is that it does not provide a practical method for generating training examples for these classification problems. Although the classification-based API algorithms have been successfully applied to benchmark problems (Lagoudakis & Parr, 2003b; Fern et al., 2004) and have been modified to become more computationally efficient (Dimitrakakis & Lagoudakis, 2008b), a full theoretical understanding of them is still lacking. Fern et al. (2006) and Dimitrakakis & Lagoudakis (2008a) provide a preliminary theoretical analysis of their algorithms. In particular, they both bound the difference in performance at each iteration between the learned policy and the true greedy policy. Their analysis is limited to a one-step policy update (they do not show how the error in the policy update is propagated through the iterations of the API algorithm) and either to a finite class of policies (in Fern et al. (2006)) or to a specific architecture (a uniform grid in Dimitrakakis & Lagoudakis (2008a)). Moreover, the bound reported in Fern et al. (2006) depends inversely on the minimum Q-value gap between a greedy and a sub-greedy action over the state space. In some classes of MDPs this gap can be arbitrarily small, so that the learned policy can be arbitrarily worse than the greedy policy. In order to deal with this problem, Dimitrakakis & Lagoudakis (2008a) assume the action-value functions to be smooth and the probability of states with a small Q-value gap to be small.
In this paper, we derive a full finite-sample analysis of a classification-based API algorithm, called Direct Policy Iteration (DPI), based on a cost-sensitive loss function weighing each classification error by its actual regret, i.e., the difference between the action-value of the greedy action and of the action chosen by DPI. Using this loss, we are able to derive a performance bound with no dependency on the Q-value gaps and no assumption on the probability of small-gap states. Our analysis further extends those in Fern et al. (2006) and Dimitrakakis & Lagoudakis (2008a) by considering arbitrary policy spaces. We also analyze the consistency of DPI when there exists a sequence of policy spaces with increasing capacity. We first use a counterexample to show that it is not consistent in general, and then prove its consistency for the class of Lipschitz MDPs. We conclude the paper with a discussion on different theoretical and practical aspects of DPI.

2. Preliminaries

In this section we set the notation used throughout the paper. A discounted Markov decision process (MDP) $\mathcal{M}$ is a tuple $\langle \mathcal{X}, \mathcal{A}, r, p, \gamma \rangle$, where the state space $\mathcal{X}$ is a bounded closed subset of a Euclidean space $\mathbb{R}^d$, the set of actions $\mathcal{A}$ is finite ($|\mathcal{A}| < \infty$), the reward function $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is uniformly bounded by $R_{\max}$, the transition model $p(\cdot|x,a)$ is a distribution over $\mathcal{X}$, and $\gamma \in (0,1)$ is a discount factor. Let $\mathcal{B}^V(\mathcal{X}; V_{\max})$ and $\mathcal{B}^Q(\mathcal{X} \times \mathcal{A}; Q_{\max})$ be the spaces of Borel-measurable value functions and action-value functions bounded by $V_{\max}$ and $Q_{\max}$ ($V_{\max} = Q_{\max} = \frac{R_{\max}}{1-\gamma}$), respectively. We also use $\mathcal{B}^\pi(\mathcal{X})$ to denote the space of deterministic policies $\pi : \mathcal{X} \to \mathcal{A}$. The value function of a policy $\pi$, $V^\pi$, is the unique fixed point of the Bellman operator $T^\pi : \mathcal{B}^V(\mathcal{X}; V_{\max}) \to \mathcal{B}^V(\mathcal{X}; V_{\max})$ defined by

$(T^\pi V)(x) = r(x, \pi(x)) + \gamma \int_{\mathcal{X}} p(dy|x, \pi(x))\, V(y).$

The action-value function $Q^\pi$ is defined as

$Q^\pi(x,a) = r(x,a) + \gamma \int_{\mathcal{X}} p(dy|x,a)\, V^\pi(y).$

Similarly, the optimal value function, $V^*$, is the unique fixed point of the optimal Bellman operator $T : \mathcal{B}^V(\mathcal{X}; V_{\max}) \to \mathcal{B}^V(\mathcal{X}; V_{\max})$ defined as

$(TV)(x) = \max_{a \in \mathcal{A}} \Big[ r(x,a) + \gamma \int_{\mathcal{X}} p(dy|x,a)\, V(y) \Big],$

and the optimal action-value function $Q^*$ is defined by

$Q^*(x,a) = r(x,a) + \gamma \int_{\mathcal{X}} p(dy|x,a)\, V^*(y).$

We say that a deterministic policy $\pi \in \mathcal{B}^\pi(\mathcal{X})$ is greedy w.r.t. an action-value function $Q$ if $\pi(x) \in \arg\max_{a \in \mathcal{A}} Q(x,a)$ for all $x \in \mathcal{X}$. Greedy policies are important because any policy greedy w.r.t. $Q^*$ is optimal. We define the greedy policy operator $\mathcal{G} : \mathcal{B}^\pi(\mathcal{X}) \to \mathcal{B}^\pi(\mathcal{X})$ as (footnote 3)

$(\mathcal{G}\pi)(x) = \arg\max_{a \in \mathcal{A}} Q^\pi(x,a). \qquad (1)$

Footnote 3: In Eq. 1, the tie among the actions maximizing $Q^\pi(x,a)$ is broken in an arbitrary but consistent manner.

In the analysis of this paper, $\mathcal{G}$ plays a role similar to the one played by the optimal Bellman operator, $T$, in the analysis of the fitted value iteration algorithm (Munos & Szepesvári, 2008, Section 5).

  Input: policy space $\Pi \subseteq \mathcal{B}^\pi(\mathcal{X})$, state distribution $\rho$
  Initialize: let $\pi_0$ be an arbitrary policy in $\Pi$
  for k = 0, 1, 2, ... do
    Construct the rollout set $D = \{x_i\}_{i=1}^N$, $x_i \overset{iid}{\sim} \rho$
    for all states $x_i \in D$ and actions $a \in \mathcal{A}$ do
      for j = 1 to M do
        Perform a rollout according to policy $\pi_k$ and return
        $R_j^{\pi_k}(x_i, a) = r(x_i, a) + \sum_{t \geq 1} \gamma^t r\big(x_t, \pi_k(x_t)\big)$,
        with $x_t \sim p(\cdot|x_{t-1}, \pi_k(x_{t-1}))$ and $x_1 \sim p(\cdot|x_i, a)$
      end for
      $\hat{Q}^{\pi_k}(x_i, a) = \frac{1}{M} \sum_{j=1}^{M} R_j^{\pi_k}(x_i, a)$
    end for
    $\pi_{k+1} = \arg\min_{\pi \in \Pi} \|\hat{\ell}_{\pi_k}(\cdot;\pi)\|_{1,\hat\rho}$   (classifier)
  end for

Figure 1. The Direct Policy Iteration (DPI) algorithm.
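For concreteness, the following is a minimal Python sketch (not from the paper) of the inner loop of Figure 1: the Monte-Carlo estimate $\hat{Q}^{\pi_k}(x_i, a)$ obtained by averaging M rollouts that start with action a and then follow the current policy. The environment interface `env_step`, the finite-horizon truncation H (see footnote 5 in Section 3), the toy random-walk example, and all names are illustrative assumptions.

```python
import numpy as np

def rollout_return(env_step, x, a, policy, gamma, H, rng):
    # One rollout estimate R_j^{pi_k}(x, a): take action a in x, then follow policy for H more steps.
    next_x, reward = env_step(x, a, rng)
    ret, discount = reward, gamma
    for _ in range(H):
        action = policy(next_x)
        next_x, reward = env_step(next_x, action, rng)
        ret += discount * reward
        discount *= gamma
    return ret

def q_estimate(env_step, x, a, policy, gamma, H, M, rng):
    # \hat Q^{pi_k}(x, a): average of M independent rollouts, as in Figure 1.
    return float(np.mean([rollout_return(env_step, x, a, policy, gamma, H, rng)
                          for _ in range(M)]))

# Toy one-dimensional random-walk environment, purely for illustration.
def toy_step(x, a, rng):
    next_x = min(max(x + (1 if a == 1 else -1) + rng.integers(-1, 2), 0), 10)
    return next_x, 1.0 if next_x == 10 else 0.0

print(q_estimate(toy_step, 5, 1, lambda x: 1, gamma=0.95, H=50, M=10,
                 rng=np.random.default_rng(0)))
```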
3. The DPI Algorithm

In this section, we outline the Direct Policy Iteration (DPI) algorithm. DPI shares the same structure as Lagoudakis & Parr (2003b) and Fern et al. (2004). Although the algorithm can benefit from improvements in 1) selecting states for the rollout set $D$, 2) the criteria used to add a sample to the training set, and 3) the rollout strategy, as discussed in Lagoudakis & Parr (2003b) and Dimitrakakis & Lagoudakis (2008b), here we consider its basic form in order to ease the analysis. In DPI, at each iteration k, a new policy $\pi_{k+1}$ is computed from $\pi_k$ as the best approximation of $\mathcal{G}\pi_k$, by solving a cost-sensitive classification problem. More formally, DPI is based on the following loss function.

Definition 1. The loss function at iteration k for a policy $\pi$ is denoted by $\ell_{\pi_k}(\cdot;\pi)$ and is defined as

$\ell_{\pi_k}(x;\pi) = \max_{a \in \mathcal{A}} Q^{\pi_k}(x,a) - Q^{\pi_k}\big(x, \pi(x)\big), \qquad \forall x \in \mathcal{X}.$

Given a distribution $\rho$ over $\mathcal{X}$, we define the expected error as the $L_{1,\rho}$-norm of the loss function $\ell_{\pi_k}(\cdot;\pi)$,

$\|\ell_{\pi_k}(\cdot;\pi)\|_{1,\rho} = \int_{\mathcal{X}} \Big[ \max_{a \in \mathcal{A}} Q^{\pi_k}(x,a) - Q^{\pi_k}\big(x, \pi(x)\big) \Big] \rho(dx). \qquad (2)$

While in Lagoudakis & Parr (2003b) the goal is to minimize the number of misclassifications (i.e., they use a 0/1 loss function), DPI learns a policy which aims at minimizing the loss $\ell_{\pi_k}$. Similar to other algorithms in classification-based RL (Fern et al., 2004; Li et al., 2007; Langford & Zadrozny, 2005), DPI does not focus on finding a uniformly accurate approximation of the actions taken by the greedy policy, but rather on finding actions leading to a similar performance. This is consistent with the final objective of policy iteration, which is to obtain a policy with performance similar to that of an optimal policy, not to take similar actions (footnote 4).

Footnote 4: Refer to (Li et al., 2007) for a simple example in which an accurate policy might have a very poor performance w.r.t. the greedy policy.

As illustrated in Figure 1, for each state $x_i \in D$ and each action $a \in \mathcal{A}$, an estimate of the action-value function of the current policy is computed through M independent rollouts (footnote 5). Given the outcome of the rollouts, the empirical loss is defined as follows.

Definition 2. For any $x \in D$, the empirical loss function at iteration k for a policy $\pi$ is

$\hat\ell_{\pi_k}(x;\pi) = \max_{a \in \mathcal{A}} \hat{Q}^{\pi_k}(x,a) - \hat{Q}^{\pi_k}\big(x, \pi(x)\big),$

where $\hat{Q}^{\pi_k}(x,a)$ is a rollout estimate of the Q-value of $\pi_k$ at $(x,a)$ as defined in Figure 1. Similar to Definition 1, the empirical error is defined as the $L_{1,\hat\rho}$-norm of the empirical loss function,

$\|\hat\ell_{\pi_k}(\cdot;\pi)\|_{1,\hat\rho} = \frac{1}{N} \sum_{i=1}^N \Big[ \max_{a \in \mathcal{A}} \hat{Q}^{\pi_k}(x_i,a) - \hat{Q}^{\pi_k}\big(x_i, \pi(x_i)\big) \Big],$

where $\|\cdot\|_{1,\hat\rho}$ denotes the $L_1$-norm weighted by the empirical distribution $\hat\rho$ induced by the samples in $D$. Finally, DPI makes use of a classifier which returns a policy that minimizes the empirical error $\|\hat\ell_{\pi_k}(\cdot;\pi)\|_{1,\hat\rho}$ over the policy space $\Pi$.

Footnote 5: Here we consider rollouts in which policy $\pi_k$ is followed for an infinite number of steps or until a terminal state is reached. In practice, a finite horizon H is defined for the rollouts, and thus an additional term $\gamma^H Q_{\max}$ (vanishing with H) appears in the final bound.

4. Finite-sample analysis of DPI

In this section, we first provide a finite-sample analysis of the error incurred at each iteration of DPI in Theorem 1, and then show how this error is propagated through the iterations of the algorithm in Theorem 2. In the analysis, we explicitly assume that the action space contains only two actions, i.e., $\mathcal{A} = \{a_1, a_2\}$ and $|\mathcal{A}| = 2$. We will discuss this assumption and other theoretical and practical aspects of DPI in Section 6.

4.1. Error Bound at Each Iteration

Here we study the error incurred at each iteration k of the DPI algorithm. Before stating the main result, we define the inherent greedy error of a policy space $\Pi$.

Definition 3. We define the inherent greedy error of a policy space $\Pi \subseteq \mathcal{B}^\pi(\mathcal{X})$ as

$d(\Pi, \mathcal{G}\Pi) = \sup_{\pi \in \Pi}\, \inf_{\pi' \in \Pi} \|\ell_{\pi}(\cdot;\pi')\|_{1,\rho}.$

In other words, the inherent greedy error is the worst expected error that a loss-minimizing policy $\pi' \in \Pi$ can incur in approximating a greedy policy $\mathcal{G}\pi$, $\pi \in \Pi$. It measures how well $\Pi$ is able to approximate policies that are greedy w.r.t. any policy in $\Pi$.
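To make the quantities of Definitions 1-3 concrete, here is a small Python sketch (an illustration, not the authors' implementation) of the empirical error of Definition 2 and of the classification step that selects $\pi_{k+1}$ by exhaustive search over a tiny finite policy set; in practice the minimization would be delegated to a cost-sensitive classifier. The array layout (`Q_hat[i, a]` $= \hat{Q}^{\pi_k}(x_i,a)$) and the representation of a policy by its actions at the N rollout states are assumptions of the sketch.

```python
import numpy as np

def empirical_error(Q_hat, actions):
    # ||hat ell_{pi_k}(.; pi)||_{1, hat rho}: average estimated regret of the policy's actions.
    # Q_hat: (N, |A|) array of rollout estimates; actions: length-N array with actions[i] = pi(x_i).
    regrets = Q_hat.max(axis=1) - Q_hat[np.arange(len(actions)), actions]
    return float(regrets.mean())

def dpi_improvement_step(Q_hat, policy_space):
    # pi_{k+1} = argmin over the policy space of the empirical error (the "classifier" of Figure 1).
    return min(policy_space, key=lambda actions: empirical_error(Q_hat, np.asarray(actions)))

# Two states, two actions: the candidate (1, 0) picks the greedy action in both states.
Q_hat = np.array([[0.2, 0.7], [0.9, 0.1]])
print(dpi_improvement_step(Q_hat, policy_space=[(0, 0), (1, 0), (1, 1)]))
```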
Lemma 1. Let $\Pi$ be a policy space with finite VC-dimension $h = VC(\Pi) < \infty$, and $\mathcal{F}_{\pi_k}$ be the space of the loss functions at iteration k induced by the policies in $\Pi$, i.e., $\mathcal{F}_{\pi_k} = \{\ell_{\pi_k}(\cdot;\pi);\ \pi \in \Pi\}$. Note that all functions $\ell_{\pi_k} \in \mathcal{F}_{\pi_k}$ are uniformly bounded by $Q_{\max}$. Let $N > 0$ be the number of states in the rollout set $D$, drawn i.i.d. from the state distribution $\rho$. Then, for any $\delta > 0$,

$\mathbb{P}\Big[ \sup_{\pi \in \Pi} \big|\, \|\ell_{\pi_k}(\cdot;\pi)\|_{1,\hat\rho} - \|\ell_{\pi_k}(\cdot;\pi)\|_{1,\rho} \big| > \epsilon \Big] \leq \delta, \quad \text{with } \epsilon = 2\sqrt{2}\, Q_{\max} \sqrt{\frac{h \log\frac{2eN}{h} + \log\frac{2}{\delta}}{N}}.$

Proof (sketch). First we rewrite the loss function as

$\ell_{\pi_k}(x;\pi) = \mathbb{1}\{(\mathcal{G}\pi_k)(x) \neq \pi(x)\}\, \Delta_{\pi_k}(x), \quad \text{where} \quad \Delta_{\pi_k}(x) = \max_{a \in \mathcal{A}} Q^{\pi_k}(x,a) - \min_{a' \in \mathcal{A}} Q^{\pi_k}(x,a') \qquad (3)$

is the gap between the two actions (the regret of choosing the wrong action). Since the loss function depends on the policy $\pi$ only through the indicator function, we can directly relate the complexity of $\mathcal{F}_{\pi_k}$ to the complexity of $\Pi$. In particular, let

$\mathcal{F}_{\pi_k}^{x_1:x_N} = \big\{ \big(\ell_{\pi_k}(x_1;\pi), \ldots, \ell_{\pi_k}(x_N;\pi)\big);\ \pi \in \Pi \big\}$

be the set of possible values of the loss function on the rollout set $D = \{x_i\}_{i=1}^N$ for the policies in $\Pi$. Then the corresponding growth function, $S_{\mathcal{F}_{\pi_k}}(N)$, is strictly related to the VC-dimension of $\Pi$. Indeed, the cardinality of $\mathcal{F}_{\pi_k}^{x_1:x_N}$ depends only on $S_\Pi(N)$, the number of combinations of the indicator function that can be induced by the policies in $\Pi$,

$S_{\mathcal{F}_{\pi_k}}(N) = \sup_{x_1,\ldots,x_N} \big|\mathcal{F}_{\pi_k}^{x_1:x_N}\big| \leq S_\Pi(N) \leq \Big(\frac{eN}{h}\Big)^h.$

The rest of the proof follows the same usual steps as in Vapnik & Chervonenkis (1971).

We are now ready to prove the main result of this section. We show a high-probability bound on the expected error at each iteration k of DPI.

Theorem 1. Let $\Pi$ be a policy space with finite VC-dimension $h = VC(\Pi) < \infty$ and $\rho$ be a distribution over the state space $\mathcal{X}$. Let N be the number of states in D drawn i.i.d. from $\rho$, and M be the number of rollouts per state-action pair used in the estimation of the action-value functions. Let $\pi_{k+1} = \arg\min_{\pi \in \Pi} \|\hat\ell_{\pi_k}(\cdot;\pi)\|_{1,\hat\rho}$ be the policy computed at the k-th iteration of DPI. Then, for any $\delta > 0$, we have

$\|\ell_{\pi_k}(\cdot;\pi_{k+1})\|_{1,\rho} \leq d(\Pi, \mathcal{G}\Pi) + 2(\epsilon_1 + \epsilon_2), \qquad (4)$

with probability $1-\delta$, where

$\epsilon_1 = 2\sqrt{2}\, Q_{\max} \sqrt{\frac{h \log\frac{2eN}{h} + \log\frac{8}{\delta}}{N}}, \qquad \epsilon_2 = 2\, Q_{\max} \sqrt{\frac{\log\frac{4}{\delta}}{MN}}.$

Remarks: The bound in Eq. 4 can be decomposed into an approximation error, $d(\Pi, \mathcal{G}\Pi)$, and an estimation error consisting of two terms, $\epsilon_1$ and $\epsilon_2$. This is similar to generalization bounds in classification, where the approximation error is the distance between the target function (here the greedy policy w.r.t. $\pi_k$) and the function space $\Pi$. Here $d(\Pi, \mathcal{G}\Pi)$ represents the worst such distance over the policies in $\Pi$. The first estimation term, $\epsilon_1$, grows with the capacity of $\Pi$, measured by its VC-dimension h, and decreases with the number of sampled states N. Thus, in order to avoid overfitting, we should have $N \gg h$. The second estimation term, $\epsilon_2$, comes from the error in the estimation of the action-values due to the finite number of rollouts M. It is important to note the nice rate of $1/\sqrt{MN}$ instead of $1/\sqrt{M}$. This is due to the fact that we do not need a uniformly good estimation of the action-value function at all sampled states, but only an averaged estimation of those values at the sampled points. An important consequence of this is that the algorithm works perfectly well if we consider only M = 1 rollout per state-action pair. Therefore, given a fixed budget of rollouts per iteration, the best allocation of M and N is to choose M = 1 and sample as many states as possible, thus reducing the risk of overfitting.
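The short sketch below (not from the paper) evaluates the two estimation terms for a fixed rollout budget B = MN, to illustrate the point made in the Remarks that M = 1 with N as large as possible is the preferred allocation: $\epsilon_2$ depends only on the product MN, while $\epsilon_1$ shrinks with N. The expressions follow the reconstruction of $\epsilon_1$ and $\epsilon_2$ given above, and the values of h, $\delta$, $Q_{\max}$, and the budget are arbitrary.

```python
import numpy as np

def eps1(h, N, delta, Qmax=1.0):
    # Estimation error due to the finite rollout set (VC term in Theorem 1).
    return 2 * np.sqrt(2) * Qmax * np.sqrt((h * np.log(2 * np.e * N / h) + np.log(8 / delta)) / N)

def eps2(M, N, delta, Qmax=1.0):
    # Estimation error due to the finite number M of rollouts per state-action pair.
    return 2 * Qmax * np.sqrt(np.log(4 / delta) / (M * N))

budget, h, delta = 10000, 10, 0.05
for M in (1, 10, 100):
    N = budget // M
    print(f"M={M:3d}  N={N:5d}  eps1={eps1(h, N, delta):.3f}  eps2={eps2(M, N, delta):.3f}")
```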
Proof. Let $a^+(\cdot) = \arg\max_{a \in \mathcal{A}} Q^{\pi_k}(\cdot, a)$ be the greedy action (footnote 6). We prove the following series of inequalities:

$\|\ell_{\pi_k}(\cdot;\pi_{k+1})\|_{1,\rho} \overset{(a)}{\leq} \frac{1}{N}\sum_{i=1}^N \big[ Q^{\pi_k}(x_i, a^+) - Q^{\pi_k}(x_i, \pi_{k+1}(x_i)) \big] + \epsilon_1$   w.p. $1-\delta'$
$\overset{(b)}{\leq} \frac{1}{N}\sum_{i=1}^N \big[ \hat Q^{\pi_k}(x_i, a^+) - \hat Q^{\pi_k}(x_i, \pi_{k+1}(x_i)) \big] + \epsilon_1 + \epsilon_2$   w.p. $1-2\delta'$
$\overset{(c)}{\leq} \frac{1}{N}\sum_{i=1}^N \big[ \hat Q^{\pi_k}(x_i, a^+) - \hat Q^{\pi_k}(x_i, \pi^*(x_i)) \big] + \epsilon_1 + \epsilon_2$
$\overset{(d)}{\leq} \frac{1}{N}\sum_{i=1}^N \big[ Q^{\pi_k}(x_i, a^+) - Q^{\pi_k}(x_i, \pi^*(x_i)) \big] + \epsilon_1 + 2\epsilon_2$   w.p. $1-3\delta'$
$\overset{(e)}{=} \|\ell_{\pi_k}(\cdot;\pi^*)\|_{1,\hat\rho} + \epsilon_1 + 2\epsilon_2 \leq \|\ell_{\pi_k}(\cdot;\pi^*)\|_{1,\rho} + 2(\epsilon_1 + \epsilon_2)$   w.p. $1-4\delta'$
$\overset{(f)}{=} \inf_{\pi' \in \Pi} \|\ell_{\pi_k}(\cdot;\pi')\|_{1,\rho} + 2(\epsilon_1 + \epsilon_2) \leq d(\Pi, \mathcal{G}\Pi) + 2(\epsilon_1 + \epsilon_2).$

The statement of the theorem is obtained by setting $\delta' = \delta/4$.

(a) This is an immediate application of Lemma 1, bounding the difference between $\|\ell_{\pi_k}(\cdot;\pi_{k+1})\|_{1,\rho}$ and $\|\ell_{\pi_k}(\cdot;\pi_{k+1})\|_{1,\hat\rho}$.

(b) Here we introduce the estimated action-value function $\hat Q^{\pi_k}$ by bounding $\frac{1}{N}\sum_{i=1}^N Q^{\pi_k}(x_i,a) - \frac{1}{N}\sum_{i=1}^N \hat Q^{\pi_k}(x_i,a)$, the difference between the true action-value function and its rollout estimates averaged over the states in the rollout set $D = \{x_i\}_{i=1}^N$. In particular, by using the Chernoff-Hoeffding inequality and by recalling the definition of $\hat Q^{\pi_k}(x_i, a)$ as the average of M rollouts, we obtain

$\Big| \frac{1}{MN}\sum_{i=1}^N \sum_{j=1}^M R_j^{\pi_k}(x_i,a) - \frac{1}{N}\sum_{i=1}^N Q^{\pi_k}(x_i,a) \Big| \leq 2\, Q_{\max} \sqrt{\frac{\log\frac{1}{\delta'}}{MN}}$

with probability $1-\delta'$.

(c) From the definition of $\pi_{k+1}$ in the DPI algorithm (see Figure 1), we have

$\pi_{k+1} = \arg\min_{\pi \in \Pi} \|\hat\ell_{\pi_k}(\cdot;\pi)\|_{1,\hat\rho} = \arg\max_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^N \hat Q^{\pi_k}\big(x_i, \pi(x_i)\big),$

thus $-\frac{1}{N}\sum_{i=1}^N \hat Q^{\pi_k}\big(x_i, \pi_{k+1}(x_i)\big)$ can only be increased by replacing $\pi_{k+1}$ with any other policy, in particular with

$\pi^* = \arg\inf_{\pi' \in \Pi} \int_{\mathcal{X}} \Big[ \max_{a \in \mathcal{A}} Q^{\pi_k}(x,a) - Q^{\pi_k}\big(x, \pi'(x)\big) \Big] \rho(dx).$

(d)-(f) The final result follows by using Definition 3 and by applying the Chernoff-Hoeffding inequality and the generalization bound of Lemma 1 once more.

Footnote 6: To simplify the notation, we remove the dependency of $a^+$ on the state and use $a^+$ instead of $a^+(x_i)$ in the following.

4.2. Error Propagation

In this section, we first show how the expected error is propagated through the iterations of DPI. We then analyze the error between the value function of the policy obtained by DPI after K iterations and the optimal value function in $\mu$-norm, where $\mu$ is a distribution over the states which might be different from the sampling distribution $\rho$. Let $P^\pi$ be the transition kernel for policy $\pi$, i.e., $P^\pi(dy|x) = p\big(dy|x, \pi(x)\big)$. It defines two related operators: a right-linear operator, $P^\pi\cdot$, which maps any $V \in \mathcal{B}^V(\mathcal{X}; V_{\max})$ to $(P^\pi V)(x) = \int V(y) P^\pi(dy|x)$, and a left-linear operator, $\cdot P^\pi$, which returns $(\mu P^\pi)(dy) = \int P^\pi(dy|x) \mu(dx)$ for any distribution $\mu$ over $\mathcal{X}$.

From the definitions of $\ell_{\pi_k}$, $T^\pi$, and $T$, we have $\ell_{\pi_k}(\cdot;\pi_{k+1}) = T V^{\pi_k} - T^{\pi_{k+1}} V^{\pi_k}$. We deduce the following pointwise inequalities:

$V^{\pi_k} - V^{\pi_{k+1}} = T^{\pi_k} V^{\pi_k} - T^{\pi_{k+1}} V^{\pi_k} + T^{\pi_{k+1}} V^{\pi_k} - T^{\pi_{k+1}} V^{\pi_{k+1}} \leq \ell_{\pi_k}(\cdot;\pi_{k+1}) + \gamma P^{\pi_{k+1}} \big(V^{\pi_k} - V^{\pi_{k+1}}\big),$

which gives us $V^{\pi_k} - V^{\pi_{k+1}} \leq (I - \gamma P^{\pi_{k+1}})^{-1} \ell_{\pi_k}(\cdot;\pi_{k+1})$. We also have

$V^* - V^{\pi_{k+1}} = T^{\pi^*} V^* - T^{\pi^*} V^{\pi_k} + T^{\pi^*} V^{\pi_k} - T^{\pi_{k+1}} V^{\pi_k} + T^{\pi_{k+1}} V^{\pi_k} - T^{\pi_{k+1}} V^{\pi_{k+1}} \leq \gamma P^{\pi^*}\big(V^* - V^{\pi_k}\big) + \ell_{\pi_k}(\cdot;\pi_{k+1}) + \gamma P^{\pi_{k+1}}\big(V^{\pi_k} - V^{\pi_{k+1}}\big)$

(using $T^{\pi^*} V^{\pi_k} \leq T V^{\pi_k}$), which yields

$V^* - V^{\pi_{k+1}} \leq \gamma P^{\pi^*}\big(V^* - V^{\pi_k}\big) + \big( \gamma P^{\pi_{k+1}} (I - \gamma P^{\pi_{k+1}})^{-1} + I \big) \ell_{\pi_k}(\cdot;\pi_{k+1}) = \gamma P^{\pi^*}\big(V^* - V^{\pi_k}\big) + (I - \gamma P^{\pi_{k+1}})^{-1} \ell_{\pi_k}(\cdot;\pi_{k+1}).$

Finally, by defining the operator $E_k = (I - \gamma P^{\pi_{k+1}})^{-1}$, which is well defined since $P^{\pi_{k+1}}$ is a stochastic kernel and $\gamma < 1$, and by induction, we obtain

$V^* - V^{\pi_K} \leq (\gamma P^{\pi^*})^K \big(V^* - V^{\pi_0}\big) + \sum_{k=0}^{K-1} (\gamma P^{\pi^*})^{K-k-1} E_k\, \ell_{\pi_k}(\cdot;\pi_{k+1}). \qquad (5)$

Eq. 5 shows how the error at each iteration k of DPI, $\ell_{\pi_k}(\cdot;\pi_{k+1})$, is propagated through the iterations and appears in the final error of the algorithm, $V^* - V^{\pi_K}$.
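As a concrete illustration of the operators appearing in Eq. 5, the sketch below instantiates $P^{\pi_{k+1}}$ as a row-stochastic matrix on a three-state MDP and applies $E_k = (I - \gamma P^{\pi_{k+1}})^{-1}$ to a per-state loss vector; the kernel and loss values are arbitrary numbers chosen only to show the computation, not data from the paper.

```python
import numpy as np

gamma = 0.9
# Row-stochastic kernel P^{pi_{k+1}} of a 3-state MDP (arbitrary example values).
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.5, 0.5]])
loss = np.array([0.0, 0.2, 0.1])  # ell_{pi_k}(x; pi_{k+1}) at each state

# E_k = (I - gamma P)^{-1} = sum_{t >= 0} (gamma P)^t is well defined since gamma < 1.
E_k = np.linalg.inv(np.eye(3) - gamma * P)
bound = E_k @ loss                 # pointwise bound on V^{pi_k} - V^{pi_{k+1}}
print(bound)
# Row sums of E_k equal 1/(1 - gamma), so the bound never exceeds max(loss)/(1 - gamma).
print(bound <= loss.max() / (1 - gamma) + 1e-12)
```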
Since we are interested in bounding the final error in $\mu$-norm, which might be different from the sampling distribution $\rho$, we use one of the following assumptions.

Assumption 1. For any policy $\pi$ and any non-negative integers s and t, there exists a constant $C_{\mu,\rho}(s,t) < \infty$ such that $\mu (P^{\pi^*})^s (P^\pi)^t \leq C_{\mu,\rho}(s,t)\, \rho$. We define $C_{\mu,\rho} = (1-\gamma)^2 \sum_{s=0}^{\infty} \sum_{t=0}^{\infty} \gamma^{s+t} C_{\mu,\rho}(s,t)$.

Assumption 2. For any $x \in \mathcal{X}$ and any $a \in \mathcal{A}$, there exists a constant $C_\rho < \infty$ such that $p(\cdot|x,a) \leq C_\rho\, \rho(\cdot)$.

Note that concentrability coefficients similar to $C_{\mu,\rho}$ and $C_\rho$ were previously used in the $L_p$-analysis of fitted value iteration (Munos, 2007; Munos & Szepesvári, 2008) and approximate policy iteration (Antos et al., 2008). We now state our main result.

Theorem 2. Let $\Pi$ be a policy space with finite VC-dimension h and $\pi_K$ be the policy generated by DPI after K iterations. Let M be the number of rollouts per state-action pair and N be the number of samples drawn i.i.d. from a distribution $\rho$ over $\mathcal{X}$ at each iteration of DPI. Then, for any $\delta > 0$, we have

$\|V^* - V^{\pi_K}\|_{1,\mu} \leq \frac{C_{\mu,\rho}}{(1-\gamma)^2} \big[ d(\Pi,\mathcal{G}\Pi) + 2(\epsilon_1 + \epsilon_2) \big] + \frac{2\gamma^K}{1-\gamma} R_{\max}$   under Assumption 1,

$\|V^* - V^{\pi_K}\|_{\infty} \leq \frac{C_\rho}{(1-\gamma)^2} \big[ d(\Pi,\mathcal{G}\Pi) + 2(\epsilon_1 + \epsilon_2) \big] + \frac{2\gamma^K}{1-\gamma} R_{\max}$   under Assumption 2,

with probability $1-\delta$, where

$\epsilon_1 = 2\sqrt{2}\, Q_{\max} \sqrt{\frac{h \log\frac{2eN}{h} + \log\frac{8K}{\delta}}{N}}, \qquad \epsilon_2 = 2\, Q_{\max} \sqrt{\frac{\log\frac{4K}{\delta}}{MN}}.$

Proof. We have $C_{\mu,\rho} \leq C_\rho$ for any $\mu$. Thus, if the $L_1$-bound holds for any $\mu$, choosing $\mu$ to be a Dirac at each state implies that the $L_\infty$-bound holds as well. Hence, we only need to prove the $L_1$-bound. By taking the absolute value pointwise in Eq. 5 we obtain

$|V^* - V^{\pi_K}| \leq (\gamma P^{\pi^*})^K |V^* - V^{\pi_0}| + \sum_{k=0}^{K-1} (\gamma P^{\pi^*})^{K-k-1} (I - \gamma P^{\pi_{k+1}})^{-1} |\ell_{\pi_k}(\cdot;\pi_{k+1})|.$

From Assumption 1 and $|V^* - V^{\pi_0}| \leq \frac{2}{1-\gamma} R_{\max} \mathbb{1}$, by integrating both sides w.r.t. $\mu$ we have

$\|V^* - V^{\pi_K}\|_{1,\mu} \leq \gamma^K \frac{2 R_{\max}}{1-\gamma} + \sum_{k=0}^{K-1} \sum_{t=0}^{\infty} \gamma^{K-k-1} \gamma^t\, C_{\mu,\rho}(K-k-1, t)\, \|\ell_{\pi_k}(\cdot;\pi_{k+1})\|_{1,\rho}.$

The claim follows from the definition of $C_{\mu,\rho}$ and by bounding $\|\ell_{\pi_k}(\cdot;\pi_{k+1})\|_{1,\rho}$ using Theorem 1 with a union bound argument over the K iterations.
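For illustration only, the following sketch evaluates the right-hand side of the bound under Assumption 1, using placeholder values for the concentrability coefficient, the inherent greedy error, and the estimation terms; none of the numbers come from the paper, and the constants follow the reconstruction of the bound given above.

```python
def dpi_final_bound(C_mu_rho, d_inherent, eps1, eps2, gamma, K, Rmax):
    # C_{mu,rho}/(1-gamma)^2 * [d(Pi, G Pi) + 2(eps1 + eps2)] + 2 gamma^K Rmax / (1 - gamma)
    return (C_mu_rho / (1 - gamma) ** 2 * (d_inherent + 2 * (eps1 + eps2))
            + 2 * gamma ** K * Rmax / (1 - gamma))

# The second term vanishes with the number of iterations K, while the first term does not.
for K in (5, 10, 20, 40):
    print(K, round(dpi_final_bound(C_mu_rho=2.0, d_inherent=0.05, eps1=0.02,
                                   eps2=0.01, gamma=0.95, K=K, Rmax=1.0), 3))
```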
5. Approximation Error

In Section 4.1, we derived a bound for the expected error at each iteration k of DPI, $\|\ell_{\pi_k}(\cdot;\pi_{k+1})\|_{1,\rho}$. The approximation error term in this bound is the inherent greedy error of Definition 3, $d(\Pi, \mathcal{G}\Pi)$, which depends on the MDP and the richness of the hypothesis space $\Pi$ (see the Remarks of Theorem 1). The main question in this section is whether this approximation error can be made small by increasing the capacity of the policy space $\Pi$. The answer is not obvious: when the space of policies $\Pi$ grows, it can better approximate any greedy policy w.r.t. a policy in $\Pi$, but the number of such greedy policies grows as well. We start our analysis of this approximation error by introducing the notion of a universal family of policy spaces.

Definition 4. Let $\{\epsilon_n\}$ be a sequence of real values such that $\epsilon_n \to 0$. A sequence of policy spaces $\{\Pi_n\}$ is a universal family of policy spaces if for any $n > 0$ there exists a partition $\mathcal{P}_n = \{\mathcal{X}_i\}_{i=1}^{S_n}$ of $\mathcal{X}$ such that $\max_i \max_{x,y \in \mathcal{X}_i} \|x - y\| = \epsilon_n$, and for any assignment $b_1, \ldots, b_{S_n}$, $b_i \in \{0,1\}$, there exists $\pi \in \Pi_n$ such that $\pi(x) = b_i$ for all i and all $x \in \mathcal{X}_i$.

In other words, this definition requires $\Pi_n$ to be the space of policies induced by a partition $\mathcal{P}_n$ such that the largest diameter among the regions of the partition shrinks to zero, and for any assignment of actions to regions there exists a policy $\pi \in \Pi_n$ matching those actions. The main property of such a sequence of spaces is that any fixed policy can be approximated arbitrarily well as n increases. Although other definitions of universality could be used, Definition 4 seems natural and is satisfied by widely-used classifiers such as k-nearest neighbors, uniform grids, and histograms. In (Lazaric et al., 2010), we show that universal policy spaces are not a sufficient condition to guarantee that $d(\Pi_n, \mathcal{G}\Pi_n)$ converges to zero in any MDP. On the other hand, in the next section we show that if the MDP is Lipschitz, then $d(\Pi_n, \mathcal{G}\Pi_n)$ converges to zero for any universal family of policy spaces.

5.1. Lipschitz MDPs

In this section, we prove that for Lipschitz MDPs, $d(\Pi_n, \mathcal{G}\Pi_n)$ goes to zero when $\{\Pi_n\}$ is a universal family of classifiers. We start by defining a Lipschitz MDP.

Definition 5. An MDP is Lipschitz if both its transition probabilities and reward function are Lipschitz, i.e., for all $(B, x, x', a) \in \mathcal{B}(\mathcal{X}) \times \mathcal{X} \times \mathcal{X} \times \mathcal{A}$,

$|r(x,a) - r(x',a)| \leq L_r \|x - x'\|, \qquad |p(B|x,a) - p(B|x',a)| \leq L_p \|x - x'\|,$

with $L_r$ and $L_p$ being the Lipschitz constants of the reward and the transitions, respectively.

An important property of Lipschitz MDPs is that for any function $Q \in \mathcal{B}^Q(\mathcal{X} \times \mathcal{A}; Q_{\max})$, the function obtained by applying the Bellman operator $T^\pi$ to Q, $(T^\pi Q)(\cdot, a)$, is a Lipschitz function with constant $L = (L_r + \gamma Q_{\max} L_p)$ for any action $a \in \mathcal{A}$.

Theorem 3. Let $|\mathcal{A}| = 2$, $\{\Pi_n\}$ be a universal family of policy spaces, and $\mathcal{M}$ be a Lipschitz MDP. Then $\lim_{n\to\infty} d(\Pi_n, \mathcal{G}\Pi_n) = 0$.

Proof. We prove the following series of (in)equalities:

$d(\Pi_n, \mathcal{G}\Pi_n) = \sup_{\pi \in \Pi_n} \inf_{\pi' \in \Pi_n} \int_{\mathcal{X}} \ell_\pi(x;\pi')\, \rho(dx)$
$\overset{(a)}{=} \sup_{\pi \in \Pi_n} \inf_{\pi' \in \Pi_n} \int_{\mathcal{X}} \mathbb{1}\{(\mathcal{G}\pi)(x) \neq \pi'(x)\}\, \Delta_\pi(x)\, \rho(dx)$
$\overset{(b)}{=} \sup_{\pi \in \Pi_n} \inf_{\pi' \in \Pi_n} \sum_{i=1}^{S_n} \int_{\mathcal{X}_i} \mathbb{1}\{(\mathcal{G}\pi)(x) \neq \pi'(x)\}\, \Delta_\pi(x)\, \rho(dx)$
$\overset{(c)}{=} \sup_{\pi \in \Pi_n} \sum_{i=1}^{S_n} \inf_{a \in \mathcal{A}} \int_{\mathcal{X}_i} \mathbb{1}\{(\mathcal{G}\pi)(x) \neq a\}\, \Delta_\pi(x)\, \rho(dx)$
$\overset{(d)}{\leq} \sup_{\pi \in \Pi_n} \sum_{i=1}^{S_n} \inf_{a \in \mathcal{A}} \int_{\mathcal{X}_i} \mathbb{1}\{(\mathcal{G}\pi)(x) \neq a\}\, 2L \inf_{y:\, \Delta_\pi(y)=0} \|x - y\|\, \rho(dx)$
$\overset{(e)}{\leq} 2L\, \sup_{\pi \in \Pi_n} \sum_{i=1}^{S_n} \inf_{a \in \mathcal{A}} \int_{\mathcal{X}_i} \mathbb{1}\{(\mathcal{G}\pi)(x) \neq a\}\, \epsilon_n\, \rho(dx)$
$\overset{(f)}{\leq} 2L \epsilon_n \sum_{i=1}^{S_n} \int_{\mathcal{X}_i} \rho(dx) = 2L\epsilon_n.$

(a) We rewrite Definition 3 using Eq. 3, where $\Delta_\pi$ is the regret of choosing the wrong action.
(b) Since $\Pi_n$ contains the piecewise-constant policies induced by the partition $\mathcal{P}_n = \{\mathcal{X}_i\}$, we split the integral as a sum over the regions.
(c) Since the policies in $\Pi_n$ can take any action in each region, the policy minimizing the loss is the one which takes the best action in each region.
(d) Since $\mathcal{M}$ is Lipschitz, both $\max_{a \in \mathcal{A}} Q^\pi(\cdot,a)$ and $\min_{a' \in \mathcal{A}} Q^\pi(\cdot,a')$ are Lipschitz, and so $\Delta_\pi(\cdot)$ is 2L-Lipschitz. Furthermore, $\Delta_\pi$ is zero in all the states in which the policy $\mathcal{G}\pi$ changes (see Figure 2). Thus, for any state x, the value $\Delta_\pi(x)$ can be bounded using the Lipschitz property by taking y as the closest state to x in which $\Delta_\pi(y) = 0$.
(e) We notice that if $\pi'$ makes a mistake in a state $x \in \mathcal{X}_i$, then a state y in which $\mathcal{G}\pi$ changes must be in $\mathcal{X}_i$; otherwise, if $\mathcal{G}\pi$ is constant in the whole region $\mathcal{X}_i$, there exists an action a such that no mistake is made in the region. Thus, we can replace $\|x - y\|$ by the diameter of the region, which is bounded by $\epsilon_n$ by the definition of a universal family of policy spaces.
(f) We take $\mathbb{1}\{(\mathcal{G}\pi)(x) \neq a\} = 1$ in each region.

Figure 2 (plot not reproduced). An illustrative example used in the proof of Theorem 3: the action-value functions of a Lipschitz MDP for a policy $\pi$, $Q^\pi(\cdot,a_1)$ and $Q^\pi(\cdot,a_2)$ (top), the corresponding greedy policy $\mathcal{G}\pi$ (middle), and the regret of selecting the wrong action, $\Delta_\pi$ (bottom).

Theorem 3, together with the counterexample in (Lazaric et al., 2010), shows that the assumption on the policy space alone is not enough to guarantee a small approximation error, and additional assumptions on the smoothness of the MDP must be satisfied.

5.2. Consistency of DPI

A highly desirable property of any learning algorithm is consistency, i.e., as the number of samples grows to infinity, the error of the algorithm converges to zero. It can be seen that as the number of samples N grows in Theorem 1, $\epsilon_1$ and $\epsilon_2$ become arbitrarily small, and thus the expected error at each iteration, $\|\ell_{\pi_k}(\cdot;\pi_{k+1})\|_{1,\rho}$, converges to the inherent greedy error $d(\Pi, \mathcal{G}\Pi)$. We can conclude from the results of this section that DPI is not consistent in general, but it is consistent for the class of Lipschitz MDPs when a universal family of policy spaces is used. However, it is important to note that as we increase the index n, the capacity of the policy space $\Pi_n$, i.e., its VC-dimension h(n), might grow as well; thus, when the number of samples N goes to infinity, in order to drive the estimation error ($\epsilon_1$ in Theorem 1) to zero, we should guarantee that N grows faster than $VC(\Pi_n)$. More formally:

Corollary 1. Let $\mathcal{M}$ be a Lipschitz MDP, $\{\Pi_n\}$ be a universal family of policy spaces, $h(n) = VC(\Pi_n)$, and $\lim_{n,N\to\infty} h(n)/N = 0$. Then DPI is consistent:

$\lim_{n,N\to\infty} \|\ell_{\pi_k}(\cdot;\pi_{k+1})\|_{1,\rho} = 0, \quad \text{w.p. } 1.$

Finally, we notice that if n and N tend to infinity at each iteration, we have $V^{\pi_K} \to V^*$ almost surely when K tends to infinity.
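As an example of a universal family in the sense of Definition 4, the sketch below implements piecewise-constant two-action policies over a uniform grid on $\mathcal{X} = [0,1)^d$; the cell diameter $\sqrt{d}/n$ plays the role of $\epsilon_n$, and every $\{0,1\}$ assignment over the $n^d$ cells is realized by some policy in the family. Uniform grids are one of the examples mentioned after Definition 4; the class name and interface are assumptions of this sketch.

```python
import numpy as np

class GridPolicy:
    # A member of Pi_n: an action in {0, 1} is assigned to each cell of a uniform grid on [0, 1)^d.
    def __init__(self, actions_per_cell):
        self.table = np.asarray(actions_per_cell)   # shape (n, ..., n), entries in {0, 1}
        self.n = self.table.shape[0]

    def __call__(self, x):
        # Locate the grid cell containing the state x and return the action assigned to it.
        idx = tuple(np.minimum((np.asarray(x) * self.n).astype(int), self.n - 1))
        return int(self.table[idx])

# n = 4 cells per dimension in d = 2: cell diameter sqrt(2)/4, so eps_n -> 0 as n grows.
assignment = np.random.default_rng(0).integers(0, 2, size=(4, 4))
pi = GridPolicy(assignment)
print(pi([0.1, 0.9]), pi([0.7, 0.2]))
```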
6. Discussion and Extensions

In this paper, we derived finite-sample performance bounds for a cost-sensitive classification-based approximate policy iteration algorithm. To the best of our knowledge, this is the first complete finite-sample analysis for this class of API algorithms. The main characteristic of DPI is that each classification error is weighed by its actual regret, i.e., the difference between the action-values of the greedy action and of the action chosen by DPI. Our results extend the theoretical analysis in (Fern et al., 2006) and (Dimitrakakis & Lagoudakis, 2008a) by 1) having a full bound instead of being limited to a one-step policy update, 2) considering any policy space instead of finite or specific policy spaces, and 3) deriving a bound which does not depend on the Q-advantage, i.e., the minimum Q-value gap between a greedy and a sub-greedy action over the state space, which can be arbitrarily small in a large class of MDPs. Note that the final bound in Fern et al. (2006) depends inversely on the Q-advantage. We also analyzed the consistency of DPI and showed that although it is not consistent in general, it is consistent for the class of Lipschitz MDPs. This is similar to the consistency results for fitted value iteration in Munos & Szepesvári (2008).

One of the main motivations of this work is to have a better understanding of how the classification-based API methods compare with their widely-used regression-based counterparts. It is interesting to note that the bound of Eq. 4 shares the same structure as the error bounds for the API algorithm in Antos et al. (2008) and fitted value iteration (Munos & Szepesvári, 2008). The error at each iteration can be decomposed into an approximation error, which depends on the MDP and the richness of the hypothesis space (the inherent greedy error in Eq. 4, and the inherent Bellman error in Antos et al. (2008) and Munos & Szepesvári (2008)), and an estimation error, which mainly depends on the number of samples and rollouts. The difference between the approximation errors of the two approaches depends mainly on how well the hypothesis space fits the MDP at hand. This confirms the intuition that whenever the policies generated by policy iteration are easier to represent and learn than their value functions, a classification-based approach can be preferable to regression-based methods.
Extension to multiple actions. When $|\mathcal{A}| = 2$, the expected error in Eq. 2 can be written as

$\|\ell_{\pi_k}(\cdot;\pi)\|_{1,\rho} = \int_{\mathcal{X}} \mathbb{1}\{(\mathcal{G}\pi_k)(x) \neq \pi(x)\}\, \Delta_{\pi_k}(x)\, \rho(dx),$

where $\Delta_{\pi_k}$ is defined by Eq. 3. Thus, the policy improvement step becomes a weighted binary classification problem in which each state $x \in D$ has a weight $\Delta_{\pi_k}(x)$. DPI can be extended to multiple actions by writing the expected error as

$\|\ell_{\pi_k}(\cdot;\pi)\|_{1,\rho} = \sum_{a \in \mathcal{A}} \int_{\mathcal{X}} \mathbb{1}\{(\mathcal{G}\pi_k)(x,a) \neq \pi(x,a)\}\, \Big[ \max_{a' \in \mathcal{A}} Q^{\pi_k}(x,a') - Q^{\pi_k}(x,a) \Big]\, \rho(dx),$

where $(\mathcal{G}\pi_k)(x,a)$ is 1 if a is the greedy action in x and 0 otherwise, and similarly $\pi(x,a)$ is 1 if $\pi$ selects a in x and 0 otherwise. As can be noticed, the policy improvement step of DPI still remains a weighted binary classification problem in which each pair $(x,a) \in D \times \mathcal{A}$ is weighted by $\max_{a' \in \mathcal{A}} Q^{\pi_k}(x,a') - Q^{\pi_k}(x,a)$. This can be solved by any weighted binary classification algorithm, as long as it is guaranteed to return 1 for exactly one action at each state $x \in \mathcal{X}$. In this case, all the theoretical analysis presented in the paper can be extended to multiple actions (see the appendix of (Lazaric et al., 2010) for a sketch of the proofs). However, as there are still many open theoretical and practical issues to be addressed in multi-label classification, extending DPI to multiple actions calls for additional work both in terms of implementation and theoretical analysis.
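A minimal sketch of the weighted training set that this multi-action reduction produces: for each state in the rollout set and each action, the example $((x_i, a), (\mathcal{G}\pi_k)(x_i, a))$ receives the weight $\max_{a'} \hat{Q}^{\pi_k}(x_i, a') - \hat{Q}^{\pi_k}(x_i, a)$. The array layout and names are assumptions; any cost-sensitive binary classifier constrained to predict 1 for exactly one action per state could consume this set.

```python
import numpy as np

def weighted_training_set(Q_hat):
    # Cost-sensitive training set for the multi-action extension.
    # Q_hat: (N, |A|) array of rollout estimates \hat Q^{pi_k}(x_i, a).
    N, A = Q_hat.shape
    greedy = Q_hat.argmax(axis=1)
    examples = []
    for i in range(N):
        for a in range(A):
            label = int(a == greedy[i])                    # (G pi_k)(x_i, a)
            weight = float(Q_hat[i].max() - Q_hat[i, a])   # regret of choosing a in x_i
            examples.append(((i, a), label, weight))
    return examples

# Greedy actions get weight zero, so only costly mistakes contribute to the weighted loss.
for example in weighted_training_set(np.array([[1.0, 0.5, 0.2], [0.1, 0.4, 0.3]])):
    print(example)
```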
Acknowledgments

This work was supported by the French National Research Agency (ANR), project EXPLO-RA, n. ANR-08-COSI-004.

References

Antos, A., Szepesvári, Cs., and Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89-129, 2008.

Bagnell, J., Kakade, S., Ng, A., and Schneider, J. Policy search by dynamic programming. In Proceedings of Advances in Neural Information Processing Systems 16. MIT Press, 2003.

Bradtke, S. and Barto, A. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33-57, 1996.

Dimitrakakis, C. and Lagoudakis, M. Algorithms and bounds for sampling-based approximate policy iteration. In Recent Advances in Reinforcement Learning (EWRL 2008). Springer, 2008a.

Dimitrakakis, C. and Lagoudakis, M. Rollout sampling approximate policy iteration. Machine Learning, 72(3):157-171, 2008b.

Fern, A., Yoon, S., and Givan, R. Approximate policy iteration with a policy language bias. In Proceedings of Advances in Neural Information Processing Systems 16, 2004.

Fern, A., Yoon, S., and Givan, R. Approximate policy iteration with a policy language bias: Solving relational Markov decision processes. Journal of Artificial Intelligence Research, 25:85-118, 2006.

Howard, R. A. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA, 1960.

Lagoudakis, M. and Parr, R. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003a.

Lagoudakis, M. and Parr, R. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of ICML, pp. 424-431, 2003b.

Langford, J. and Zadrozny, B. Relating reinforcement learning performance to classification performance. In Proceedings of ICML, pp. 473-480, 2005.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Analysis of a classification-based policy iteration algorithm. Technical Report 00482065, INRIA, 2010.

Li, L., Bulitko, V., and Greiner, R. Focus of attention in reinforcement learning. Journal of Universal Computer Science, 13(9):1246-1269, 2007.

Munos, R. Performance bounds in Lp norm for approximate value iteration. SIAM Journal on Control and Optimization, 2007.

Munos, R. and Szepesvári, Cs. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815-857, 2008.

Vapnik, V. and Chervonenkis, A. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264-280, 1971.