Bounding Performance Loss in Approximate MDP Homomorphisms

Jonathan J. Taylor, Dept. of Computer Science, University of Toronto, Toronto, Canada, M5S 3G4, jonathan.taylor@utoronto.ca
Doina Precup, School of Computer Science, McGill University, Montreal, Canada, H3A 2A7, dprecup@cs.mcgill.ca
Prakash Panangaden, School of Computer Science, McGill University, Montreal, Canada, H3A 2A7, prakash@cs.mcgill.ca

Abstract

We define a metric for measuring behavior similarity between states in a Markov decision process (MDP), in which action similarity is taken into account. We show that the kernel of our metric corresponds exactly to the classes of states defined by MDP homomorphisms (Ravindran & Barto, 2003). We prove that the difference in the optimal value function of different states can be upper-bounded by the value of this metric, and that the bound is tighter than previous bounds provided by bisimulation metrics (Ferns et al., 2004; 2005). Our results hold both for discrete and for continuous actions. We provide an algorithm for constructing approximate homomorphisms, by using this metric to identify states that can be grouped together, as well as actions that can be matched. Previous research on this topic is based mainly on heuristics.

1 Introduction

Markov decision processes (MDPs) are a very popular formalism for decision making under uncertainty (Puterman, 1994). A significant problem is computing the optimal strategy when the state and action spaces are very large and/or continuous. A popular approach is state abstraction, in which states are grouped together into partitions, or aggregates, and the optimal policy is computed over these. Li et al. (2006) provide a comparative survey of approaches to state abstraction. The work we present in this paper bridges two such methods: bisimulation-based approaches and methods based on MDP homomorphisms.

Bisimulation is a well-known, well-studied notion of behavioral equivalence between systems (Larsen & Skou, 1991; Milner, 1995), which has been specialized for MDPs by Givan et al. (2003). In recent work, Ferns et al. (2004, 2005, 2006) introduced (pseudo)metrics for measuring the similarity of states, which provide approximations to bisimulation. One of the disadvantages of bisimulation and the corresponding metrics is that they require the behavior to match for exactly the same actions. However, in many cases of practical interest, actions with the exact same label may not match, but the environment may contain symmetries and other types of special structure, which may allow correspondences between states by matching their behavior with different actions. This idea was formalized by Ravindran & Barto (2003) with the concept of MDP homomorphisms. MDP homomorphisms specify a map matching equivalent states as well as equivalent actions in such states. This matching can then be used to transfer policies between different MDPs. However, like any equivalence relation on probabilistic systems, MDP homomorphisms are brittle: a small change in the transition probabilities or the rewards can cause two previously equivalent state-action pairs to become distinct. This implies that such approaches do not work well in situations in which the model of the system is estimated from data. As a solution to this problem, Ravindran & Barto (2004) proposed using approximate homomorphisms, which allow aggregating states that are not exactly equivalent.
They define an MDP over these partitions and quantify the approximation loss resulting from using this MDP compared to the original system. As expected, the bound depends on the quality of the partition. Subsequent work (e.g., Wolfe & Barto, 2006) constructs such partitions heuristically. In this paper, we attempt to construct provably good approximate MDP homomorphisms from first principles. First, we relate the notion of MDP homomorphisms to the concept of lax bisimulation, explored recently in the process algebra literature (Arun-Kumar, 2006). This allows us to define a metric on states, similarly to existing bisimulation metrics. Interestingly, this approach works both for discrete and for continuous actions. We show that the difference in the optimal value function of two states is bounded above by this metric. This allows us to provide a state aggregation algorithm with provable approximation guarantees. We illustrate empirically the fact that this approach can provide much better state space compression than the use of existing bisimulation metrics.

2 Background

A finite Markov decision process (MDP) is a tuple $\langle S, A, P, R\rangle$, where $S$ is a finite set of states, $A$ is a set of actions, $P : S \times A \times S \to [0,1]$ is the transition model, with $P(s,a,s')$ denoting the probability of transitioning from state $s$ to $s'$ under action $a$, and $R : S \times A \to \mathbb{R}$ is the reward function, with $R(s,a)$ being the reward for performing action $a$ in state $s$. For the purpose of this paper, the state space $S$ is assumed to be finite, but the action set $A$ can be finite or infinite (as will be detailed later). We assume without loss of generality that rewards are bounded in $[0,1]$.

A deterministic policy $\pi : S \to A$ specifies which action should be taken in every state. By following policy $\pi$ from state $s$, an agent can expect a value of $V^{\pi}(s) = E\left[\sum_{t\ge 1}\gamma^{t-1} r_t \mid s_0 = s, \pi\right]$, where $\gamma \in (0,1)$ is a discount factor and $r_t$ is the sample reward received at time $t$. In a finite MDP, the optimal value function $V^*$ is unique and satisfies the following formulas, known as the Bellman optimality equations:
$$V^*(s) = \max_{a\in A}\left(R(s,a) + \gamma\sum_{s'\in S} P(s,a,s')V^*(s')\right), \quad \forall s\in S$$
If the action space is continuous, we will assume that it is compact, so the max can be taken and the above results still hold (Puterman, 1994). Given the optimal value function, an optimal policy is easily inferred by simply taking at every state the greedy action with respect to the one-step lookahead value. It is well known that the optimal value function can be computed by turning the above equation into an update rule, which can be applied iteratively. Ideally, if the state space is very large, "similar" states should be grouped together in order to speed up this type of computation.

Bisimulation for MDPs (Givan et al., 2003) is a notion of behavioral equivalence between states. A relation $E \subseteq S \times S$ is a bisimulation relation if:
$$sEu \iff \forall a\in A.\;\big(R(s,a) = R(u,a) \text{ and } \forall X \in S/E.\;\Pr(X\mid s,a) = \Pr(X\mid u,a)\big)$$
where $S/E$ denotes the partition of $S$ into $E$-equivalent subsets of states. The relation $\sim$ is the union of all bisimulation relations, and two states in an MDP are said to be bisimilar if $s \sim u$. From this definition, it follows that bisimilar states can match each other's actions to achieve the same rewards. Hence, bisimilar states have the same optimal value (Givan et al., 2003). However, bisimulation is not robust to small changes in the rewards or the transition probabilities. One way to avoid this problem is to quantify the similarity between states using a (pseudo)metric.
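To make the role of the exact equality tests concrete, the following sketch (our illustration, not part of the paper; the function name is_bisimulation and the array layout are our own choices) checks whether a given partition of a finite MDP is a bisimulation by comparing rewards and block transition probabilities. With tol = 0 this is exact bisimulation; any perturbation of P or R, however small, can make the check fail, which is precisely the brittleness that the metrics discussed below are designed to address.

```python
import numpy as np

def is_bisimulation(P, R, blocks, tol=0.0):
    """Check whether a partition of the state space is a bisimulation.

    P: array of shape (S, A, S), P[s, a, t] = transition probability.
    R: array of shape (S, A), R[s, a] = expected reward.
    blocks: list of lists of state indices forming a partition of range(S).
    tol: tolerance on the equality tests; exact bisimulation uses tol = 0.
    """
    num_actions = R.shape[1]
    for block in blocks:
        ref = block[0]
        for s in block[1:]:
            for a in range(num_actions):
                # The same action label must give the same reward...
                if abs(R[s, a] - R[ref, a]) > tol:
                    return False
                # ...and the same probability of jumping into every block.
                for other in blocks:
                    if abs(P[s, a, other].sum() - P[ref, a, other].sum()) > tol:
                        return False
    return True
```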
Ferns et al. (2004) proposed a bisimulation metric, defined as the least fixed point of the following operator on the lattice of 1-bounded metrics $d : S \times S \to [0,1]$:
$$G(d)(s,u) = \max_{a\in A}\left(c_r |R(s,a) - R(u,a)| + c_p K(d)(P(s,a,\cdot), P(u,a,\cdot))\right) \quad (1)$$
The first term above measures reward similarity. The second term is the Kantorovich metric between the transition distributions of the two states. Given probability distributions $P$ and $Q$ over the state space $S$, and a semimetric $d$ on $S$, the Kantorovich metric $K(d)(P,Q)$ is defined by the following linear program:
$$\max_{v_i} \sum_{i=1}^{|S|} (P(s_i) - Q(s_i))\, v_i \quad \text{subject to: } \forall i,j.\; v_i - v_j \le d(s_i, s_j) \text{ and } \forall i.\; 0 \le v_i \le 1$$
which has the following equivalent dual program:
$$\min_{\lambda_{kj}} \sum_{k,j=1}^{|S|} \lambda_{kj}\, d(s_k, s_j) \quad \text{subject to: } \forall k.\;\sum_j \lambda_{kj} = P(s_k),\; \forall j.\;\sum_k \lambda_{kj} = Q(s_j) \text{ and } \forall k,j.\;\lambda_{kj} \ge 0$$
Ferns et al. (2004) showed that by applying (1) iteratively, the least fixed point $e_{fix}$ can be obtained, and that $s$ and $u$ are bisimilar if and only if $e_{fix}(s,u) = 0$. In other words, bisimulation is the kernel of this metric.

3 Lax bisimulation

In many cases of practical interest, actions with exactly the same label may not match, but the environment may contain symmetries and other types of special structure, which may allow correspondences between different actions at certain states. For example, consider the environment in Figure 1. Because of symmetry, going south in state N6 is "equivalent" to going north in state S6. However, no two distinct states are bisimilar. Recent work in process algebra has rethought the definition of bisimulation to allow certain distinct actions to be essentially equivalent (Arun-Kumar, 2006). Here, we define lax bisimulation in the context of MDPs.

Definition 1. A relation $B$ is a lax (probabilistic) bisimulation relation if whenever $sBu$ we have that: for every action $a$ there is an action $b$ such that $R(s,a) = R(u,b)$ and $\Pr(X\mid s,a) = \Pr(X\mid u,b)$ for all $B$-closed sets $X$, and vice versa. Lax bisimulation $B$ is the union of all the lax bisimulation relations.

It is easy to see that $B$ is an equivalence relation, and we denote the set of its equivalence classes by $S/B$. Note that the definition above assumes that any action can be matched by any other action. However, the set of actions that can be used to match another action can be restricted based on prior knowledge. Lax bisimulation is very closely related to the idea of MDP homomorphisms (Ravindran & Barto, 2003). We now formally establish this connection.

Definition 2. (Ravindran & Barto, 2003) An MDP homomorphism $h$ from $M = \langle S, A, P, R\rangle$ to $M' = \langle S', A', P', R'\rangle$ is a tuple of surjections $\langle f, \{g_s : s\in S\}\rangle$ with $h(s,a) = (f(s), g_s(a))$, where $f : S \to S'$ and $g_s : A \to A'$, such that $R(s,a) = R'(f(s), g_s(a))$ and $P(s,a,f^{-1}(f(s'))) = P'(f(s), g_s(a), f(s'))$.

Hence, a homomorphism puts states in correspondence, and has a state-dependent mapping between actions as well. We now show that homomorphisms are identical to lax probabilistic bisimulation.

Theorem 3. Two states $s$ and $u$ are lax bisimilar if and only if they are related by some MDP homomorphism $\langle f, \{g_s : s\in S\}\rangle$, in the sense that $f(s) = f(u)$.

Proof: For the first direction, let $h$ be an MDP homomorphism and define the relation $B$ such that $sBu$ iff $f(s) = f(u)$. Let $a \in A$. Since $g_u$ is a surjection onto $A'$, there must be some $b \in A$ with $g_u(b) = g_s(a)$. Hence,
$$R(s,a) = R'(f(s), g_s(a)) = R'(f(u), g_u(b)) = R(u,b)$$
Let $X$ be a non-empty $B$-closed set, so that $X = f^{-1}(f(s'))$ for some $s'$. Then:
$$P(s,a,X) = P'(f(s), g_s(a), f(s')) = P'(f(u), g_u(b), f(s')) = P(u,b,X)$$
so $B$ is a lax bisimulation relation.

For the other direction, let $B$ be a lax bisimulation relation. We construct an MDP homomorphism in which $sBu \Rightarrow f(s) = f(u)$. Consider the partition $S/B$ induced by the equivalence relation $B$ on the set $S$. For each equivalence class $X \in S/B$, we choose a representative state $s_X \in X$ and define $f(s_X) = s_X$ and $g_{s_X}(a) = a$ for all $a \in A$. Then, for any $s \in X$, we define $f(s) = s_X$. From Definition 1, for every $a$ there is a $b$ such that $R(s,a) = R(s_X,b)$ and $\Pr(X'\mid s,a) = \Pr(X'\mid s_X,b)$ for all $X' \in S/B$; hence, we set $g_s(a) = b$. Then we have:
$$P'(f(s), g_s(a), f(s')) = P'(s_X, b, f(s')) = P(s_X, b, f^{-1}(f(s'))) = P(s, a, f^{-1}(f(s')))$$
Also, $R'(f(s), g_s(a)) = R'(s_X, b) = R(s_X, b) = R(s,a)$. Hence, we have constructed a homomorphism.
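Both the bisimulation metric of Section 2 and the lax metric defined in the next section use the Kantorovich metric as their probability-matching term. As a concrete point of reference, here is a minimal sketch (our illustration, not code from the paper) that evaluates $K(d)(P,Q)$ by solving the dual transportation program above with scipy's generic LP solver; a dedicated transportation or network-flow solver would be faster in practice.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich(p, q, d):
    """Kantorovich distance K(d)(p, q) between two distributions over n states.

    p, q: length-n probability vectors; d: n x n ground (semi)metric matrix.
    Solves the dual transportation LP:
        min  sum_{k,j} lam[k, j] * d[k, j]
        s.t. sum_j lam[k, j] = p[k],  sum_k lam[k, j] = q[j],  lam >= 0.
    """
    n = len(p)
    cost = np.asarray(d, dtype=float).reshape(n * n)
    A_eq = np.zeros((2 * n, n * n))
    for k in range(n):
        A_eq[k, k * n:(k + 1) * n] = 1.0      # row sums must equal p
    for j in range(n):
        A_eq[n + j, j::n] = 1.0               # column sums must equal q
    b_eq = np.concatenate([np.asarray(p, float), np.asarray(q, float)])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun
```

Calling kantorovich(P[s, a], P[u, b], d) then gives the probability term of the state-action distance $\delta(d)((s,a),(u,b))$ introduced next.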
4 A metric for lax bisimulation

We will now define a lax bisimulation metric for measuring similarity between state-action pairs, following the approach used by Ferns et al. (2004) for defining the bisimulation metric between states. We want to say that states $s$ and $u$ are close exactly when every action of one state is close to some action available in the other state. In order to capture this meaning, we first define similarity between state-action pairs, and then lift this to states using the Hausdorff metric (Munkres, 1999).

Definition 4. Let $c_r, c_p \ge 0$ be constants with $c_r + c_p \le 1$. Given a 1-bounded semimetric $d$ on $S$, the metric $\delta(d) : (S\times A) \times (S\times A) \to [0,1]$ is defined as follows:
$$\delta(d)((s,a),(u,b)) = c_r |R(s,a) - R(u,b)| + c_p K(d)(P(s,a,\cdot), P(u,b,\cdot))$$

We now have to measure the distance between the set of actions available at state $s$ and the set of actions available at state $u$. Given a metric between pairs of points, the Hausdorff metric can be used to measure the distance between sets of points. It is defined as follows.

Definition 5. Given a 1-bounded metric space $(M,d)$, let $\mathcal{P}(M)$ be the set of compact subsets of $M$ (e.g., closed and bounded subsets of $\mathbb{R}^n$). The Hausdorff metric $H(d) : \mathcal{P}(M) \times \mathcal{P}(M) \to [0,1]$ is defined as:
$$H(d)(X,Y) = \max\left(\sup_{x\in X}\inf_{y\in Y} d(x,y),\; \sup_{y\in Y}\inf_{x\in X} d(x,y)\right)$$

Definition 6. Denote $X_s = \{(s,a) \mid a\in A\}$. Let $\mathcal{M}$ be the set of all semimetrics on $S$. We define the operator $F : \mathcal{M} \to \mathcal{M}$ as $F(d)(s,u) = H(\delta(d))(X_s, X_u)$.

We note that the same definition can be applied both for discrete and for compact continuous action spaces. If the action set is compact then $X_s = \{s\}\times A$ is also compact, so the Hausdorff metric is still well defined. For simplicity, we consider the discrete case, so that max and min are attained.

Theorem 7. $F$ is monotonic and has a least fixed point $d_{fix}$, in which $d_{fix}(s,u) = 0$ iff $s$ and $u$ are lax bisimilar.

The proof is similar in flavor to (Ferns et al., 2004) and we omit it for lack of space.
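Theorem 7 only asserts the existence of the least fixed point; a direct (if expensive) way to approximate it is to start from the zero metric and apply $F$ repeatedly until the iterates stop changing. The sketch below is our illustration of such an iteration for a finite MDP; it reuses the hypothetical kantorovich helper from above, and the names lax_metric_step and lax_metric are our own.

```python
import numpy as np

def lax_metric_step(d, P, R, c_r, c_p):
    """One application of the operator F: F(d)(s, u) = H(delta(d))(X_s, X_u)."""
    S, A = R.shape
    new_d = np.zeros((S, S))
    for s in range(S):
        for u in range(s + 1, S):
            # delta(d)((s, a), (u, b)) for every pair of actions (a, b)
            delta = np.zeros((A, A))
            for a in range(A):
                for b in range(A):
                    delta[a, b] = (c_r * abs(R[s, a] - R[u, b])
                                   + c_p * kantorovich(P[s, a], P[u, b], d))
            # Hausdorff distance between the two finite sets X_s and X_u
            new_d[s, u] = new_d[u, s] = max(delta.min(axis=1).max(),
                                            delta.min(axis=0).max())
    return new_d

def lax_metric(P, R, c_r=0.1, c_p=0.9, tol=1e-6, max_iter=500):
    """Approximate d_fix by iterating F from the all-zero semimetric."""
    d = np.zeros((R.shape[0], R.shape[0]))
    for _ in range(max_iter):
        new_d = lax_metric_step(d, P, R, c_r, c_p)
        if np.abs(new_d - d).max() < tol:
            break
        d = new_d
    return new_d
```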
As both $e_{fix}$ and $d_{fix}$ quantify the difference in behaviour between states, it is not surprising that they constrain the difference in optimal values. Indeed, the bound below was previously shown in (Ferns et al., 2004) for $e_{fix}$, but we also show that our metric $d_{fix}$ is tighter.

Theorem 8. Let $e_{fix}$ be the metric defined in (Ferns et al., 2004). Then we have:
$$c_r |V^*(s) - V^*(u)| \le d_{fix}(s,u) \le e_{fix}(s,u)$$

Proof: We show by induction on $n$ that, for the sequence of iterates $V_n$ encountered during value iteration, $c_r|V_n(s) - V_n(u)| \le d_n(s,u) \le e_n(s,u)$; the result then follows by taking limits. For the base case, note that $c_r|V_0(s) - V_0(u)| = d_0(s,u) = e_0(s,u) = 0$. Assume the claim holds for $n$. By the monotonicity of $F$, we have that $F(d_n)(s,u) \le F(e_n)(s,u)$.

Now, for any action $a$, $\delta(e_n)((s,a),(u,a)) \le G(e_n)(s,u)$, which implies:
$$F(e_n)(s,u) \le \max\left(\max_a \delta(e_n)((s,a),(u,a)),\; \max_b \delta(e_n)((s,b),(u,b))\right) \le G(e_n)(s,u)$$
so $d_{n+1}(s,u) = F(d_n)(s,u) \le F(e_n)(s,u) \le G(e_n)(s,u) = e_{n+1}(s,u)$.

For the first inequality, assume without loss of generality that $V_{n+1}(s) > V_{n+1}(u)$, and let $a$ range over the actions of $s$ and $b$ over those of $u$ (taking $a$ to be the maximizing action for $s$, and noting that the maximum for $u$ is at least its value under any $b$). Then:
$$c_r|V_{n+1}(s) - V_{n+1}(u)| = c_r\left|\max_a\Big(R(s,a) + \gamma\sum_{s'} P(s,a,s')V_n(s')\Big) - \max_b\Big(R(u,b) + \gamma\sum_{s'} P(u,b,s')V_n(s')\Big)\right|$$
$$\le c_r\max_a\min_b\left|\Big(R(s,a) + \gamma\sum_{s'} P(s,a,s')V_n(s')\Big) - \Big(R(u,b) + \gamma\sum_{s'} P(u,b,s')V_n(s')\Big)\right|$$
$$\le \max_a\min_b\left(c_r|R(s,a) - R(u,b)| + c_p\Big|\sum_{s'}\big(P(s,a,s') - P(u,b,s')\big)\frac{\gamma c_r}{c_p}V_n(s')\Big|\right)$$
Now, since $\gamma \le c_p$ and $c_r \le 1 - c_p \le 1 - \gamma$, we have $0 \le \frac{\gamma c_r}{c_p}V_n(s') \le \frac{\gamma c_r}{c_p}\cdot\frac{1-\gamma^n}{1-\gamma} \le 1$, and by the induction hypothesis
$$\frac{\gamma c_r}{c_p}\big(V_n(s') - V_n(s'')\big) \le \frac{\gamma}{c_p}\, d_n(s',s'') \le d_n(s',s'')$$
So $\left\{\frac{\gamma c_r}{c_p}V_n(s') : s'\in S\right\}$ is a feasible solution to the LP for $K(d_n)(P(s,a,\cdot), P(u,b,\cdot))$. We then continue the inequality:
$$c_r|V_{n+1}(s) - V_{n+1}(u)| \le \max_a\min_b\left(c_r|R(s,a) - R(u,b)| + c_p K(d_n)(P(s,a,\cdot),P(u,b,\cdot))\right) \le F(d_n)(s,u) = d_{n+1}(s,u)$$

5 State aggregation

We now show how we can use this notion of lax bisimulation metrics to construct approximate MDP homomorphisms. First, note that an MDP homomorphism can be used to provide a state space aggregation, as follows.

Definition 9. Given an MDP $M$ and a homomorphism, an aggregated MDP $M'$ is given by $(S', A, \{P'(C,a,D) : a\in A;\, C,D\in S'\}, \{R'(C,a) : a\in A,\, C\in S'\}, \rho, \{g_s : s\in S\})$, where $S'$ is a partition of $S$, $\rho : S\to S'$ maps states to their aggregates, each $g_s : A\to A$ relabels the action set, and we have that for all $C,D\in S'$ and $a\in A$:
$$P'(C,a,D) = \frac{1}{|C|}\sum_{s\in C} P(s, g_s(a), D) \quad\text{and}\quad R'(C,a) = \frac{1}{|C|}\sum_{s\in C} R(s, g_s(a))$$

Note that all the states in a partition class have actions that are relabelled specifically so that they can exactly match each other's behaviour. Thus, a policy in the aggregate MDP can be lifted to the original MDP by using this relabelling.

Definition 10. If $M'$ is an aggregation of an MDP $M$ and $\pi'$ is a policy in $M'$, then the lifted policy $\pi$ is defined by $\pi(s) = g_s(\pi'(\rho(s)))$.
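As a small illustration of Definitions 9 and 10 (our sketch, not code from the paper; the partition blocks and the relabelling table g are assumed to be given, for instance produced by the aggregation procedure discussed below), the aggregate model averages the relabelled rewards and block transition probabilities over each class, and an aggregate policy is lifted by applying $g_s$ at every state.

```python
import numpy as np

def aggregate_mdp(P, R, blocks, g):
    """Build the aggregated MDP of Definition 9.

    P: (S, A, S) transition model, R: (S, A) rewards.
    blocks: list of lists of state indices (the partition S').
    g: (S, A) integer array with g[s, a] = relabelled action g_s(a).
    Returns (P_agg, R_agg, rho), where rho[s] is the class index of state s.
    """
    S, A = R.shape
    C = len(blocks)
    rho = np.zeros(S, dtype=int)
    for c, block in enumerate(blocks):
        rho[block] = c
    P_agg = np.zeros((C, A, C))
    R_agg = np.zeros((C, A))
    for c, block in enumerate(blocks):
        for a in range(A):
            for s in block:
                R_agg[c, a] += R[s, g[s, a]] / len(block)
                for dst, dst_block in enumerate(blocks):
                    P_agg[c, a, dst] += P[s, g[s, a], dst_block].sum() / len(block)
    return P_agg, R_agg, rho

def lift_policy(pi_agg, rho, g):
    """Lift a deterministic aggregate policy (Definition 10): pi(s) = g_s(pi'(rho(s)))."""
    return np.array([g[s, pi_agg[rho[s]]] for s in range(len(rho))])
```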
Using a lax bisimulation metric, it is possible to choose appropriate relabellings so that states within a partition class can approximately match each other's actions.

Definition 11. Given a lax bisimulation metric $d$ and an MDP $M$, we say that an aggregated MDP $M'$ is $d$-consistent if each aggregate class $C$ has a state $s^* \in C$, called the representative of $C$, such that for all $u\in C$ and all $a\in A$:
$$\delta(d)((s^*, g_{s^*}(a)), (u, g_u(a))) \le F(d)(s^*, u)$$

When the relabellings are chosen in this way, we can solve for the optimal value function of the aggregated MDP and be assured that, for each state, its true optimal value is close to the optimal value of the class in which it is contained.

Theorem 12. If $M'$ is a $d$-consistent aggregation of an MDP $M$ and $n\in\mathbb{N}$, then for all $s\in S$ we have that
$$c_r|V_n(\rho(s)) - V_n(s)| \le m(\rho(s)) + M\sum_{k=1}^{n-1}\gamma^{n-k}$$
Furthermore, if $\pi'$ is a policy in $M'$ and $\pi$ is the corresponding lifted policy in $M$, then:
$$c_r|V_n^{\pi'}(\rho(s)) - V_n^{\pi}(s)| \le m(\rho(s)) + M\sum_{k=1}^{n-1}\gamma^{n-k}$$
where $m(C) = 2\max_{u\in C} d(s^*, u)$ with $s^*$ the representative state of $C$, and $M = \max_C m(C)$.

Proof:
$$|V_{n+1}(\rho(s)) - V_{n+1}(s)| = \left|\max_a\Big(R'(\rho(s),a) + \gamma\sum_{D\in S'}P'(\rho(s),a,D)V_n(D)\Big) - \max_a\Big(R(s,g_s(a)) + \gamma\sum_{s'}P(s,g_s(a),s')V_n(s')\Big)\right|$$
(since $g_s$ merely relabels the action set, maximizing over $g_s(a)$ in the second term is the same as maximizing over $a$)
$$\le \max_a\frac{1}{|\rho(s)|}\sum_{u\in\rho(s)}\left(|R(u,g_u(a)) - R(s,g_s(a))| + \gamma\Big|\sum_{D\in S'}P(u,g_u(a),D)V_n(D) - \sum_{s'}P(s,g_s(a),s')V_n(s')\Big|\right)$$
$$= \max_a\frac{1}{|\rho(s)|}\sum_{u\in\rho(s)}\left(|R(u,g_u(a)) - R(s,g_s(a))| + \gamma\Big|\sum_{s'}\big(P(u,g_u(a),s')V_n(\rho(s')) - P(s,g_s(a),s')V_n(s')\big)\Big|\right)$$
$$\le \max_a\frac{1}{|\rho(s)|}\sum_{u\in\rho(s)}\left(|R(u,g_u(a)) - R(s,g_s(a))| + \gamma\Big|\sum_{s'}\big(P(u,g_u(a),s') - P(s,g_s(a),s')\big)V_n(s')\Big| + \gamma\sum_{s'}P(u,g_u(a),s')\big|V_n(\rho(s')) - V_n(s')\big|\right)$$
$$\le \frac{1}{|\rho(s)|}\sum_{u\in\rho(s)}\max_a\frac{1}{c_r}\left(c_r|R(s,g_s(a)) - R(u,g_u(a))| + c_p\Big|\sum_{s'}\big(P(u,g_u(a),s') - P(s,g_s(a),s')\big)\frac{\gamma c_r}{c_p}V_n(s')\Big|\right) + \gamma\max_{s'}\big|V_n(\rho(s')) - V_n(s')\big|$$
From the proof of Theorem 8 we know that $\left\{\frac{\gamma c_r}{c_p}V_n(s') : s'\in S\right\}$ is a feasible solution to the primal LP for $K(d_n)(P(s,g_s(a),\cdot), P(u,g_u(a),\cdot))$. Let $z$ be the representative used for $\rho(s)$. Then, for any $u\in\rho(s)$ and any action $a$, we can continue as follows:
$$c_r|R(s,g_s(a)) - R(u,g_u(a))| + c_p K(d_n)(P(s,g_s(a),\cdot), P(u,g_u(a),\cdot)) \le c_r|R(s,g_s(a)) - R(u,g_u(a))| + c_p K(d)(P(s,g_s(a),\cdot), P(u,g_u(a),\cdot))$$
$$\le c_r|R(s,g_s(a)) - R(z,g_z(a))| + c_p K(d)(P(s,g_s(a),\cdot), P(z,g_z(a),\cdot)) + c_r|R(z,g_z(a)) - R(u,g_u(a))| + c_p K(d)(P(z,g_z(a),\cdot), P(u,g_u(a),\cdot))$$
$$= \delta(d)((s,g_s(a)),(z,g_z(a))) + \delta(d)((z,g_z(a)),(u,g_u(a))) \le F(d)(z,s) + F(d)(z,u) \le d(z,s) + d(z,u) \le m(\rho(s))$$
where the second inequality is the triangle inequality through the representative $z$, and the final steps use $d$-consistency, the fact that $F(d) \le d$ for a lax bisimulation metric, and the definition of $m$. We continue with the original inequality using these two results:
$$c_r|V_{n+1}(\rho(s)) - V_{n+1}(s)| \le \frac{1}{|\rho(s)|}\sum_{u\in\rho(s)} m(\rho(s)) + \gamma\max_{s'} c_r\big|V_n(\rho(s')) - V_n(s')\big|$$
$$\le m(\rho(s)) + \gamma\max_{s'}\Big(m(\rho(s')) + M\sum_{k=1}^{n-1}\gamma^{n-k}\Big) \le m(\rho(s)) + \gamma M + M\sum_{k=1}^{n-1}\gamma^{n+1-k} = m(\rho(s)) + M\sum_{k=1}^{n}\gamma^{(n+1)-k}$$
The proof of the second claim is nearly identical, except that instead of maximizing over actions, the action selected by the policy, $a = \pi'(\rho(s))$, and the lifted action $g_s(a) = \pi(s)$ are used.

By taking limits we get the following theorem:

Theorem 13. If $M'$ is a $d_{fix}$-consistent aggregation of an MDP $M$, then for all $s\in S$ we have that
$$c_r|V^*(\rho(s)) - V^*(s)| \le m(\rho(s)) + \frac{\gamma M}{1-\gamma}$$
Furthermore, if $\pi'$ is any policy in $M'$ and $\pi$ is the lifted policy in $M$, then
$$c_r|V^{\pi'}(\rho(s)) - V^{\pi}(s)| \le m(\rho(s)) + \frac{\gamma M}{1-\gamma}$$
where $m(C) = 2\max_{u\in C} d_{fix}(s^*, u)$ with $s^*$ the representative state of $C$, and $M = \max_C m(C)$.

One appropriate way to aggregate states is to choose some desired error bound $\epsilon > 0$ and ensure that the states in each class are within an $\epsilon$-ball. A simple way to do this is to pick states at random and add to each class every state within the $\epsilon$-ball of the chosen state. Of course, better clustering heuristics can be used here as well.

It has been noted that when the above condition holds, then under the unlaxed bisimulation metric $e_{fix}$ we can be assured that, for each state $s$, $|V^*(\rho(s)) - V^*(s)|$ is bounded by $\frac{2\epsilon}{c_r(1-\gamma)}$. The theorem above shows that under the lax bisimulation metric $d_{fix}$ this difference is bounded by $\frac{4\epsilon}{c_r(1-\gamma)}$. However, as we illustrate in the next section, a massive reduction in the size of the state space can be achieved by moving from $e_{fix}$ to $d_{fix}$, even when using $\epsilon/2$.
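The random $\epsilon$-ball partitioning described above can be sketched as follows (our illustration; the pairwise distance matrix d is assumed to have been computed beforehand, for instance with the hypothetical lax_metric helper sketched earlier, and the function name is ours):

```python
import numpy as np

def random_epsilon_partition(d, eps, rng=None):
    """Randomly aggregate states into epsilon-balls under a metric d.

    d: S x S matrix of pairwise distances (e.g. d_fix or e_fix).
    eps: desired radius; every state in a class lies within eps of that
         class's randomly chosen representative.
    Returns (blocks, representatives).
    """
    rng = np.random.default_rng() if rng is None else rng
    S = d.shape[0]
    unassigned = set(range(S))
    blocks, reps = [], []
    while unassigned:
        rep = rng.choice(list(unassigned))      # pick a random representative
        block = [s for s in unassigned if d[rep, s] <= eps]
        blocks.append(block)
        reps.append(rep)
        unassigned -= set(block)
    return blocks, reps
```

The representatives returned here play the role of $s^*$ in the definition of $m(C)$, so each class satisfies $m(C) \le 2\epsilon$ by construction.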
For large systems, it might not be feasible to compute the metric $e_{fix}$ in the original MDP. In this case, we might want to use some heuristic or prior knowledge to create an aggregation. Ravindran & Barto (2003) provided, based on a result from Whitt (1978), a bound on the difference in values between the optimal policy in the aggregated MDP and the lifted policy in the original MDP. We now show that our metric can be used to tighten this bound.

Theorem 14. If $M'$ is an aggregation of an MDP $M$, $\pi'$ is an optimal policy in $M'$, $\pi$ is the policy lifted from $\pi'$ to $M$, and $d_{fix}$ corresponds to our metric computed on $M'$, then:
$$|V^*(s) - V^{\pi}(s)| \le \frac{2}{1-\gamma}\left(\max_{s,a}|R(s,g_s(a)) - R'(\rho(s),a)| + \frac{\gamma}{c_r}\max_{s,a} K(d_{fix})(P(s,g_s(a),\cdot), P'(\rho(s),a,\cdot))\right)$$

[Figure 1: Example environment exhibiting symmetries (left): a cross-shaped grid of states N1-N6, E1-E6, S1-S6, W1-W6 surrounding a central state C. Aggregation performance (right): number of lumped states as a function of epsilon, comparing the laxed and unlaxed metrics.]

Proof: We have:
$$|V^*(s) - V^{\pi}(s)| \le \frac{2}{1-\gamma}\max_{s,a}\left|R(s,g_s(a)) - R'(\rho(s),a) + \gamma\sum_{C\in S'}\big(P(s,g_s(a),C) - P'(\rho(s),a,C)\big)V^*(C)\right|$$
$$\le \frac{2}{1-\gamma}\left(\max_{s,a}|R(s,g_s(a)) - R'(\rho(s),a)| + \gamma\max_{s,a}\Big|\sum_{C\in S'}\big(P(s,g_s(a),C) - P'(\rho(s),a,C)\big)V^*(C)\Big|\right)$$
$$\le \frac{2}{1-\gamma}\left(\max_{s,a}|R(s,g_s(a)) - R'(\rho(s),a)| + \frac{\gamma}{c_r}\max_{s,a} K(d_{fix})(P(s,g_s(a),\cdot), P'(\rho(s),a,\cdot))\right)$$
The first inequality originally comes from (Whitt, 1978) and is applied to MDPs in (Ravindran & Barto, 2003). The last inequality holds since $\pi'$ is an optimal policy and thus, by Theorem 8, we know that $\{c_r V^*(C) : C\in S'\}$ is a feasible solution to the LP for $K(d_{fix})$.

As a corollary, we can get the same form of bound as in (Ravindran & Barto, 2003) by bounding the Kantorovich metric by the total variation metric.

Definition 15. Given two finite distributions $P$ and $Q$, the total variation metric $TV(P,Q)$ is defined as:
$$TV(P,Q) = \frac{1}{2}\sum_s |P(s) - Q(s)|$$

Corollary 16. Let $\Delta = \max_{C,a} R'(C,a) - \min_{C,a} R'(C,a)$ be the maximum difference in rewards in the aggregated MDP. Then:
$$|V^*(s) - V^{\pi}(s)| \le \frac{2}{1-\gamma}\left(\max_{s,a}|R(s,g_s(a)) - R'(\rho(s),a)| + \frac{\gamma\Delta}{1-c_p}\max_{s,a} TV(P(s,g_s(a),\cdot), P'(\rho(s),a,\cdot))\right)$$

Proof: This follows from the fact that:
$$\max_{C,D} d_{fix}(C,D) \le c_r\Delta + c_p\max_{C,D}d_{fix}(C,D) \quad\Rightarrow\quad \max_{C,D} d_{fix}(C,D) \le \frac{c_r\Delta}{1-c_p}$$
and, using the total variation as an approximation (Gibbs & Su, 2002), we have:
$$K(d_{fix})(P(s,g_s(a),\cdot), P'(\rho(s),a,\cdot)) \le \max_{C,D}d_{fix}(C,D)\cdot TV(P(s,g_s(a),\cdot), P'(\rho(s),a,\cdot))$$
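The shortcut used in Corollary 16 is easy to state in code; the sketch below (ours, not from the paper; the helper names are hypothetical and d is assumed to be the matrix of $d_{fix}$ values on the aggregate states) replaces the Kantorovich LP with the product of the metric's diameter and the total variation distance.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance of Definition 15: TV(P, Q) = 0.5 * sum_s |P(s) - Q(s)|."""
    return 0.5 * np.abs(np.asarray(p, float) - np.asarray(q, float)).sum()

def kantorovich_upper_bound(p, q, d):
    """Cheap upper bound used in Corollary 16: K(d)(p, q) <= (max_{C,D} d[C, D]) * TV(p, q)."""
    return np.asarray(d).max() * total_variation(p, q)
```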
6 Illustration

Consider the cross-shaped MDP displayed in Figure 1. There is a reward of 1 in the center and the probability of the agent moving in the intended direction is 0.8. For a given $\epsilon$, we used the random partitioning algorithm outlined earlier to create a state aggregation. The graph plots the size of the aggregated MDPs against $\epsilon$, using the lax and the non-lax bisimulation metrics. In the case of the lax metric, we used $\epsilon/2$ to compensate for the factor of 2 difference in the error bound. It is very revealing that the number of classes drops very quickly and levels off at around 6 or 7 for our algorithm. This is because the MDP collapses to a state space close to the natural choice of $\{\{C\}\}\cup\{\{Ni, Si, Wi, Ei\} : i\in\{1,2,3,4,5,6\}\}$. Under the unlaxed metric, this is not likely to occur, and thus the first states to be grouped together are the ones neighbouring each other (which can actually have quite different behaviours).

7 Discussion and future work

We defined a metric for measuring the similarity of state-action pairs in a Markov decision process and used it in an algorithm for constructing approximate MDP homomorphisms. Our approach works significantly better than the bisimulation metrics of Ferns et al., as it allows capturing different regularities in the environment. The theoretical bound on the error in the value function presented in (Ravindran & Barto, 2004) can be derived using our metric. Although the metric is potentially expensive to compute, there are domains in which having an accurate aggregation is worth it. For example, in mobile device applications, one may initially have substantial computational resources for building an aggregation, but may then need a very coarse yet accurate aggregation to fit on a small device. The metric can also be used to find subtasks in a larger problem that can be solved using controllers from a pre-supplied library. For example, if a controller is available for navigating single rooms, the metric might be used to lump states in a building schematic into "rooms". The aggregate MDP can then be used to solve the high-level navigation task, with the controller handling navigation within specific rooms.

An important avenue for future work is reducing the computational complexity of this approach. Two sources of complexity are the quadratic dependence on the number of actions and the evaluation of the Kantorovich metric. The first issue can be addressed by sampling pairs of actions, rather than considering all possibilities. We are also investigating the possibility of replacing the Kantorovich metric (which is very convenient from the theoretical point of view) with a more practical approximation. Finally, the extension to continuous state spaces is very important. We currently have preliminary results on this issue, using an approach similar to (Ferns et al., 2005), which assumes lower semi-continuity of the reward function. However, the details are not yet fully worked out.

Acknowledgements: This work was funded in part by NSERC and CFI.

References

Arun-Kumar, S. (2006). On bisimilarities induced by relations on actions. SEFM '06: Proceedings of the Fourth IEEE International Conference on Software Engineering and Formal Methods (pp. 41-49). Washington, DC, USA: IEEE Computer Society.

Ferns, N., Castro, P. S., Precup, D., & Panangaden, P. (2006). Methods for computing state similarity in Markov decision processes. Proceedings of the 22nd UAI.

Ferns, N., Panangaden, P., & Precup, D. (2004). Metrics for finite Markov decision processes. Proceedings of the 20th UAI (pp. 162-169).

Ferns, N., Panangaden, P., & Precup, D. (2005). Metrics for Markov decision processes with infinite state spaces. Proceedings of the 21st UAI (pp. 201-209).

Gibbs, A., & Su, F. (2002). On choosing and bounding probability metrics.

Givan, R., Dean, T., & Greig, M. (2003). Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147, 163-223.

Larsen, K. G., & Skou, A. (1991). Bisimulation through probabilistic testing. Information and Computation, 94, 1-28.

Li, L., Walsh, T. J., & Littman, M. L. (2006). Towards a unified theory of state abstraction for MDPs. Proceedings of the International Symposium on Artificial Intelligence and Mathematics.

Milner, R. (1995). Communication and concurrency. Hertfordshire, UK: Prentice Hall International (UK) Ltd.

Munkres, J. (1999). Topology. Prentice Hall.

Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. Wiley.

Ravindran, B., & Barto, A. G. (2003). Relativized options: Choosing the right transformation. Proceedings of the 20th ICML (pp. 608-615).

Ravindran, B., & Barto, A. G. (2004). Approximate homomorphisms: A framework for non-exact minimization in Markov decision processes. Proceedings of the Fifth International Conference on Knowledge Based Computer Systems.

Whitt, W. (1978). Approximations of dynamic programs, I. Mathematics of Operations Research, 3, 231-243.

Wolfe, A. P., & Barto, A. G. (2006). Decision tree methods for finding reusable MDP homomorphisms. Proceedings of AAAI.