Optimistic Initialization and Greediness Lead to Polynomial Time Learning in Factored MDPs

István Szita, Rutgers University, 110 Frelinghuysen Road, Piscataway, NJ 08854, USA. szityu@gmail.com
András Lőrincz, Eötvös Loránd University, 1116 Budapest, Pázmány P. sétány 1/C, Hungary. andras.lorincz@inf.elte.hu

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

In this paper we propose an algorithm for polynomial-time reinforcement learning in factored Markov decision processes (FMDPs). The factored optimistic initial model (FOIM) algorithm maintains an empirical model of the FMDP in a conventional way, and always follows a greedy policy with respect to its model. The only trick of the algorithm is that the model is initialized optimistically. We prove that with suitable initialization (i) FOIM converges to the fixed point of approximate value iteration (AVI); (ii) the number of steps when the agent makes non-near-optimal decisions (with respect to the solution of AVI) is polynomial in all relevant quantities; (iii) the per-step costs of the algorithm are also polynomial. To the best of our knowledge, FOIM is the first algorithm with these properties.

1. Introduction

Factored Markov decision processes (FMDPs) are practical ways to compactly formulate sequential decision problems, provided that we have ways to solve them. When the environment is unknown, all effective reinforcement learning methods apply some form of the "optimism in the face of uncertainty" principle: whenever the learning agent faces the unknown, it should assume high rewards in order to encourage exploration.
The factored optimistic initial model (FOIM) takes this principle to the extreme: its model is initialized to be overly optimistic. For more often visited areas of the state space, the model gradually gets more realistic, inspiring the agent to head for unknown regions and explore them, in search of some imaginary "Garden of Eden". The working of the algorithm is simple to the extreme: it makes no explicit effort to balance exploration and exploitation, but always follows the greedy optimal policy with respect to its model. We show in this paper that this simple (even simplistic) trick is sufficient for effective FMDP learning.

The algorithm is an extension of OIM (optimistic initial model) (Szita & Lőrincz, 2008b), which is a sample-efficient learning algorithm for flat MDPs. There is an important difference, however, in the way the model is solved. Every time the model is updated, the corresponding value function needs to be re-calculated (or updated). For flat MDPs, this is not a problem: various dynamic programming-based algorithms (like value iteration) can solve the model to any required accuracy in polynomial time. The situation is less bright for generating near-optimal FMDP solutions: all currently known algorithms may take exponential time, e.g., the approximate policy iteration of Boutilier et al. (2000) using decision-tree representations of policies, or solving the exponential-size flattened version of the FMDP. If we require polynomial running time (as we do in this paper, in search of a practical algorithm), then we have to accept suboptimal solutions (Liberatore, 2002). The only known example of a polynomial-time FMDP planner is factored value iteration (FVI) (Szita & Lőrincz, 2008a), which will serve as the base planner for our learning method. This planner is guaranteed to converge, and the error of its solution is bounded by a term depending only on the quality of the function approximators.

Our analysis of the algorithm will follow the established techniques for analyzing sample-efficient reinforcement learning (like the works of Kearns and Singh (1998), Brafman and Tennenholtz (2001), Strehl and Littman (2005), and Szita and Lőrincz (2008b) on flat MDPs, and Strehl (2007) on FMDPs). However, the listed proofs of convergence rely critically on access to a near-optimal planner, so they have to be generalized suitably. By doing so, we are able to show that FOIM converges to a bounded-error solution in polynomial time with high probability.

We introduce basic concepts and notations in Section 2, then in Section 3 we review existing work, with special emphasis on the immediate ancestors of our method. In Sections 4 and 5 we describe the building blocks of FOIM and the FOIM algorithm, respectively. We finish the paper with a short analysis and discussion.

2. Basic concepts and notations

An MDP is characterized by a quintuple $(X, A, R, P, \gamma)$, where $X$ is a finite set of states; $A$ is a finite set of possible actions; $R : X \times A \to \mathbb{R}$ is the reward function of the agent; $P : X \times A \times X \to [0,1]$ is the transition function; and finally, $\gamma \in [0,1)$ is the discount rate on future rewards. A (stationary, Markov) policy of the agent is a mapping $\pi : X \times A \to [0,1]$. The optimal value function $V^* : X \to \mathbb{R}$ gives the maximum attainable total reward for each state, and satisfies the Bellman equation

$$V^*(x) = \max_{a} \sum_{y} P(y \mid x, a)\,\bigl[R(x,a) + \gamma V^*(y)\bigr]. \qquad (1)$$

Given the optimal value function, it is easy to get an optimal policy: $\pi^*(x,a) := 1$ iff $a = \arg\max_{a'} \sum_{y} P(y \mid x, a')\,\bigl[R(x,a') + \gamma V^*(y)\bigr]$, and 0 otherwise.

2.1. Vector notation

Let $N := |X|$, and suppose that states are integers from 1 to $N$, i.e., $X = \{1, 2, \ldots, N\}$. Clearly, value functions are equivalent to $N$-dimensional vectors of reals, which may be indexed with states. The vector corresponding to $V$ will be denoted as $\mathbf{v}$ and the value of state $x$ by $v_x$. Similarly, for each $a$ let us define the $N$-dimensional column vector $\mathbf{r}^a$ with entries $r^a_x = R(x,a)$ and the $N \times N$ matrix $P^a$ with entries $P^a_{x,y} = P(y \mid x, a)$. The Bellman equations can be expressed in vector notation as $\mathbf{v}^* = \max_{a \in A}\bigl(\mathbf{r}^a + \gamma P^a \mathbf{v}^*\bigr)$, where $\max$ denotes the componentwise maximum operator. The Bellman equations are the basis of many RL algorithms, most notably value iteration:

$$\mathbf{v}_{t+1} := \max_{a \in A}\bigl(\mathbf{r}^a + \gamma P^a \mathbf{v}_t\bigr), \qquad (2)$$

which converges to $\mathbf{v}^*$ for any initial vector $\mathbf{v}_0$.
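To make the vector form of (2) concrete, here is a minimal NumPy sketch of flat value iteration; the array layout and names (`P`, `r`, `gamma`) are our own illustration and are not part of the paper's notation.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Flat value iteration: v_{t+1} = max_a (r^a + gamma * P^a v_t).

    P: array of shape (|A|, N, N) with P[a, x, y] = P(y | x, a)
    r: array of shape (|A|, N)    with r[a, x]    = R(x, a)
    """
    num_actions, num_states, _ = P.shape
    v = np.zeros(num_states)
    for _ in range(max_iter):
        # q[a, x] = r[a, x] + gamma * sum_y P[a, x, y] * v[y]
        q = r + gamma * (P @ v)
        v_new = q.max(axis=0)          # componentwise maximum over actions
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v
```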
2.2. Factored structure

We assume that $X$ is the Cartesian product of $m$ smaller state spaces (corresponding to individual variables): $X = X_1 \times X_2 \times \ldots \times X_m$. For the sake of notational convenience we will assume that each $X_i$ has the same size, $|X_1| = |X_2| = \ldots = |X_m| = n$. With this notation, the size of the full state space is $N = |X| = n^m$. We note that all derivations and proofs carry through to variable spaces of different sizes.

Definition 2.1 For any subset of variable indices $Z \subseteq \{1, 2, \ldots, m\}$, let $X[Z] := \times_{i \in Z} X_i$; furthermore, for any $x \in X$, let $x[Z]$ denote the value of the variables with indices in $Z$. We shall also use the notation $x[Z]$ without specifying a full vector of values $x$; in such cases $x[Z]$ denotes an element of $X[Z]$. For single-element sets $Z = \{i\}$ we shall also use the shorthand $x[\{i\}] = x[i]$.

Definition 2.2 (Local-scope function) A function $f$ is a local-scope function if it is defined over a subspace $X[Z]$ of the state space, where $Z$ is a (presumably small) index set. If $|Z|$ is small, local-scope functions can be represented efficiently, as they can take only $n^{|Z|}$ different values.

Definition 2.3 (Extension) Let $f : X[Z] \to \mathbb{R}$ be a local-scope function. Its extension to the whole state space is defined by $\tilde f(x) := f(x[Z])$. The extension operator for $Z$ is a linear operator with matrix $E[Z] \in \mathbb{R}^{|X| \times |X[Z]|}$, with entries

$$\bigl(E[Z]\bigr)_{u,\,v[Z]} = \begin{cases} 1, & \text{if } u[Z] = v[Z]; \\ 0, & \text{otherwise.} \end{cases}$$

For any local-scope function $f$ with a corresponding vector representation $\mathbf{f} \in \mathbb{R}^{|X[Z]| \times 1}$, $E[Z]\,\mathbf{f} \in \mathbb{R}^{|X| \times 1}$ is the vector representation of the extended function.

We assume that the reward function is the sum of $J$ local-scope functions with scopes $Z_j$:

$$R(x,a) = \sum_{j=1}^{J} R_j\bigl(x[Z_j], a\bigr),$$

or, in vector notation, $\mathbf{r}^a = \sum_{j=1}^{J} E[Z_j]\,\mathbf{r}^a_j$. We also assume that for each variable $i$ there exists a neighborhood set $\Gamma_i$ such that the value of $x_{t+1}[i]$ depends only on $x_t[\Gamma_i]$ and the action $a_t$ taken. Then we can write the transition probabilities in a factored form

$$P(y \mid x, a) = \prod_{i=1}^{m} P_i\bigl(y[i] \mid x[\Gamma_i], a\bigr) \qquad (3)$$

for each $x, y \in X$, $a \in A$, where each factor is a local-scope function $P_i : X[\Gamma_i] \times A \times X_i \to [0,1]$ (for all $i \in \{1, \ldots, m\}$). In vector/matrix notation, each row of $P^a$ is the Kronecker product of the corresponding rows of the local factors: $P^a(x, \cdot) = \bigotimes_{i=1}^{m} P^a_i\bigl(x[\Gamma_i], \cdot\bigr)$, where $\otimes$ denotes the Kronecker product.

Finally, we assume that the sizes of all local scopes are bounded by a small constant $m_f \ll m$: $|\Gamma_i| \le m_f$ for all $i$. As a consequence, all probability factors can be represented with tables having at most $N_f := n^{m_f}$ rows. An FMDP is fully characterized by the tuple $M = \bigl(\{X_i\}_{i=1}^m; A; \{Z_j\}_{j=1}^J; \{R_j\}_{j=1}^J; \{\Gamma_i\}_{i=1}^m; \{P_i\}_{i=1}^m; x_s; \gamma\bigr)$, where $x_s$ denotes the starting state.
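As an illustration of the factored representation (our own sketch, not taken from the paper), the following function stores each transition factor $P_i$ as a small table indexed by the flattened value of $x[\Gamma_i]$ and evaluates eq. (3); all names are hypothetical.

```python
def factored_transition_prob(x, y, a, factors, scopes):
    """Evaluate P(y | x, a) = prod_i P_i(y[i] | x[Gamma_i], a), as in eq. (3).

    x, y    : tuples of length m with entries in {0, ..., n-1}
    factors : list of per-factor tables; factors[i][a] has shape
              (n^|Gamma_i|, n), rows indexed by the flattened x[Gamma_i]
    scopes  : list of index tuples, scopes[i] = Gamma_i
    """
    n = factors[0][a].shape[1]
    prob = 1.0
    for i, gamma_i in enumerate(scopes):
        # flatten x[Gamma_i] into a single row index of the local table
        row = 0
        for j in gamma_i:
            row = row * n + x[j]
        prob *= factors[i][a][row, y[i]]
    return prob
```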
3. Related literature

The idea of representing a large MDP using a factored model was first proposed by Koller and Parr (2000), but similar ideas appear already in the works of Boutilier et al. (1995) and Boutilier et al. (2000).

3.1. Planning in known FMDPs

Decision trees (or, equivalently, decision lists) provide a way to represent the agent's policy compactly. Koller and Parr (2000) and Boutilier et al. (1995; 2000) present algorithms to evaluate and improve such policies, according to the policy iteration scheme. Unfortunately, the size of the policies may grow exponentially even with a decision tree representation (Boutilier et al., 2000; Liberatore, 2002).

The exact Bellman equations (1) can be transformed to an equivalent linear program with $N$ variables and $N \cdot |A|$ constraints. In the approximate linear programming approach, we approximate the value function as a linear combination of $K$ basis functions, resulting in an approximate LP with $K$ variables and $N \cdot |A|$ constraints. Both the objective function and the constraints can be written in compact form, exploiting the local-scope property of the appearing functions. Guestrin et al. (2002) show that the maximum of exponentially many local-scope functions can be computed by rephrasing the task as a non-serial dynamic programming task and eliminating variables one by one. Therefore, the equations can be transformed to an equivalent, more compact linear program. The gain may be exponential, but this is not necessarily so in all cases. Furthermore, solutions will not be (near-)optimal because of the function approximation; the best that can be proved is a bounded error from the optimum (where the bound depends on the quality of the basis functions used for approximation).

The approximate policy iteration algorithm (Koller & Parr, 2000; Guestrin et al., 2002) also uses an approximate LP reformulation, but it is based on the policy-evaluation Bellman equations. Policy-evaluation equations are, however, linear and do not contain the maximum operator, so there is no need for a costly transformation step. On the other hand, the algorithm needs an explicit decision tree representation of the policy, and Liberatore (2002) has shown that the size of the decision tree representation can grow exponentially. Furthermore, the convergence properties of these algorithms are unknown.

Factored value iteration (Szita & Lőrincz, 2008a) also approximates the value function as a linear combination of basis functions, but uses a variant of approximate value iteration: the projection operator is modified to avoid divergence. FVI converges in a polynomial number of steps, but the solution may be sub-optimal. The error of the solution has bounded distance from the optimal value function, where the bound depends on the quality of function approximation. As an integral part of FOIM, FVI is described in detail in Section 4.1.

3.2. Reinforcement learning in FMDPs

In the reinforcement learning setting, the agent interacts with an FMDP environment with unknown parameters. In the model-based approach, the agent has to learn the structure of the FMDP (i.e., the dependency sets $\Gamma_i$ and the reward domains $Z_j$), the transition probability factors $P_i$ and the reward factors $R_j$.

Unknown transitions. Most approaches assume that the structure of the FMDP and the reward functions are known, so only transition probabilities need to be learnt. Examples include the factored versions of sample-efficient model-based RL algorithms: factored E3 (Kearns & Koller, 1999), factored R-max, or factored MBIE (Strehl, 2007). All the abovementioned algorithms have polynomial sample complexity (in all relevant task parameters), and require polynomially many calls to an FMDP planner. Note, however, that all of the mentioned approaches require access to a planner that is able to produce $\varepsilon$-optimal solutions¹, and to date, no algorithm exists that would accomplish this accuracy in polynomial time. Guestrin et al. (2002) also present an algorithm where exploration is guided by the uncertainties of the linear programming solution. While this approach does not require access to a near-optimal planner, no formal performance bounds are known.

¹The assumption of Kearns and Koller (1999) is slightly less restrictive: they only require that the returned policy has value at least $\alpha V^*$ for some $\alpha < 1$. However, no planner is known that can achieve this and cannot achieve near-optimality.

Unknown rewards. Typically, it is asserted that the rewards can be approximated from observations analogously to the transition probabilities. However, if the reward is composed of multiple factors (i.e., $J > 1$), then we can only observe the sums of the unknown quantities, not the individual quantities themselves. To date, we know of no efficient approximation method for learning factored rewards.

Unknown structure. Few attempts exist that try to obtain the structure of the FMDP automatically. Strehl et al. (2007) present a method that learns the structure of an FMDP in polynomial time (in all relevant parameters).
4. Building blocks of FOIM

We describe the two main building blocks of our algorithm: factored value iteration and the optimistic initial model.

4.1. Factored value iteration

We assume that all value functions are approximated as linear combinations of $K$ basis functions $h_k : X \to \mathbb{R}$: $V(x) = \sum_{k=1}^{K} w_k h_k(x)$. Let $H$ be the $N \times K$ matrix mapping feature weights to state values, with entries $H_{x,k} = h_k(x)$, and let $G$ be an arbitrary $K \times N$ linear mapping projecting state values to feature weights. Let $w \in \mathbb{R}^K$ denote the weight vector of the basis functions. It is known (Szita & Lőrincz, 2008a) that if $\|HG\|_\infty \le 1$, then the approximate Bellman equations

$$w^\times = G \max_{a \in A}\bigl(\mathbf{r}^a + \gamma P^a H w^\times\bigr)$$

have a unique fixed point solution $w^\times$, and approximate value iteration (AVI)

$$w_{t+1} := G \max_{a \in A}\bigl(\mathbf{r}^a + \gamma P^a H w_t\bigr) \qquad (4)$$

converges there for any starting vector $w_0$.

Definition 4.1 Let the AVI-optimal value function be defined as $v^\times = H w^\times$.

As shown by Szita and Lőrincz (2008a), the distance of the AVI-optimal value function from the true optimum is bounded by the projection error of $\mathbf{v}^*$:

$$\|v^\times - \mathbf{v}^*\|_\infty \le \frac{1}{1-\gamma}\,\|HG\mathbf{v}^* - \mathbf{v}^*\|_\infty. \qquad (5)$$

We make the further assumption that all the basis functions are local-scope ones: for each $k \in \{1, \ldots, K\}$, $h_k : X[C_k] \to \mathbb{R}$, with feature matrices $H_k \in \mathbb{R}^{|X[C_k]| \times K}$. The feature matrix $H$ can be decomposed as $H = \sum_{k=1}^{K} E[C_k]\,H_k$.

Definition 4.2 For any matrices $H$ and $G$, let the row-normalization of $G$ be the matrix $\mathcal{N}(G)$ of the same size as $G$, having the entries $[\mathcal{N}(G)]_{k,x} = G_{k,x} \big/ \bigl\|[GH]_{k,\cdot}\bigr\|_1$.

Throughout the paper, we shall use the projection matrix $G = \mathcal{N}(H^\top)$. The AVI equation (4) can be considered as the product of the $K \times N$ matrix $G$ and the $N \times 1$ vector $v_t = \max_{a \in A}\bigl(\mathbf{r}^a + \gamma P^a H w_t\bigr)$. Using the above assumptions and notations, we can see that for any $x \in X$, the corresponding column of $G$ and the corresponding element of $v_t$ can be computed in polynomial time:

$$[G]_{k,x} = \frac{[H^\top]_{k,x}}{\bigl\|[H^\top H]_{k,\cdot}\bigr\|_1} = \frac{h_k\bigl(x[C_k]\bigr)}{\sum_{k'=1}^{K} \bigl|[H^\top H]_{k,k'}\bigr|}\,,$$

$$[v_t]_x = \max_{a \in A}\Bigl(\sum_{j=1}^{J} R_j\bigl(x[Z_j], a\bigr) + \gamma \sum_{k=1}^{K} w_{k,t} \sum_{y[C_k] \in X[C_k]} \Bigl(\prod_{i \in C_k} P_i\bigl(y[i] \mid x[\Gamma_i], a\bigr)\Bigr)\, h_k\bigl(y[C_k]\bigr)\Bigr).$$

Factored value iteration draws $N_1 \ll N$ states uniformly at random, and performs approximate value iteration on this reduced state set.

Theorem 4.3 (Szita & Lőrincz, 2008a) Suppose that $G = \mathcal{N}(H^\top)$. For any $\varepsilon > 0$, $\delta > 0$, if the sample size is $N_1 = O\bigl(\frac{m^2}{\varepsilon^2}\log\frac{m}{\delta}\bigr)$, then with probability at least $1 - \delta$, factored value iteration converges to a weight vector $w$ such that $\|w - w^\times\|_\infty \le \varepsilon$. In terms of the optimal value function,

$$\|Hw - \mathbf{v}^*\|_\infty \le \frac{1}{1-\gamma}\,\|HG\mathbf{v}^* - \mathbf{v}^*\|_\infty + \varepsilon. \qquad (6)$$
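The following is a compact sketch of the sampled AVI update (4) with the row-normalized projection $G = \mathcal{N}(H^\top)$. For simplicity it assumes that the rewards and transitions restricted to the sampled states are available as dense arrays; it illustrates the iteration scheme only and is not the authors' implementation.

```python
import numpy as np

def factored_value_iteration(H, r, P, gamma, tol=1e-6, max_iter=1000):
    """Sampled AVI: w_{t+1} = G max_a (r^a + gamma P^a H w_t),
    with G the row-normalization of H^T (Definition 4.2).

    H: (N1, K) basis values on the N1 sampled states
    r: (|A|, N1) rewards on the sample
    P: (|A|, N1, N1) transition probabilities restricted to the sample
       (a simplification; Theorem 4.3 states the actual guarantees)
    """
    G = H.T.copy()
    # normalize each row k of G so that ||[G H]_{k,.}||_1 = 1
    row_norms = np.abs(G @ H).sum(axis=1, keepdims=True)
    G = G / row_norms
    w = np.zeros(H.shape[1])
    for _ in range(max_iter):
        v = (r + gamma * (P @ (H @ w))).max(axis=0)   # the vector [v_t]_x
        w_new = G @ v
        if np.max(np.abs(w_new - w)) < tol:
            break
        w = w_new
    return w
```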
4.2. Optimistic initial model for flat MDPs

There are a number of sample-efficient learning algorithms for MDPs, e.g., E3, R-max, MBIE, and most recently, OIM. The underlying principle of all these methods is similar: they all maintain an approximate MDP model of the environment, and wherever the uncertainty of the model parameters is high, the models are optimistic. This way, the agent is encouraged to explore the unknown areas, reducing the uncertainty of the models. Here, we shall use and extend OIM to factored environments.

In the OIM algorithm, we introduce a hypothetical "garden of Eden" (GOE) state $x_E$, where the agent gets a very large reward $R_E$ and remains there indefinitely. The model is initialized with fake experience, according to which the agent has experienced an $(x, a, x_E)$ transition for all $x \in X$ and $a \in A$. According to this initial model, each state has value $R_E/(1-\gamma)$, which is a major overestimation of the true values. The model is continuously updated by the collected experience of the agent, who always takes the greedy optimal action with respect to its current model. For well-explored $(x,a)$ pairs, the optimism of the model vanishes, thus encouraging the agent to explore the less-known areas.

The reason for choosing OIM is twofold: (1) the optimism of the model is ensured at initialization time, and after that, no extra work is needed to maintain the optimism of the model or to encourage exploration; (2) results on several standard benchmark MDPs indicate that OIM is superior to the other algorithms mentioned.

5. Learning in FMDPs with an Optimistic Initial Model

Similarly to other approaches, we will make the assumptions that (a) the dependencies are known, and (b) the reward function is known; only the transition probabilities need to be learned.

5.1. Optimistic initial model for factored MDPs

During the learning process, we will maintain approximations of the model, in particular, of the transition probability factors. We extend all state factors with the hypothetical "garden of Eden" state $x_E$. Seeing the current state $x$ and the action $a$ taken, the transition model should give the probabilities of the various next states $y$. Specifically, the $i$th factor of the transition model should give the probabilities of the various $y_i$ values, given $x[\Gamma_i]$ and $a$. Initially, the agent has no idea about these, so we let it start with an overly optimistic model: we inject the fake experience that taking action $a$ in $x[\Gamma_i]$ leads to a state with $i$th component $y_i = x_E$. This optimistic model will encourage the agent to explore action $a$ whenever its state is consistent with $x[\Gamma_i]$. After many visits to $(x[\Gamma_i], a)$, the weight of the initial fake experience shrinks, and the optimistic belief of the agent (together with its exploration-boosting effect) fades away. However, by that time, the collected experience provides an accurate approximation of the $P_i(y_i \mid x[\Gamma_i], a)$ values.

So, according to the initial model (based purely on fake experience),

$$P(y \mid x, a) = \begin{cases} 1, & \text{if } y = (x_E, \ldots, x_E); \\ 0, & \text{otherwise,} \end{cases} \qquad\quad R(x,a) = c \cdot R_E, \ \text{where $c$ is the number of components of $x$ equal to $x_E$.}$$

This model is indeed optimistic: all non-GOE states have value at least $R_E/(1-\gamma)$. Note that it is not possible to encode the $R_E$-rewards of the GOE states using the original set of reward factors, so for each state factor $i$ we add a new reward factor with local scope $X_i$: $R_i : X_i \times A \to \mathbb{R}$, defined as $R_i(x,a) = R_E$ if $x = x_E$, and 0 otherwise.

With this modification, we are able to fully specify our algorithm, as shown in the pseudocode below.

Algorithm 1 Factored optimistic initial model.
  input: FMDP structure $\{X_i\}_{i=1}^m$; $A$; $\{Z_j\}_{j=1}^J$; $\{R_j\}_{j=1}^J$; $\{\Gamma_i\}_{i=1}^m$; $x_s$; $\gamma$; basis functions $\{H_k\}_{k=1}^K$ with scopes $\{C_k\}_{k=1}^K$; $\varepsilon > 0$; $\delta > 0$; $R_E$
  initialization: $t := 0$
    for all $i$: add the GOE state, $X_i := X_i \cup \{x_E\}$
    for all $i$: add the GOE reward function $R_i$
    for all $i$, $a$, $x[\Gamma_i]$, $y \in X_i \setminus \{x_E\}$: TransitionCount$_i(x[\Gamma_i], a, y) := 0$
    for all $i$, $a$, $x[\Gamma_i]$: TransitionCount$_i(x[\Gamma_i], a, x_E) := 1$; VisitCount$_i(x[\Gamma_i], a) := 1$
  repeat
    $[P^a_i]_{x[\Gamma_i],\,y} :=$ TransitionCount$_i(x[\Gamma_i], a, y)\,/\,$VisitCount$_i(x[\Gamma_i], a)$ for all $i$, $a$, $x[\Gamma_i]$, $y$
    $w_t :=$ FactoredValueIteration$(M, \{P^a_i\}, \varepsilon, \delta)$
    take the greedy action $a_t$ with respect to $w_t$ and the current model; observe the next state $x_{t+1}$
    update TransitionCount and VisitCount corresponding to the transition $(x_t, a_t, x_{t+1})$
    $t := t + 1$
  until interaction lasts
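A Python sketch mirroring Algorithm 1 is given below. The environment interface (`env.reset`, `env.step`) and the `planner` callback are hypothetical placeholders; the counts are stored exactly as in the initialization step above, with one fake visit to $x_E$ per component.

```python
import numpy as np

def foim(env, planner, n, num_actions, scopes, steps):
    """Sketch of Algorithm 1 (FOIM). `env` and `planner` are assumed
    interfaces: env.step(a) returns the next factored state, and
    planner(model) returns a greedy policy for the current model (e.g. FVI)."""
    m = len(scopes)
    # counts[i][row, a, y]; column index n is the GOE state x_E
    counts = [np.zeros((n ** len(g), num_actions, n + 1)) for g in scopes]
    for i in range(m):
        counts[i][:, :, n] = 1.0        # fake experience: one transition to x_E
    x = env.reset()
    for _ in range(steps):
        # empirical (optimistic) model: P_i(y | x[Gamma_i], a) = counts / visits
        model = [c / c.sum(axis=2, keepdims=True) for c in counts]
        policy = planner(model)         # FOIM always acts greedily w.r.t. its model
        a = policy(x)
        x_next = env.step(a)
        for i, g in enumerate(scopes):
            counts[i][flatten_index(x, g, n), a, x_next[i]] += 1.0
        x = x_next

def flatten_index(x, scope, n):
    """Mixed-radix index of x[scope] into the rows of a local table."""
    row = 0
    for j in scope:
        row = row * n + x[j]
    return row
```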
5.2. Analysis

Below we prove that FOIM gets as good as possible. What is "as good as possible"? We clearly cannot expect better policies than the one the planner would output were the parameters of the FMDP known. And because of the polynomial-running-time constraint on the planner, it will not be able to compute a near-optimal solution. However, we can prove that FOIM gets $\varepsilon$-close to the solution of the planner (which is AVI-near-optimal if the planner is FVI), except for a polynomial number of mistakes during its run.²

²We are using the terms "polynomial" and "polynomial in all relevant quantities" as shorthand for polynomial in $m$, $N_f$, $|A|$, $R_{\max}$, $1/(1-\gamma)$, $1/\varepsilon$ and $1/\delta$.

Theorem 5.1 Suppose that an agent is following FOIM in an unknown FMDP, where all reward components fall into the interval $[0, R_{\max}]$, there are $m$ state factors, and all probability and reward factors depend on at most $m_f$ factors. Let $N_f = n^{m_f}$ and let $\varepsilon > 0$ and $\delta > 0$. If the initial values of FOIM satisfy

$$R_E = c \cdot \frac{m R_{\max}}{\varepsilon^2 (1-\gamma)^4}\,\log\frac{m N_f |A|}{\delta(1-\gamma)},$$

then the number of timesteps when FOIM makes non-AVI-near-optimal moves, i.e., when $Q^{\mathrm{FOIM}}(x_t, a_t) < Q^\times(x_t, a_t) - \varepsilon$, is bounded by

$$O\Bigl(\frac{R_{\max}^2\, m^4 N_f |A|}{\varepsilon^4 (1-\gamma)^4}\,\log^3\frac{1}{\delta}\,\log^2\frac{m N_f |A|}{\delta(1-\gamma)}\Bigr)$$

with probability at least $1 - \delta$.

Proof sketch. The proof uses standard techniques from the literature of sample-efficient reinforcement learning; most notably, it follows the structure of Strehl (2007). There are two important differences compared to previous approaches: (1) we may not assume that the planner is able to output a near-optimal solution, and (2) FOIM may make an unbounded number of model updates, so we cannot make use of the standard argument that "we encounter only finitely many different models, each of them fails with negligible probability, so the whole algorithm fails with negligible probability". Instead, a more careful analysis of the failure probability is needed. The rigorous proof can be found in the extended version of our paper (Szita & Lőrincz, 2009).

5.2.1. Boundedness of value functions

According to our assumptions, all rewards fall between 0 and $R_{\max}$. From this, it is easy to derive an upper bound on the magnitude of the AVI-optimal value function $v^\times$. The bound we get is $\|v^\times\|_\infty \le \frac{3-\gamma}{1-\gamma} V_{\max} =: V_0$, where $V_{\max} = R_{\max}/(1-\gamma)$. For future reference, we note that $V_0 = \Theta\bigl(R_{\max}/(1-\gamma)^2\bigr)$.

5.2.2. From visit counts to model accuracy

The FOIM algorithm builds a transition probability model by keeping track of visit counts of state-action components $(x[\Gamma_i], a)$ and state-action-state transition components $(x[\Gamma_i], a, y)$. First of all, we show that if a state-action component is visited many times, then the corresponding probability components $P_{t,i}(y \mid x[\Gamma_i], a)$ become accurate.

Let us fix a timestep $t \in \mathbb{N}$, a probability factor $i \in \{1, \ldots, m\}$, a state-action component $(x[\Gamma_i], a) \in X[\Gamma_i] \times A$, and an accuracy $\varepsilon_t > 0$. Let us denote the number of visits to the component up to time $t$ by $k_t(x[\Gamma_i], a)$, and let us introduce the shorthands $p_i = P_i(\,\cdot \mid x[\Gamma_i], a)$ and $p_{t,i} = P_{t,i}(\,\cdot \mid x[\Gamma_i], a)$. By Theorem 3 of Strehl (2007) (an application of the Hoeffding-Azuma inequality),

$$\Pr\Bigl[\,\sum_{y \in X_i} |p_i - p_{t,i}| > \varepsilon_t\Bigr] \le 2^n \exp\Bigl(-\frac{\varepsilon_t^2\, k_t(x[\Gamma_i], a)}{2}\Bigr). \qquad (7)$$

Unfortunately, the above inequality only speaks about a single timestep $t$, but we need to estimate the failure probability for the whole run of the algorithm. Let $t_k$ denote the timestep of the $k$th visit to the component. By the union bound, the total failure probability is at most

$$\Pr\Bigl[\,\bigcup_{k=1}^{\infty}\Bigl\{\sum_{y \in X_i} |p_i - p_{t_k,i}| > \varepsilon_{t_k}\Bigr\}\Bigr] \le \sum_{k=1}^{\infty} \Pr\Bigl[\,\sum_{y \in X_i} |p_i - p_{t_k,i}| > \varepsilon_{t_k}\Bigr]. \qquad (8)$$

Let $k_0 := \bigl\lceil \tfrac{2m^2}{\varepsilon^2(1-\gamma)^2}\bigl(\log\tfrac{1}{\delta_1} + n \log 2\bigr) \bigr\rceil$, where $\delta_1$ is fixed below. For $k < k_0$, the number of visits is too low, so in eq. (7) either $\varepsilon_t$ has to be large or the right-hand side is too big. We choose the former: we make the failure probability of each term less than the constant $\delta_1$ by setting $\varepsilon_{t_k} = \phi(\delta_1)/\sqrt{k}$, where $\phi(\delta_1) := \sqrt{2\bigl(\log\tfrac{1}{\delta_1} + n \log 2\bigr)}$.

For $k \ge k_0$, the number of visits is sufficiently large, so we can decrease either the accuracy or the failure probability (or even both). It turns out that an approximation accuracy $\varepsilon_{t_k} = \varepsilon(1-\gamma)/m$ is sufficient, so we decrease the failure probability. Let us set $\delta_1 := \Theta\Bigl(\frac{\varepsilon^2(1-\gamma)^2\,\delta}{m^3 N_f |A|}\Big/\log\frac{m^2 N_f |A|}{\delta\,\varepsilon(1-\gamma)}\Bigr)$. With this choice of $\delta_1$ and $k_0$, $\phi(\delta_1)/\sqrt{k} \le \varepsilon(1-\gamma)/m$ whenever $k \ge k_0$; furthermore, the $k$th term of the sum in (8) is at most $2^n \exp\bigl(-\tfrac{k\,\varepsilon^2(1-\gamma)^2}{2m^2}\bigr)$, so we get that

$$\Pr\Bigl[\,\bigcup_{k=1}^{\infty}\Bigl\{\sum_{y \in X_i} |p_i - p_{t_k,i}| > \max\Bigl(\tfrac{\phi(\delta_1)}{\sqrt{k}}, \tfrac{\varepsilon(1-\gamma)}{m}\Bigr)\Bigr\}\Bigr] \le \sum_{k=1}^{k_0-1} \delta_1 + \sum_{k=k_0}^{\infty} 2^n \exp\Bigl(-\tfrac{k\,\varepsilon^2(1-\gamma)^2}{2m^2}\Bigr) \le k_0\delta_1 + \frac{2^n \exp\bigl(-\tfrac{k_0\,\varepsilon^2(1-\gamma)^2}{2m^2}\bigr)}{1 - \exp\bigl(-\tfrac{\varepsilon^2(1-\gamma)^2}{2m^2}\bigr)} \le \frac{\delta}{m N_f |A|}.$$

We can repeat this estimation for every state-action component $(x[\Gamma_i], a)$. There are at most $m N_f |A|$ of these, so the total failure probability is still at most $\Theta(\delta)$. This means that

$$\sum_{y \in X_i} |p_i - p_{t,i}| \le \max\Bigl(\frac{\phi(\delta_1)}{\sqrt{k_t(x[\Gamma_i], a)}},\, \frac{\varepsilon(1-\gamma)}{m}\Bigr) \qquad (9)$$

will hold for all $(x[\Gamma_i], a)$ pairs and all timesteps $t$ with high probability. From now on, we will consider only realizations where the failure event does not happen, but bear in mind that all statements based on (9) are true only with probability $1 - \Theta(\delta)$. From (9), we can easily get $L_1$ bounds on the accuracy of the full transition probability function:

$$\sum_{y \in X} \bigl|P(y \mid x, a) - P_t(y \mid x, a)\bigr| \le \sum_{i=1}^{m} \max\Bigl(\frac{\phi(\delta_1)}{\sqrt{k_t(x[\Gamma_i], a)}},\, \frac{\varepsilon(1-\gamma)}{m}\Bigr)$$

for all $(x, a) \in X \times A$ and for all $t$.
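For intuition only, the quantities appearing in (9) are straightforward to compute; the helpers below follow the reconstruction in this sketch (the constants are illustrative, not taken verbatim from the extended version).

```python
import math

def phi(delta1, n):
    """phi(delta_1) = sqrt(2 (log(1/delta_1) + n log 2)), as in Section 5.2.2."""
    return math.sqrt(2.0 * (math.log(1.0 / delta1) + n * math.log(2.0)))

def accuracy_bound(k, n, m, gamma, eps, delta1):
    """Right-hand side of eq. (9): the L1 accuracy of one transition
    component after k visits (2.0 is the trivial bound for k = 0)."""
    if k == 0:
        return 2.0
    return max(phi(delta1, n) / math.sqrt(k), eps * (1.0 - gamma) / m)

def known_threshold(n, m, gamma, eps, delta1):
    """Smallest k with phi(delta_1)/sqrt(k) <= eps*(1-gamma)/m, i.e. k_0."""
    return math.ceil((phi(delta1, n) * m / (eps * (1.0 - gamma))) ** 2)
```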
5.2.3. The known-state FMDP

A state-action component $(x[\Gamma_i], a)$ is called known at timestep $t$ if it has been visited at least $k_0$ times, i.e., if $k_t(x[\Gamma_i], a) \ge k_0$. We define the known-component FMDP $M^{K_t}$ as follows: (1) its state and action space, rewards, and the decomposition of the transition probabilities (i.e., the dependency sets $\Gamma_i$) are identical to the corresponding quantities of the true FMDP $M$, and hence to those of the current approximate FMDP $M_t$; (2) for all $a \in A$, $i \in \{1, \ldots, m\}$ and $x[\Gamma_i] \in X[\Gamma_i]$, for any $y_i \in X_i$, the corresponding transition probability component is

$$P^{K_t}_{t,i}(y_i \mid x[\Gamma_i], a) := \begin{cases} P_i(y_i \mid x[\Gamma_i], a), & \text{if } (x[\Gamma_i], a) \in K_t; \\ P_{t,i}(y_i \mid x[\Gamma_i], a), & \text{if } (x[\Gamma_i], a) \notin K_t. \end{cases}$$

Note that the FMDPs $M^{K_t}$ and $M_t$ are very close to each other: unknown state-action components have identical transition functions by definition, while for known components, applying (9) with accuracy parameter $\varepsilon/V_0$ gives

$$\sum_{y \in X_i} \bigl|P^{K_t}_{t,i}(y \mid x[\Gamma_i], a) - P_{t,i}(y \mid x[\Gamma_i], a)\bigr| \le \frac{\varepsilon(1-\gamma)}{m V_0}.$$

Consequently, for all $(x, a)$,

$$\sum_{y \in X} \bigl|P^{K_t}_t(y \mid x, a) - P_t(y \mid x, a)\bigr| \le \frac{\varepsilon(1-\gamma)}{V_0}. \qquad (10)$$

For an arbitrary policy $\pi$, let $v^\pi_K$ and $v^\pi$ be the value functions (the fixed points of the approximate Bellman equations) of $\pi$ in $M^{K_t}$ and $M_t$, respectively. By a suitable variant of the Simulation Lemma (see the supplementary material) that works with the approximate Bellman equations, we get that whenever (10) holds, $\|v^\pi_K - v^\pi\|_\infty \le \varepsilon$.

5.2.4. The FOIM model is optimistic

First of all, note that FOIM does not directly use the empirical transition probabilities $P_{t,i}$; it is more optimistic, giving some chance of getting to the garden of Eden state $x_E$:

$$P^{\mathrm{FOIM}}_{t,i}(y_i \mid x[\Gamma_i], a) = \begin{cases} \dfrac{k_{t,i}}{k_{t,i}+1}\, P_{t,i}(y_i \mid x[\Gamma_i], a), & \text{if } y_i \ne x_E; \\[4pt] \dfrac{1}{k_{t,i}+1}, & \text{if } y_i = x_E, \end{cases}$$

where we introduced the shorthand $k_{t,i} = k_t(x[\Gamma_i], a)$. Now, we show that

$$Q^\times(x,a) - \Bigl(R(x,a) + \gamma \sum_{y \in X} P^{\mathrm{FOIM}}_t(y \mid x, a)\, V^\times(y)\Bigr) \le \Theta\bigl(\varepsilon(1-\gamma)\bigr) \qquad (11)$$

for all $(x,a) \in X \times A$ and all $t$, or equivalently, that $\sum_{y \in X}\bigl(P(y \mid x, a) - P^{\mathrm{FOIM}}_t(y \mid x, a)\bigr) V^\times(y) \ge -\Theta\bigl(\varepsilon(1-\gamma)\bigr)$. Substituting the definition of $P^{\mathrm{FOIM}}_t$ and the accuracy bound (9), the contribution of factor $i$ to the left-hand side is at least

$$-\max\Bigl(\frac{\phi(\delta_1)}{\sqrt{k_{t,i}}},\, \frac{\varepsilon(1-\gamma)}{m}\Bigr) \cdot \frac{k_{t,i}}{k_{t,i}+1}\, V_0 + \frac{1}{k_{t,i}+1}\, V_E,$$

where $V_E = R_E/(1-\gamma)$ denotes the value of the GOE state. Every such term is larger than $-\varepsilon(1-\gamma)/m$, provided that we can prove the slightly stronger inequality

$$-\max\Bigl(\frac{\phi(\delta_1)}{\sqrt{k_{t,i}}},\, \frac{\varepsilon(1-\gamma)}{m}\Bigr) \cdot 2V_0 + \frac{1}{k_{t,i}}\, V_E \ge -\frac{\varepsilon(1-\gamma)}{m}.$$

First note that if the second term dominates the max expression, then the inequality is automatically true, so we only have to deal with the situation when the first term dominates. In this case, the inequality takes the form $-\frac{\phi(\delta_1)}{\sqrt{k_{t,i}}} \cdot 2V_0 + \frac{1}{k_{t,i}} V_E \ge -\frac{\varepsilon(1-\gamma)}{m}$, which always holds because of our choice of $R_E$. Summing the contributions of the $m$ factors gives (11).
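The optimistic estimate $P^{\mathrm{FOIM}}_{t,i}$ is exactly what the counting scheme of Algorithm 1 produces, since the single fake transition to $x_E$ is never removed; a small numeric illustration (array names are hypothetical):

```python
import numpy as np

def foim_estimate(real_counts):
    """Given the real transition counts of one (x[Gamma_i], a) component,
    return the FOIM estimate over X_i plus the GOE column: it equals
    k/(k+1) times the empirical distribution, with the remaining
    1/(k+1) probability mass placed on x_E."""
    k = real_counts.sum()
    counts_with_goe = np.append(real_counts, 1.0)   # the single fake x_E count
    return counts_with_goe / (k + 1.0)
```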
Next, we show by induction that the value function maintained by FOIM is optimistic: $V^{(t)}(x) \ge V^\times(x) - \Theta(\varepsilon)$ and $Q^{(t)}(x,a) \ge Q^\times(x,a) - \Theta(\varepsilon)$ for all $t = 0, 1, 2, \ldots$ and all $(x,a) \in X \times A$. The inequalities hold for $t = 0$. When moving from step $t$ to $t+1$,

$$Q^{(t+1)}(x,a) = R(x,a) + \gamma \sum_{y \in X} P^{\mathrm{FOIM}}_t(y \mid x, a)\, V^{(t)}(y) \ge R(x,a) + \gamma \sum_{y \in X} P^{\mathrm{FOIM}}_t(y \mid x, a)\bigl(V^\times(y) - \Theta(\varepsilon)\bigr) \ge Q^\times(x,a) - \gamma\,\Theta(\varepsilon) - \Theta\bigl((1-\gamma)\varepsilon\bigr)$$

for all $(x,a)$, where we applied the induction assumption and eq. (11). Consequently, $\max_{a \in A} Q^{(t+1)}(x,a) \ge \max_{a \in A} Q^\times(x,a) - \Theta(\varepsilon)$ for all $x$. Note that according to our assumptions, all entries of $H$ are nonnegative, as are the entries of $G = \mathcal{N}(H^\top)$, so multiplication by the rows of $HG$ is a monotonous operator; furthermore, all rows sum to 1, yielding

$$\sum_{x \in X} [HG]_{y,x} \max_{a \in A} Q^{(t+1)}(x,a) \ge \sum_{x \in X} [HG]_{y,x} \Bigl(\max_{a \in A} Q^\times(x,a) - \Theta(\varepsilon)\Bigr),$$

that is, $V^{(t+1)}(y) \ge V^\times(y) - \Theta(\varepsilon)$.

5.2.5. Proximity of value functions

The rest of the proof is standard, so we give only a very rough sketch here. We define a cutoff horizon $H := \Theta\bigl(\frac{1}{1-\gamma}\log\frac{R_E}{\varepsilon(1-\gamma)}\bigr)$ and an escape event $A$, which happens at timestep $t$ if the agent encounters an unknown transition in the next $H$ steps. We separate two cases depending on whether $\Pr(A)$ is smaller than $\frac{\varepsilon(1-\gamma)}{R_E}$ or not. If the probability of escape is low, then we can show that $Q^{\mathrm{FOIM}}(x_t, a_t) \ge Q^\times(x_t, a_t) - \Theta(\varepsilon)$. Otherwise, if $\Pr(A)$ is large, then an unknown state-action component is found with significant probability. However, this can happen at most $m N_f |A| k_0$ times (because all components become known after $k_0$ visits), which is polynomial, so the second case can occur only a polynomial number of times.

Finally, we remind the reader that the above statements are true only with probability $1 - \Theta(\delta)$. To round off the proof, we note that we are free to choose the constant in the definition of $R_E$ (as it is hidden in the $\Theta(\cdot)$ notation), so we set it in a way that the $\Theta(\varepsilon)$ and $\Theta(\delta)$ terms become at most $\varepsilon$ and $\delta$, respectively.
6. Discussion

FOIM is conceptually very simple: the exploration-exploitation dilemma is resolved without any explicit exploration, and action selection is always greedy. The model update and the model solution are also at least as simple as the alternatives found in the literature. Further, FOIM has some favorable theoretical properties: it is the first example of an RL algorithm with polynomial per-step computational complexity in FMDPs. To achieve this, we had to relax the near-optimality requirement on the FMDP planner. The particular planner we use, FVI, runs in polynomial time and does reach a bounded error, where the looseness of the bound depends on the quality of the basis functions. In almost all time steps, FOIM gets $\varepsilon$-close to the FVI value function with high probability (for any pre-specified $\varepsilon$); the number of timesteps when this does not happen is polynomial.

From a practical point of view, calling an FMDP model-solver in each iteration could be prohibitive. However, the model and the value function usually change very little after a single model update, so we may initialize FVI with the previous value function, and a few iterations might be sufficient.

Acknowledgements

This work has been supported by the EC NEST "Perceptual Consciousness: Explication and Testing" grant under contract 043261. Opinions and errors in this manuscript are the authors' responsibility; they do not necessarily reflect the opinions of the EC or other project members. The first author has been partially supported by the Fulbright Scholarship.

References

Boutilier, C., Dearden, R., & Goldszmidt, M. (1995). Exploiting structure in policy construction. International Joint Conference on Artificial Intelligence (pp. 1104-1111).

Boutilier, C., Dearden, R., & Goldszmidt, M. (2000). Stochastic dynamic programming with factored representations. Artificial Intelligence, 121, 49-107.

Brafman, R. I., & Tennenholtz, M. (2001). R-MAX: a general polynomial time algorithm for near-optimal reinforcement learning. International Joint Conference on Artificial Intelligence (pp. 953-958).

Guestrin, C., Koller, D., Parr, R., & Venkataraman, S. (2002). Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19, 399-468.

Kearns, M. J., & Koller, D. (1999). Efficient reinforcement learning in factored MDPs. International Joint Conference on Artificial Intelligence (pp. 740-747).

Kearns, M. J., & Singh, S. (1998). Near-optimal reinforcement learning in polynomial time. International Conference on Machine Learning (pp. 260-268).

Koller, D., & Parr, R. (2000). Policy iteration for factored MDPs. Uncertainty in Artificial Intelligence (pp. 326-334).

Liberatore, P. (2002). The size of MDP factored policies. AAAI Conference on Artificial Intelligence (pp. 267-272).

Strehl, A. L. (2007). Model-based reinforcement learning in factored-state MDPs. IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (pp. 103-110).

Strehl, A. L., Diuk, C., & Littman, M. L. (2007). Efficient structure learning in factored-state MDPs. AAAI Conference on Artificial Intelligence (pp. 645-650).

Strehl, A. L., & Littman, M. L. (2005). A theoretical analysis of model-based interval estimation. International Conference on Machine Learning (pp. 856-863).

Szita, I., & Lőrincz, A. (2008a). Factored value iteration. Acta Cybernetica, 18, 615-635.

Szita, I., & Lőrincz, A. (2008b). The many faces of optimism: a unifying approach. International Conference on Machine Learning (pp. 1048-1055).

Szita, I., & Lőrincz, A. (2009). Optimistic initialization and greediness lead to polynomial time learning in factored MDPs (extended version). http://arxiv.org/abs/0904.3352