Model-based reinforcement learning with nearly tight exploration complexity bounds

István Szita, Csaba Szepesvári
University of Alberta, Athabasca Hall, Edmonton, AB T6G 2E8 Canada
szityu@gmail.com, szepesva@cs.ualberta.ca

Abstract

One might believe that model-based algorithms of reinforcement learning can propagate the obtained experience more quickly, and are able to direct exploration better. As a consequence, fewer exploratory actions should be enough to learn a good policy. Strangely enough, current theoretical results for model-based algorithms do not support this claim: in a finite Markov decision process with $N$ states, the best bounds on the number of exploratory steps necessary are of order $O(N^2 \log N)$, in contrast to the $O(N \log N)$ bound available for the model-free, delayed Q-learning algorithm. In this paper we show that Mormax, a modified version of the Rmax algorithm, needs to make at most $O(N \log N)$ exploratory steps. This matches the lower bound up to logarithmic factors, as well as the upper bound of the state-of-the-art model-free algorithm, while our new bound improves the dependence on other problem parameters.

In the reinforcement learning (RL) framework, an agent interacts with an unknown environment and tries to maximize its long-term profit. A standard way to measure the efficiency of the agent is sample complexity or exploration complexity. Roughly, this quantity tells at most how many non-optimal (exploratory) steps the agent makes. The best understood and most studied case is when the environment is a finite Markov decision process (MDP) with the expected total discounted reward criterion. Since the work of Kearns & Singh (1998), many algorithms have been published with bounds on their sample complexity. Some of these algorithms, like Rmax, MBIE, and OIM, build an approximate model of the environment, while others, like delayed Q-learning, do not, but rather approximate an action-value function directly.

Intuitively, keeping a model should help: whenever some information is gained from an exploratory move, a model-based algorithm can immediately propagate that information throughout the state space. More efficient use of information means that, at least in principle, less exploration is needed. In this light, it seems surprising that the best known sample complexity bounds are better for model-free methods than for the model-based ones: the sample complexity of delayed Q-learning scales linearly with the number of states, while the previously best known bounds for model-based methods scale quadratically. One may speculate that this is because in a model the number of transitions that need to be stored (and thus estimated) is quadratic in the number of states. Here we show that this is not the case: the quadratic bounds of earlier results were not due to inherent limitations of the model-based approach. Specifically, we present a slight modification of the Rmax algorithm and show that (disregarding logarithmic factors) its sample complexity scales only linearly with the number of states, and has better dependence on the value upper bound $V_{\max}$ and the required precision $\varepsilon$ than the previous best bound, which was available for a model-free algorithm.

1. Markov decision processes

A standard assumption of RL is that the environment is a (discounted-reward, finite) MDP.
Here we only introduce the notational framework used in the paper; for detailed explanations we refer the reader to Bertsekas & Tsitsiklis (1996). Technically, a finite MDP $M$ is a triple $(X, A, \mathcal P)$, where $X$ is a finite set of states; $A$ is a finite set of possible actions; and $\mathcal P$, determining the evolution of the decision process, maps $X \times A$ into distributions over $X \times \mathbb R$ (states and rewards). We assume that all the (random) immediate rewards are nonnegative and are upper-bounded by a constant $R_{\max} \ge 0$. By slightly abusing the notation, when the stochasticity of the rewards is not a concern we can identify $M$ with the 4-tuple $(X, A, P, R)$, where $P : X \times A \times X \to [0,1]$ is the transition kernel and $R : X \times A \to [0, R_{\max}]$ is the immediate expected reward function underlying $\mathcal P$.

A policy is a mapping that assigns to every history $h$ a probability mass function over the actions $A$. Following a policy $\pi$ in the MDP means that $a_t \sim \pi(\cdot\,|\,h_t)$. A stationary policy is identified with a mapping $\pi : X \times A \to [0,1]$. We assume that future rewards are discounted by a multiplicative factor $\gamma \in [0,1)$. The discounted state- and action-value functions will be denoted by $V^\pi$ and $Q^\pi$, respectively, for any (not necessarily stationary) policy $\pi$. The optimal state- and action-value functions will be denoted by $V^*$ and $Q^*$. If the parameters of the MDP, $P$ and $R$, are known, then the optimal value function (and thus an optimal policy) can be found by an iterative solution of the Bellman equations (Bertsekas & Tsitsiklis, 1996). In the reinforcement learning setting, however, $P$ and $R$ are unknown, and the agent has only access to direct experience: being in state $x_t$ and taking action $a_t$, it can observe the next state $x_{t+1}$ and the reward $r_t$. For simplicity, in this paper we assume that the reward function is known, while the transition probabilities are not (the results would continue to hold in the more general case with some obvious modifications).

Exploration. In an unknown environment, the agent has to make exploratory actions to learn the effects of its actions, and thus learn how to get to the most rewarding parts of the state space. This means that from time to time the agent has to take actions other than the ones that seem best for the moment, in the hope that it finds something better. Failing to do that, the agent may never find the optimal policy. On the other hand, too much exploration reduces the amount of collected rewards unnecessarily. Finding the right balance is non-trivial.

Model-based and model-free learning. There are three main branches of RL methods for learning in MDPs. Of course, the boundaries of these three categories are somewhat blurred. Model-based methods approximate the transition probabilities and the reward function, then derive a value function from the approximate MDP. The policy of the agent is then derived from the value function. Model-free methods directly learn a value function (and from that, a policy). The third branch consists of policy search algorithms, which aim at finding a policy without resorting to estimating value functions. As no sample complexity results are known about policy search methods, we will not discuss them in this paper.
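To make the model-based route concrete, the following is a minimal Python sketch (ours, not from the paper) of deriving an optimal action-value function from a known or estimated model by iterating the Bellman optimality equations. The array layout and the function name solve_mdp are our own choices; a routine of this kind is what the SolveMDP call in Algorithm 1 below stands for.

```python
import numpy as np

def solve_mdp(P, R, gamma, tol=1e-8):
    """Q-value iteration for a finite MDP.

    P: transition kernel, shape (N, K, N); P[x, a, y] = P(y | x, a).
    R: expected immediate rewards, shape (N, K).
    Returns an approximation of the optimal action-value function, shape (N, K).
    """
    Q = np.zeros(R.shape)
    while True:
        V = Q.max(axis=1)                 # greedy state values
        Q_new = R + gamma * P.dot(V)      # Bellman optimality backup
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```

A greedy policy with respect to the returned table is then optimal for the (approximate) model that was passed in.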
1.1. Sample complexity of reinforcement learning

One natural evaluation criterion for RL algorithms is to count the number of non-optimal actions taken (these are the exploratory actions, or those that the agent mistakenly believes to be optimal). Before giving the precise definition, we need the concept of MDP learning algorithms. An MDP learning algorithm (in short, algorithm) is a procedure which gets as input the number of states, the number of actions, $\gamma$, $R_{\max}$ (and possibly other parameters) of some MDP $M$, and then interacts with $M$ by sequentially observing states and rewards and sending actions to $M$. Actions must be chosen based on past observations only. Given $x_0, a_0, r_0, x_1, a_1, r_1, \ldots$, a stream of experience generated when algorithm $\mathcal A$ interacts with $M$, we define its (expected future) value at time $t$, conditioned on the past $h_t$, as $V^{\mathcal A}_{t,M} = \mathbb E\left[\sum_{s=0}^\infty \gamma^s r_{t+s} \,\middle|\, h_t\right]$. Note that $V^{\mathcal A}_{t,M}$ is a random variable. Let $V_{\max}$ be an upper bound on the state values in the MDP. (Clearly, we can choose $V_{\max} = R_{\max}/(1-\gamma)$, but sometimes much better bounds are available.)

Definition 1 (Kakade, 2003) Let $\varepsilon > 0$ be a prescribed accuracy and $\delta > 0$ be an allowed probability of failure. The expression $\zeta(\varepsilon, \delta, N, K, \gamma, R_{\max})$ is a sample complexity bound for algorithm $\mathcal A$ if the following holds: Take any $\varepsilon > 0$, $\delta \in (0,1)$, $N > 0$, $K > 0$, $\gamma \in [0,1)$, $R_{\max} > 0$ and any MDP $M$ with $N$ states, $K$ actions, discount factor $\gamma$, and rewards bounded by $R_{\max}$. Let $\mathcal A$ interact with $M$, resulting in the process $x_0, a_0, r_0, x_1, a_1, r_1, \ldots$. Then, independently of the choice of $x_0$, with probability at least $1-\delta$, the number of timesteps $t$ such that $V^{\mathcal A}_{t,M} < V^*(x_t) - \varepsilon$ is at most $\zeta(\varepsilon, \delta, N, K, \gamma, R_{\max})$.

An algorithm with sample complexity that is polynomial in $1/\varepsilon$, $\log(1/\delta)$, $N$, $K$, $1/(1-\gamma)$, $R_{\max}$ is called PAC-MDP (probably approximately correct in MDPs). (An alternative way of measuring sample complexity is to count the number of timesteps when $Q^*(x_t, a_t) < V^*(x_t) - \varepsilon$. If $\zeta'$ is a uniform bound on this count and $\zeta$ is as above, then one can show that $\zeta$ can be bounded in terms of $\zeta'$.)
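The alternative mistake count mentioned above is easy to evaluate in simulation, where the true MDP, and hence $Q^*$, is available. The following short Python sketch (ours, purely illustrative) counts the timesteps of a recorded trajectory on which the chosen action is more than $\varepsilon$ below optimal.

```python
def count_suboptimal_steps(Q_star, states, actions, eps):
    """Count timesteps t with Q*(x_t, a_t) < V*(x_t) - eps.

    Q_star: optimal action-value table of the true MDP, shape (N, K)
            (for example, computed by the value-iteration sketch above).
    states, actions: the recorded trajectory of an agent.
    """
    V_star = Q_star.max(axis=1)
    return sum(1 for x, a in zip(states, actions)
               if Q_star[x, a] < V_star[x] - eps)
```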
1.2. Previous sample complexity results

The above definition of sample complexity is due to Kakade (2003), though algorithms with polynomial sample complexity existed before. The first such algorithm was E3 (Kearns & Singh, 2002), which was followed by other, simpler algorithms like Rmax (Brafman & Tennenholtz, 2002), model-based interval estimation (MBIE) (Strehl & Littman, 2005) and the optimistic initial model (OIM) (Szita & Lőrincz, 2008), all of which are model-based. The state-of-the-art complexity bound for model-based learning in MDPs is due to Li (2009), who showed that Rmax takes at most $\tilde O\!\left(\frac{N^2 K V_{\max}^3}{\varepsilon^3(1-\gamma)^3}\right)$ exploratory actions. The only model-free method with known complexity bounds is delayed Q-learning (Strehl et al., 2006), which requires at most $\tilde O\!\left(\frac{N K V_{\max}^4}{\varepsilon^4(1-\gamma)^4}\right)$ exploratory actions. It is worth mentioning that if the agent has access to an oracle that can draw samples from $P(x,a,\cdot)$ for any $(x,a)$ pair, then $\tilde O\!\left(\frac{N K V_{\max}^2}{\varepsilon^2(1-\gamma)^2}\right)$ samples are sufficient (Kearns & Singh, 1998). The best available lower bound is by Li (2009): at least $\Omega\!\left(\frac{N K R_{\max}^2}{\varepsilon^2}\log N\right)$ exploratory moves are necessary for any deterministic algorithm that learns from samples. An alternative interesting measure of the effectiveness of an RL algorithm is regret, the total reward lost by the algorithm in comparison to the optimal policy (e.g., Auer et al., 2009). Discussion of regret bounds is out of the scope of the present paper.

2. The modified Rmax algorithm

The original Rmax algorithm is an archetypical example of a model-based RL algorithm. The model of Rmax is not accurate (especially not at the beginning of learning, when the agent has basically no information about the environment), but it ensures that whenever there is uncertainty about a state, its value will be overestimated. Initially, the model assumes that each state is worth $R_{\max}$ immediate reward (hence the name), and all actions self-loop, thus their value is $V_{\max} = R_{\max}/(1-\gamma)$. As soon as enough samples are collected about a state, the agent estimates the transition probabilities (and rewards) from that state, then recalculates the optimal policy according to the current model. As the value of unknown states is overestimated, the agent will be drawn towards them until they are explored. The original algorithm needs $m = O(N\log N)$ samples (considering only the $N$-dependency) per state-action (SA) pair. After collecting this many samples for an SA-pair, Rmax considers the pair known and stops collecting further information.

The analysis of delayed Q-learning (Strehl et al., 2006) and of parallel sampling (Kearns & Singh, 1998) suggests that $m = O(\log N)$ samples might be enough if we keep collecting fresh data and, whenever an $m$-element batch of new data is collected, re-estimate the model if necessary. Obviously, if we want any benefits, the number of re-estimations cannot be too large. Let $t$ be the current time step and $\tau_{t,x,a}$ be the last time prior to $t$ when an update of the model at $(x,a)$ was considered (or $0$ if no update happened yet at $(x,a)$). We will show that no update is necessary if no updates happened elsewhere in the model between $\tau_{t,x,a}$ and $t$; nor is an update necessary unless the value of the SA-pair has decreased significantly, say, by $c\cdot\varepsilon$ with some appropriate constant $c > 0$. Consequently, any $(x,a)$ pair will be updated at most $V_{\max}/(c\varepsilon)$ times, which is affordable.

The resulting Mormax algorithm is listed as Algorithm 1 (Mormax stands for "modified Rmax"; the notation $\tilde O(n)$ stands for $O(n\log n)$). To increase readability, we omitted subscripts. To match the quantities used in the proofs, set $a \leftarrow a_t$, $x \leftarrow x_t$, $y \leftarrow x_{t+1}$. Other than that, all left-hand-side quantities should be assigned index $t+1$, right-hand-side ones index $t$. One exception is $Q_{\mathrm{prev}}(x,a)$, which refers to $Q_{\tau_{x,a}}(x,a)$. Note that if we simplify the condition in line 14 to "if $(\tau_{x,a} = 0)$ then ..." and increase $m$, we get back the original Rmax algorithm. We also note that the naive representation of $\hat P$ requires $O(N^2 K)$ space. A more effective implementation, such as the one used by Kearns & Singh (1998), does not store the full transition probability table, only the $m = O(\log(NK))$ samples for each SA-pair, so the total space complexity can be cut down to $\tilde O(NK)$.

Algorithm 1 The Mormax algorithm
1: Input: discount $\gamma$, accuracy $\varepsilon$, allowed failure prob. $\delta$, value upper bound $V_{\max}$
2: Initialization:
3: $t := 0$; $u := 0$; $\varepsilon_2 := \varepsilon(1-\gamma)/(15\gamma)$; $m := \left\lceil \frac{1800\,\gamma^2 V_{\max}^2}{\varepsilon^2(1-\gamma)^2}\log\frac{144\,\gamma N^2K^2V_{\max}}{\varepsilon(1-\gamma)\,\delta}\right\rceil$
4: for all $(x,a,y) \in X \times A \times X$ do
5:   $n(x,a,y) := 0$; $n(x,a) := 0$; $\hat P(x,a,y) := \mathbb I\{x=y\}$; $\hat R(x,a) := (1-\gamma)V_{\max}$; $\tau_{x,a} := 0$
6: $Q := \mathrm{SolveMDP}(\hat P, \hat R, \gamma)$
7: Receive $x$ from the environment
8: repeat
9:   $a := \arg\max_{a'} Q(x,a')$  // greedy action w.r.t. current optimistic model
10:  interact: apply $a$ and observe $r$ and $y$
11:  $n(x,a) := n(x,a)+1$; $n(x,a,y) := n(x,a,y)+1$  // update counters
12:  if $n(x,a) = m$ then
13:    // if $(x,a)$ is not yet known or the change is significant, we update the model
14:    if $(\tau_{x,a} = 0) \vee \big[(u > \tau_{x,a}) \wedge (Q_{\mathrm{prev}}(x,a) - Q(x,a) > 2\varepsilon_2)\big]$ then
15:      $P' := \hat P$; $R' := \hat R$; $P'(x,a,\cdot) := n(x,a,\cdot)/m$; $R'(x,a) := R(x,a)$
16:      $Q' := \mathrm{SolveMDP}(P', R', \gamma)$  // calculate trial model
17:      if $Q'(x,a) \le Q(x,a)$ then
18:        $\hat P := P'$; $\hat R := R'$; $Q := Q'$  // model update
19:      $\tau_{x,a} := t$; $u := t$; $Q_{\mathrm{prev}}(x,a) := Q(x,a)$
20:    $n(x,a) := 0$; $n(x,a,\cdot) := 0$  // restart sample collection at $(x,a)$
21:  $t := t+1$; $x := y$
22: until interaction lasts

In the listing, $u$ records the time of the last update step (line 19) anywhere in the model.

Theorem 1 Fix some prescribed accuracy $\varepsilon > 0$, failure probability $\delta \in (0,1)$, and discount factor $\gamma \in [0,1)$. Let $M = (X, A, P, R)$ be an MDP with $|X| = N$ states and $|A| = K$ actions, with nonnegative rewards, and let $V_{\max} \in \mathbb R$ be an upper bound on all discounted cumulated rewards. If the Mormax algorithm runs on MDP $M$, then with probability at least $1-\delta$, the number of time steps $t$ for which $V^{\mathrm{Mormax}}_{t,M} < V^*_M(x_t) - \varepsilon$ is bounded by $\tilde O\!\left(\frac{N K V_{\max}^2}{\varepsilon^2(1-\gamma)^4}\right)$.
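The following is a compact Python sketch of the loop in Algorithm 1. It only illustrates the bookkeeping and is not the authors' implementation: the simulator interface (env.num_states, env.num_actions, env.reset, env.step), the helper solve_mdp and the way the parameters m and eps2 are passed in are all our own assumptions.

```python
import numpy as np

def solve_mdp(P, R, gamma, tol=1e-8):
    """Q-value iteration, as in the earlier sketch; P: (N, K, N), R: (N, K)."""
    Q = np.zeros(R.shape)
    while True:
        Q_new = R + gamma * P.dot(Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

def mormax(env, R, gamma, Vmax, m, eps2, num_steps):
    """Sketch of Algorithm 1 (Mormax). `env` is a hypothetical simulator with
    reset() -> x and step(a) -> (reward, next_state); R is the known reward table."""
    N, K = env.num_states, env.num_actions
    n_xa = np.zeros((N, K), dtype=int)           # visit counters n(x, a)
    n_xay = np.zeros((N, K, N), dtype=int)       # transition counters n(x, a, y)
    P_hat = np.zeros((N, K, N))
    P_hat[np.arange(N), :, np.arange(N)] = 1.0   # initially every action self-loops
    R_hat = np.full((N, K), (1 - gamma) * Vmax)  # optimistic reward, i.e. value Vmax
    tau = np.zeros((N, K), dtype=int)            # last time an update was considered
    Q_prev = np.full((N, K), Vmax)
    last_update = 0                              # time of the last update anywhere
    Q = solve_mdp(P_hat, R_hat, gamma)

    x = env.reset()
    for t in range(num_steps):
        a = int(np.argmax(Q[x]))                 # greedy w.r.t. optimistic model
        r, y = env.step(a)                       # reward r is not needed: R is known
        n_xa[x, a] += 1
        n_xay[x, a, y] += 1
        if n_xa[x, a] == m:                      # a phase of m fresh samples ended
            unknown = (tau[x, a] == 0)
            stale = last_update > tau[x, a] and Q_prev[x, a] - Q[x, a] > 2 * eps2
            if unknown or stale:                 # condition of line 14
                P_new, R_new = P_hat.copy(), R_hat.copy()
                P_new[x, a] = n_xay[x, a] / m
                R_new[x, a] = R[x, a]
                Q_new = solve_mdp(P_new, R_new, gamma)
                if Q_new[x, a] <= Q[x, a]:       # accept only if the value decreases
                    P_hat, R_hat, Q = P_new, R_new, Q_new
                tau[x, a] = t
                last_update = t
                Q_prev[x, a] = Q[x, a]
            n_xa[x, a] = 0                       # restart fresh sample collection
            n_xay[x, a] = 0
        x = y
    return Q
```

Note how a rejected trial model (Q_new[x, a] > Q[x, a]) still refreshes tau and Q_prev, mirroring lines 17 to 19 of the listing above.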
3. Bounding sample complexity

For a reader familiar with RL sample complexity bounds, most of the following proof may look familiar: the logic of the proof is fairly standard. Elements of the proof appear in the works of Kearns & Singh (1998), Kakade (2003), Szita & Lőrincz (2008), Li (2009) and Strehl et al. (2006). Nevertheless, we provide a detailed sketch of the proof, because our improvement on the key lemma induces changes in all other parts of the proof, so citing the original versions of the various lemmas from the above-mentioned papers would inevitably lead to confusion and loss of rigour. For lack of space, easy and/or well-known results are stated as lemmas without proofs.

3.1. Definitions

Let us fix $\varepsilon_1 = O(\varepsilon)$ and $\delta_1 = O(\delta)$, with their exact values defined later. For any time step $t$, let $x_t$ be the actual state and $a_t$ be the action of the agent, and let $\mathcal F_t$ be the $\sigma$-algebra defined by the history up to $t$, that is, $\mathcal F_t = \sigma\{x_0, a_0, r_0, \ldots, x_t, a_t, r_t, x_{t+1}\}$. (Readers not experienced with measure theory can read "quantity $X$ is $\mathcal F_t$-measurable" as "if we know the history up to $t$, then we can also compute quantity $X$".)

Update times, phase end times, known SA-pairs. Given $t \ge 1$, we define the phase end time $\sigma_{t,x,a}$ as the last time step before $t$ when $m$ new samples for $(x,a)$ were collected. Furthermore, we have already defined the update time $\tau_{t,x,a}$ as the last time a model update was considered at $(x,a)$. A model update is considered whenever $\hat Q_t(x,a)$ has decreased significantly since the last update time. This generally means that the model is updated, unless that would cause an increase of $\hat Q_t$ (in this case only $\tau$ is updated). Let $K_t$ be the set of SA-pairs that are "known" at time $t$, that is, where the agent has a reasonable approximation to $P(x,a,\cdot)$: $K_t \stackrel{\mathrm{def}}{=} \{(x,a) \in X \times A : \tau_{t,x,a} > 0\}$.

Approximate model. The approximate model $\hat M_t = (X, A, \hat P_t, \hat R_t)$ is defined by
$$\hat P_t(x,a,y) \stackrel{\mathrm{def}}{=} \begin{cases} n_{\tau_{t,x,a}}(x,a,y)/m, & \text{if } (x,a) \in K_t;\\ \mathbb I\{x = y\}, & \text{if } (x,a) \notin K_t,\end{cases} \qquad
\hat R_t(x,a) \stackrel{\mathrm{def}}{=} \begin{cases} R(x,a), & \text{if } (x,a) \in K_t;\\ R_{\max}, & \text{if } (x,a) \notin K_t\end{cases}$$
($\mathbb I\{\cdot\}$ is the indicator function). Let $\hat\pi_t$ be the optimal policy of the MDP $\hat M_t$ and $\hat Q_t$ be its action-value function.
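As a small illustration (our own, with invented container names), the approximate model of the definition above can be materialized from the stored per-pair counts and the known set as follows.

```python
import numpy as np

def approximate_model(counts, known, R, R_max, m):
    """Build (P_hat, R_hat) as in the definition of the approximate model.

    counts[x, a, y]: samples recorded at the last considered update for known pairs, shape (N, K, N)
    known[x, a]:     True iff (x, a) is in the known set K_t
    R:               true reward table (assumed known), shape (N, K)
    """
    N, K = R.shape
    P_hat = np.zeros((N, K, N))
    P_hat[np.arange(N), :, np.arange(N)] = 1.0      # unknown pairs self-loop
    R_hat = np.full((N, K), R_max)                  # unknown pairs get R_max
    for x in range(N):
        for a in range(K):
            if known[x, a]:
                P_hat[x, a] = counts[x, a] / m      # empirical frequencies
                R_hat[x, a] = R[x, a]
    return P_hat, R_hat
```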
For any H N and x X, define the H-step truncated value def H-1 i VM (x; H) = E i=0 Ri |x0 = x . Define the effective 1 -horizon H = H( 1 ) = def 1 1- ln Vmax . 1 The "nice" set. Let 2 = (1 - )/(15) and let Lt be the set of SA-pairs where the effect of using the approximate model is small: Lt = {(x, a) Kt : yX def (Pt -P )(x, a, y)V t (y) 3 2 }. Mt Note that all nice SA-pairs are known, but the set can be much smaller: we will prove later that (x, a) is nice if no model update is likely to happen in H steps, if the agent starts form (x, a) with its current knowledge. The Idea of nice sets appears first in the analysis of delayed Q-learning (Strehl et al., 2006), but in our case, the situation is more complex, because model updates in a single (x, a) may affect the whole state space. Escape event. Define Et as the event that the approximate model of the agent will change within the next H steps, either because a state becomes known or because the transition probabilities of an SA-pair are updated. Formally, Et = def H-1 i=0 {Qt+i+1 Lemma 2 Fix (0, 1), > 0, Vmax > 0, m N, an arbitrary MDP M = (X, A, P, R), and a (possibly history dependent) policy of M . Let x0 , a0 , r0 , x1 , a1 , r1 , . . . , xt-1 , at-1 , rt-1 , xt , . . . be the sequence of state-action-reward triplets generated by following in M from a random initial state x0 . Let Ft = {x0 , a0 , r1 . . . , xt , at , rt+1 , xt+1 }, and be a stopping time w.r.t. {Ft }. Fix (x, a) X × A and let tk be the k th time step such that t and (xt , at ) = (x, a) (possibly infinity). Let V : X R be any F -measurable value function satisfying 0 V (z) Vmax for all z X. Define the approximate transition probabilities for all y X: P (y) = def 1 m m I {ti < ; xti +1= y}+I(ti = )P (x, a, y) . i=1 max If m log 2 then with probability 1 - , 2 (P (x, a, y) - P (y))V (y) < . yX 8V 2 = Qt+i } {xt+i Lt+i } . Let et be the conditional probability that Et happens: def et = P(Et |Ft ) . Auxiliary MDP. We define an MDP M t = (X, A, P t , Rt ) which is the same as M on Lt ; all states in X \ Kt are absorbing with value Vmax , and the states in Kt \ Lt are also absorbing, but have the same value as in Mt for policy t : P t (x, a, y) = Rt (x, a) = def P (x, a, y), if (x, a) Lt ; I {x = y} , otherwise, R(x, a), if (x, a) Lt ; (1 - )Qt (x, a), otherwise. def Similar versions of the lemma can be found in the related literature, e.g. Lemma 9.1.1. of Kakade (2003). There are two significant differences: (1) Previous variants of the lemma did not consider the possibility ti = (which may happen with nonzero probability, for example, if the algorithm decides that no path containing x can be optimal, so it stops visiting it). Therefore, it is not correct to apply the previous variants unless under all policies the visit probabilities of all states are uniformly bounded away from zero. (2) We allow the value function V to be random as long as it is F -measurable. Therefore, we do not need to get a good approximation on all possible instances of the N -dimensional space of value functions, only the polynomially many (random) value functions Vt that we encounter during the run of the algorithm. This allows us to use fewer samples, but the improvement comes at a price: we need to make some changes to the algorithm itself so that the lemma becomes applicable. So, understanding the assumptions of the lemma is necessary to understand why the changes in the algorithm are necessary. 3.3. 
3.3. The number of non-optimal steps

Let $\hat V_t$ be the shorthand notation for $V^*_{\hat M_t}$, the optimal value function in the approximate model. We fix an $(x,a)$ pair, then use the shorthands $p(y) = P(x,a,y)$ and $\hat p_t(y) = \hat P_t(x,a,y)$.

Bounding the number of updates. The model at an SA-pair $(x,a)$ can be updated once when the pair is added to the set $K_t$, and at most $V_{\max}/\varepsilon_2$ times when the probabilities are updated. To see the latter, let $t > t_1$ be two consecutive update times (with the possibility $t_1 = 0$), and note that the update condition can be written as
$$\sum_{y\in X}\hat p_{t_1}(y)\hat V_{t_1}(y) - \sum_{y\in X}\hat p_t(y)\hat V_t(y) \ge \varepsilon_2.$$
Therefore, the quantity $\sum_{y\in X}\hat p_{t_k}(y)\hat V_{t_k}(y)$, $k = 0,1,2,\ldots$, taken along the successive update times $t_k$ of $(x,a)$, decreases by at least $\varepsilon_2$ on every update (maybe by more, if $Q'(x,a) \le \hat Q_t(x,a)$). It is at most $V_{\max}$ and is always nonnegative, so at most $V_{\max}/\varepsilon_2$ updates can happen because of each SA-pair. Thus, the total number of updates is at most $\Theta \stackrel{\mathrm{def}}{=} NK(V_{\max}/\varepsilon_2 + 1)$.

The value function is monotone. $\hat V_t$ is monotonically decreasing: if there is no update at time $t+1$, then $\hat V_{t+1} = \hat V_t$; and if there was an update at the SA-pair $(x_0, a_0)$, then $\hat V_{t+1}(x_0) \le \hat V_t(x_0)$, so it is easy to see that $\hat V_{t+1}(x) \le \hat V_t(x)$ for all $x$.

The nice set changes only when there is an update. Fix an SA-pair $(x,a) \in K_t$ and a timestep $t$. Let $t_1 = \sigma_{t,x,a} \le t$ be the phase end preceding $t$. If there were no updates anywhere in the MDP between $t_1$ and $t$ and $(x,a) \in L_{t_1}$, then $(x,a) \in L_t$ as well, because the set $L_t$ can only change if there is an update to the model.
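To get a feel for the magnitudes involved, the following snippet (ours, purely illustrative; the constants are taken from the reconstruction of Algorithm 1 and the definitions above, with $\varepsilon_1 = \varepsilon/5$) evaluates the batch size $m$, the effective horizon $H(\varepsilon_1)$ and the update budget $\Theta$ for one parameter setting.

```python
import numpy as np

def mormax_constants(N, K, gamma, eps, delta, Vmax):
    """Illustrative magnitudes of the quantities used in the analysis
    (eps1 = eps/5, eps2 = eps(1-gamma)/(15 gamma), constants as in Algorithm 1)."""
    eps1 = eps / 5.0
    eps2 = eps * (1 - gamma) / (15 * gamma)
    m = int(np.ceil(1800 * gamma**2 * Vmax**2 / (eps**2 * (1 - gamma)**2)
                    * np.log(144 * gamma * N**2 * K**2 * Vmax
                             / (eps * (1 - gamma) * delta))))
    H = int(np.ceil(np.log(Vmax / eps1) / (1 - gamma)))   # effective eps1-horizon
    Theta = int(N * K * (Vmax / eps2 + 1))                # bound on the number of updates
    return m, H, Theta

# Even for a small MDP the worst-case constants are huge, which is why much
# smaller values of m are used in practice (cf. the Discussion section).
print(mormax_constants(N=50, K=4, gamma=0.95, eps=0.5, delta=0.1, Vmax=20.0))
```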
If an SA-pair is not updated twice in a row, it is "nice". Fix an SA-pair $(x,a)$ and a time step $t$. With a slight abuse of notation, we denote the $k$-th phase end of $(x,a)$ by $\sigma_k$. For some $k$, $\sigma_{t,x,a} = \sigma_k$. Let $t_2 = \sigma_{k+1}$ be the (random) phase end following $t$ and $t_3 = \sigma_{k+2}$ be the next phase end. Note that both $\sigma_{k+1}$ and $\sigma_{k+2}$ are potentially infinite with some probability. Let $Z$ be the event that $t_3$ is finite and no update is done either at $t_2$ or at $t_3$. (When we need to refer to $Z$, $t_2$, or $t_3$ and the identity of $t$, $x$ and $a$ is important, we will use the respective symbols $Z_{t,x,a}$, $t_2^{x,a}(t)$ and $t_3^{x,a}(t)$.)

For every $k \in \mathbb N$ and $(x,a) \in X \times A$, let $F_{k,x,a}$ be the failure event that $\big|\sum_{y\in X}(p(y) - \bar p_{\sigma_{k+2}}(y))\hat V_{\sigma_{k+1}}(y)\big| \le \varepsilon_2$ does not hold (where $\bar p$ is the semi-empirical approximation defined in Lemma 2). The value function $\hat V_{\sigma_{k+1}}$ is $\mathcal F_{\sigma_{k+1}}$-measurable, so we can apply Lemma 2 to get $\mathbb P(F_{k,x,a}) \le \delta_1$ if $m \ge \frac{8V_{\max}^2}{\varepsilon_2^2}\log\frac{2}{\delta_1}$. Note that the failure probability is small for each $k$, but their sum over all $k \in \mathbb N$ would still grow indefinitely. Luckily, we only have to re-check the accuracy of the approximation when the nice set changes, and by the previous point, this only happens if $\hat Q_{\sigma_k} \ne \hat Q_{\sigma_{k+1}}$. So let us define the failure event
$$F \stackrel{\mathrm{def}}{=} \bigcup_{(x,a)\in X\times A}\ \bigcup_{k\in\mathbb N}\Big(F_{k,x,a} \cap \{\hat Q_{\sigma_k} \ne \hat Q_{\sigma_{k+1}}\}\Big).$$

Let us restrict our attention now to the event $Z \cap F^c$ (intuitively, this is the event that there are at least two more phase ends after $t$, that we do not update the model at either of them, and that the accuracy check above never fails). On $Z \cap F^c$, $t_2$ and $t_3$ are finite, so the semi-empirical approximations coincide with the empirical estimates collected in the phases ending at $t_2$ and $t_3$. Furthermore, both update checks returned with the result that an update is not necessary, so
$$\sum_{y\in X}\hat p_{t_1}(y)\hat V_{t_1}(y) - \sum_{y\in X}\hat p_{t_2}(y)\hat V_{t_2}(y) \le 2\varepsilon_2 \quad\text{and}\quad \sum_{y\in X}\hat p_{t_2}(y)\hat V_{t_2}(y) - \sum_{y\in X}\hat p_{t_3}(y)\hat V_{t_3}(y) \le 2\varepsilon_2.$$
Note that $\hat p_t$ can only change at phase ends, so $\hat p_t = \hat p_{t_1}$. Combining the two inequalities above with the accuracy guarantee of the event $F^c$ and the monotonicity of $\hat V_t$, a short calculation shows that, on $F^c \cap Z$, $(x,a) \in L_t$, because
$$\Big|\sum_{y\in X}\big(\hat p_t(y) - p(y)\big)\hat V_t(y)\Big| \le 3\varepsilon_2. \qquad (1)$$

Bounding the failure probability. For any fixed $(x,a)$, let $\ell = \ell(x,a)$ be the number of indices $k$ such that the model was updated somewhere during the interval $[\sigma_k, \sigma_{k+1})$, and denote the corresponding indices by $k_1, \ldots, k_\ell$. The quantity $\ell$ is random, but it is upper bounded by the total number of updates to the model, which is in turn upper bounded by the deterministic quantity $\Theta$. For $\ell < \lambda \le \Theta$, define $k_\lambda = k_\ell$; then
$$\mathbb P(F) \le \mathbb P\Big(\bigcup_{(x,a)\in X\times A}\ \bigcup_{\lambda\in\{1,\ldots,\Theta\}} F_{k_\lambda,x,a}\Big) \le \sum_{(x,a)\in X\times A}\ \sum_{\lambda=1}^{\Theta}\mathbb P(F_{k_\lambda,x,a}) \le NK\Theta\,\delta_1$$
(note that even though the indices $k_\lambda$ are random, they are stopping times, so we still have $\mathbb P(F_{k_\lambda,x,a}) \le \delta_1$).

Bound on the number of times a non-nice SA-pair is encountered. We wish to bound the cardinality of $B = \{t \,|\, (x_t,a_t) \notin L_t\}$ on $F^c$. Since we have shown that on $F^c \cap Z_{t,x,a}$ we have $(x,a) \in L_t$, it follows that, on $F^c$, $B \subseteq \bigcup_{x,a}\{t \,|\, Z^c_{t,x_t,a_t}\}$. Now, $Z^c_{t,x,a}$ happens when either $t_3^{x,a}(t)$ is infinite or an update is done at $t_2^{x,a}(t)$ or at $t_3^{x,a}(t)$. Therefore, on $F^c$,
$$B \subseteq \bigcup_{(x,a)\in X\times A}\{t \,|\, (x,a)=(x_t,a_t),\ t_3^{x_t,a_t}(t) = \infty\} \cup \{t \,|\, \text{update at } t_2^{x_t,a_t}(t)\} \cup \{t \,|\, \text{update at } t_3^{x_t,a_t}(t)\}.$$
The cardinality of the first set on the right-hand side is at most $2mNK$: for each SA-pair $(x,a)$ that is visited only a finite number of times, there is a time when it is visited for the last time, and $t_3^{x,a}(t) = \infty$ can only happen at the at most $2m$ visits of $(x,a)$ preceding this last visit. In order to bound the cardinality of the second set, recall that the total number of updates is at most $\Theta$. Pick any update with update time $u$. If $u = t_2^{x_t,a_t}(t)$ for some time $t$, then $t$ must be one of the at most $m$ visits to $(x_t,a_t)$ that happened just before $u$. Since this way all elements of the said set are covered, the set's cardinality cannot be larger than $m\Theta$. The same reasoning holds for the third set, showing that the total cardinality is bounded by $2m\Theta + 2mNK$.

Counting non-optimal steps. Let $T_{\mathrm{escape}} = \sum_{t=0}^{\infty}\mathbb I\{e_t > \varepsilon_1/V_{\max}\}$ be the number of time steps where the chance of running into a non-nice SA-pair (or of a model change) within $H$ steps is high.

Lemma 3 Let $F'$ be the failure event that
$$T_{\mathrm{escape}} > 12\,mNKH^2 + 2H\sqrt{\frac{12\,V_{\max}}{\varepsilon_1}\,mNKH\,\log\frac{1}{\delta_1}} + 4H\,\frac{V_{\max}}{\varepsilon_1}\log\frac{1}{\delta_1}.$$
Then $\mathbb P(F') \le \delta_1$.

Next, we will show that, except for these at most $T_{\mathrm{escape}}$ time steps, our algorithm is near-optimal.
3.4. Proving near-optimality

Fix a time step $t$ and let us assume that $e_t \le \varepsilon_1/V_{\max}$.

Truncation. All rewards are nonnegative, so horizon-$H$ truncation can only decrease values:
$$V^{\mathrm{Mormax}}_{t,M}(x) \ge V^{\mathrm{Mormax}}_{t,M}(x; H) \quad\text{for all } x \in X. \qquad (2)$$

Substitution with a stationary policy and an auxiliary model. The event that the model is updated in the next $H$ steps is contained in the event $E_t$, which has low probability: $e_t = \mathbb P(E_t\,|\,\mathcal F_t) \le \varepsilon_1/V_{\max}$. Furthermore, the event that the agent leaves $L_t$ in the next $H$ steps is also contained in $E_t$. Therefore, on $(E_t)^c$, the policy of Mormax stays stationary at $\hat\pi_t$ and it does not leave the set $L_t$. We also know that on $L_t$, $M$ and $\bar M_t$ are identical. Therefore,
$$V^{\mathrm{Mormax}}_{t,M}(x_t; H) \ge V^{\hat\pi_t}_{\bar M_t}(x_t; H) - \varepsilon_1. \qquad (3)$$

Undo truncation. If the horizon satisfies $H \ge \frac{1}{1-\gamma}\log\frac{V_{\max}}{\varepsilon_1}$, the effect of truncation is small enough to give
$$V^{\hat\pi_t}_{\bar M_t}(x; H) \ge V^{\hat\pi_t}_{\bar M_t}(x) - \varepsilon_1 \quad\text{for all } x \in X. \qquad (4)$$

Approximation with the empirical model. If the transition probabilities of two MDPs are close to each other, then so are the value functions of a fixed policy:

Lemma 4 Let $M_1 = (X, A, P_1, R)$ and $M_2 = (X, A, P_2, R)$ be two MDPs, and fix a discount factor $\gamma$, a precision $\varepsilon_1$, and a policy $\pi$. Let $V_1$ and $V_2$ be the value functions of $\pi$ in the respective MDPs. If
$$\Big|\sum_{y\in X}(P_1 - P_2)(x,a,y)V_1(y)\Big| \le \varepsilon_1(1-\gamma)/\gamma$$
for every $(x,a) \in X \times A$, then $|V_1(x) - V_2(x)| \le \varepsilon_1$ for every $x \in X$.

For any $(x,a) \in L_t$, $\bar P_t(x,a,y) = P(x,a,y)$. Recall that $L_t$ is the set of SA-pairs where the effect of using the approximate model is small: $L_t = \{(x,a) \in K_t : |\sum_{y}(\hat P_t - P)(x,a,y)V^{\hat\pi_t}_{\hat M_t}(y)| \le 3\varepsilon_2\}$. On $(L_t)^c$, the states are absorbing in $\bar M_t$ with exactly the value that $\hat\pi_t$ attains in $\hat M_t$, so
$$\Big|\sum_{y\in X}(\hat P_t - \bar P_t)(x,a,y)V^{\hat\pi_t}_{\hat M_t}(y)\Big| \le 3\varepsilon_2 = \varepsilon_1(1-\gamma)/\gamma$$
holds for every SA-pair, and by Lemma 4 this means that
$$V^{\hat\pi_t}_{\bar M_t}(x) \ge V^{\hat\pi_t}_{\hat M_t}(x) - \varepsilon_1 \quad\text{for all } x \in X. \qquad (5)$$

Optimality of $\hat\pi_t$. The optimal policy of $\hat M_t$ is $\hat\pi_t$, so its value is at least as large as that of any other policy, in particular of $\pi^*$:
$$V^{\hat\pi_t}_{\hat M_t}(x) \ge V^{\pi^*}_{\hat M_t}(x) \quad\text{for all } x \in X. \qquad (6)$$

Substitution with another auxiliary model. Let us now introduce a new auxiliary MDP $\tilde M_t = (X, A, \tilde P_t, R)$ that is identical to $\hat M_t$ on $K_t$ and to $M$ outside:
$$\tilde P_t(x,a,y) \stackrel{\mathrm{def}}{=} \begin{cases}\hat P_t(x,a,y), & \text{if } (x,a) \in K_t;\\ P(x,a,y), & \text{if } (x,a) \notin K_t.\end{cases}$$
The following lemma tells us that if the one-step lookaheads are consistently smaller in one MDP than in another, then the value function of a fixed policy is also smaller there:

Lemma 5 Consider two MDPs $M_1 = (X, A, P_1, R_1)$ and $M_2 = (X, A, P_2, R_2)$, and fix a discount factor $\gamma$ and a stationary policy $\pi$. Let $V_1$ and $V_2$ be the value functions of $\pi$ in the respective MDPs and let $v \in \mathbb R^X$ be an arbitrary value function. If
$$R_1(x,a) + \gamma\sum_{y\in X}P_1(x,a,y)v(y) \le R_2(x,a) + \gamma\sum_{y\in X}P_2(x,a,y)v(y)$$
for every $(x,a) \in X \times A$, then $V_1(x) \le V_2(x)$ for all $x \in X$.

Applying the lemma to $\tilde M_t$ and $\hat M_t$, we get that
$$V^{\pi^*}_{\hat M_t}(x) \ge V^{\pi^*}_{\tilde M_t}(x) \quad\text{for all } x \in X. \qquad (7)$$

Substitution with the real model. For all $(x,a) \in X \times A$, consider the quantity $d_t(x,a) = \big|\sum_{y\in X}(P(x,a,y) - \tilde P_t(x,a,y))V^*(y)\big|$. For $(x,a) \notin K_t$, $\tilde P_t(x,a,y) = P(x,a,y)$, so $d_t(x,a) = 0$. For $(x,a) \in K_t$, $\tilde P_t(x,a,\cdot) = \hat P_k(x,a,\cdot)$ for some $k > 0$ with $k \le t$. The optimal value function $V^*$ is $\mathcal F_{k-1}$-measurable (as it is non-random), so, by Lemma 2, if $m \ge \frac{8V_{\max}^2}{\varepsilon_2^2}\log\frac{2}{\delta_1}$, then $\big|\sum_{y\in X}(P(x,a,y) - \bar P(x,a,y))V^*(y)\big| \le \varepsilon_2$ except for an event $F''_{x,a}$ with probability $\mathbb P(F''_{x,a}) \le \delta_1$. Because $k \le t$, $k$ is finite, so $\bar P(x,a,y) = \hat P_t(x,a,y) = \tilde P_t(x,a,y)$ for all $y \in X$, and thus $d_t(x,a) \le \varepsilon_2 < 3\varepsilon_2 = \varepsilon_1(1-\gamma)/\gamma$. Let $F'' = \bigcup_{(x,a)\in X\times A}F''_{x,a}$, for which $\mathbb P(F'') \le NK\delta_1$. Consider now the case when the event $(F'')^c$ holds. In this case, by Lemma 4,
$$V^{\pi^*}_{\tilde M_t}(x) \ge V^{\pi^*}_M(x) - \varepsilon_1 = V^*_M(x) - \varepsilon_1 \quad\text{for all } x \in X. \qquad (8)$$
Bounding failure. The algorithm may encounter three failure events: $F$, $F'$ and $F''$. Their total probability is
$$\mathbb P(F \cup F' \cup F'') \le \mathbb P(F) + \mathbb P(F') + \mathbb P(F'') \le N^2K^2\big(V_{\max}/\varepsilon_2 + 1\big)\delta_1 + \delta_1 + NK\delta_1.$$
This is at most $\delta$ for $\delta_1 = \delta/(4N^2K^2V_{\max}/\varepsilon_2)$. Apart from this less-than-$\delta$ chance of failure, the algorithm will make at most
$$12\,mNKH^2 + 2H\sqrt{\frac{12\,V_{\max}}{\varepsilon_1}\,mNKH\,\log\frac{1}{\delta_1}} + 4H\,\frac{V_{\max}}{\varepsilon_1}\log\frac{1}{\delta_1}$$
non-near-optimal steps. In the rest of the time steps, adding up eqs. (2)–(8) gives $V^{\mathrm{Mormax}}_{t,M}(x_t) \ge V^*_M(x_t) - 5\varepsilon_1$. If $\varepsilon_1 = \varepsilon/5$, we get $\varepsilon$-optimality. Expressed with the original parameters, the necessary sample size is
$$m = \frac{8V_{\max}^2}{\varepsilon_2^2}\log\frac{2}{\delta_1} \le \frac{1800\,\gamma^2V_{\max}^2}{\varepsilon^2(1-\gamma)^2}\,\log\frac{144\,\gamma N^2K^2V_{\max}}{\varepsilon(1-\gamma)\,\delta},$$
and the upper bound on the number of non-optimal time steps is $\tilde O\!\left(\frac{NKV_{\max}^2}{\varepsilon^2(1-\gamma)^4}\right)$.

4. Discussion

For the sake of simplicity, we assumed that the reward function is known and only the transitions have to be learned. When the reward is also learnt, we need to make only slight modifications to the algorithm and the analysis. For example, instead of bounding $\sum_y(\hat P_t(x,a,y) - P(x,a,y))V(y)$, we need to bound $\big(\hat R_t(x,a) + \sum_y \hat P_t(x,a,y)V(y)\big) - \big(R(x,a) + \sum_y P(x,a,y)V(y)\big)$. The sample complexity bounds remain exactly the same.

Of all PAC-MDP algorithms, proving the bounds of Rmax is the simplest. Nevertheless, the efficiency of the other listed methods is proven using the same tricks, so we believe that their complexity upper bounds can be improved likewise, so that the dependence on $N$ becomes log-linear.

A strange property of Mormax is that when a new batch of samples is collected at an SA-pair, it throws away the old samples. The flow of fresh samples is crucial for the proof. However, from a practical point of view, throwing away information seems suboptimal, and it is not clear whether it is really necessary. We also note that sample complexity bounds, including ours, are extremely conservative, because they are for the worst case. In practice, much lower values of $m$ are used, so re-estimation of the probabilities can have a significant effect. On the other hand, preserving old samples could have benefits, too.

Acknowledgments

This research was supported by iCORE and Alberta Ingenuity, both part of Alberta Innovates - Technology Futures, NSERC and the PASCAL2 Network of Excellence under EC grant no. 216886. Cs. Szepesvári is on leave from MTA SZTAKI.

References

Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), NIPS-21, pp. 89-96. MIT Press, 2009.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.

Brafman, R. I. and Tennenholtz, M. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2002.

Kakade, S. M. On the Sample Complexity of Reinforcement Learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

Kearns, M. and Singh, S. Finite-sample convergence rates for Q-learning and indirect algorithms. In Jordan, M. I., Kearns, M. J., and Solla, S. A. (eds.), NIPS-10, pp. 996-1002. MIT Press, 1998.

Kearns, M. J. and Singh, S. P. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209-232, 2002.

Li, L. A Unifying Framework for Computational Reinforcement Learning Theory. PhD thesis, Rutgers University, New Brunswick, NJ, USA, October 2009.

Strehl, A. L. and Littman, M. L. A theoretical analysis of model-based interval estimation. In De Raedt, L. and Wrobel, S. (eds.), ICML 2005, pp. 857-864. ACM, 2005.

Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. PAC model-free reinforcement learning. In Cohen, W. W. and Moore, A. (eds.), ICML 2006, pp. 881-888. ACM, 2006.

Szita, I. and Lőrincz, A. The many faces of optimism: a unifying approach. In Cohen, W. W., McCallum, A., and Roweis, S. T. (eds.), ICML 2008, pp. 1048-1055. ACM, 2008.