Near-optimal Regret Bounds for Reinforcement Learning

Peter Auer, Thomas Jaksch, Ronald Ortner
University of Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria
{auer,tjaksch,rortner}@unileoben.ac.at

Abstract

For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter $D$ if for any pair of states $s, s'$ there is a policy which moves from $s$ to $s'$ in at most $D$ steps (on average). We present a reinforcement learning algorithm with total regret $\tilde{O}(DS\sqrt{AT})$ after $T$ steps for any unknown MDP with $S$ states, $A$ actions per state, and diameter $D$. This bound holds with high probability. We also present a corresponding lower bound of $\Omega(\sqrt{DSAT})$ on the total regret of any learning algorithm. Both bounds demonstrate the utility of the diameter as structural parameter of the MDP.

1 Introduction

In a Markov decision process (MDP) $M$ with finite state space $\mathcal{S}$ and finite action space $\mathcal{A}$, a learner in state $s \in \mathcal{S}$ needs to choose an action $a \in \mathcal{A}$. When executing action $a$ in state $s$, the learner receives a random reward $r$ according to a distribution $q(r|s,a)$ on $[0,1]$. Further, according to the transition probabilities $p(s'|s,a)$, a random transition to a state $s' \in \mathcal{S}$ will be observed. Reinforcement learning of MDPs is considered as a standard model for learning with delayed feedback.

In contrast to substantial other work on reinforcement learning, where the performance of the learned policy is considered (see e.g. [1, 2] and also the discussion and references given in the introduction of [3]), we are interested in the performance of the learning algorithm during learning. For that, we compare the rewards collected by the algorithm during learning with the rewards of an optimal policy. In this paper we will consider undiscounted rewards. The accumulated reward of an algorithm $\mathfrak{A}$ after $T$ steps in an MDP $M$ is defined as
$$R(M, \mathfrak{A}, s, T) := \sum_{t=1}^{T} r_t,$$
where $s$ is the initial state and $r_t$ are the rewards received during the execution of algorithm $\mathfrak{A}$. The average reward
$$\rho(M, \mathfrak{A}, s) := \lim_{T \to \infty} \tfrac{1}{T}\, \mathbb{E}\bigl[R(M, \mathfrak{A}, s, T)\bigr]$$
can be maximized by an appropriate stationary policy $\pi: \mathcal{S} \to \mathcal{A}$ which defines an optimal action for each state [4].

The difficulty of learning an MDP does not only depend on its size (given by the number of states and the number of actions), but also on its internal structure. In order to measure this internal structure we propose a new parameter, the diameter $D$ of the MDP. The diameter $D$ is the minimal average time it takes to move from any state $s$ to any other state $s'$, using an appropriate policy:

Definition 1. Let $T(s'|M, \pi, s)$ be the first (random) time step in which state $s'$ is reached when policy $\pi$ is executed on MDP $M$ with initial state $s$. Then the diameter of $M$ is given by
$$D(M) := \max_{s, s' \in \mathcal{S}} \; \min_{\pi: \mathcal{S} \to \mathcal{A}} \; \mathbb{E}\bigl[T(s'|M, \pi, s)\bigr].$$

A finite diameter seems necessary for interesting bounds on the regret of any algorithm with respect to an optimal policy, since otherwise some parts of the MDP are not reachable. Moreover, when a learner explores suboptimal actions, this may take him into a "bad part" of the MDP from which it may take $D$ steps to reach a "good part" of the MDP again. Hence, the learner may suffer regret $D$ for such exploration, and it is very plausible that the diameter thus appears in the regret bound.
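As a concrete illustration of Definition 1 (an added sketch, not part of the original paper), the diameter of a small MDP with known transition probabilities can be approximated numerically: for each target state $s'$, the quantity $\min_{\pi} \mathbb{E}[T(s'|M,\pi,s)]$ is the optimal expected first-passage time, which can be computed by value iteration on the corresponding stochastic shortest-path problem. The array layout and the function name below are illustrative choices.

```python
import numpy as np

def diameter(P, iters=100_000, tol=1e-9):
    """P[s, a, s'] = p(s'|s, a); returns an approximation of D(M) from Definition 1."""
    S, A, _ = P.shape
    D = 0.0
    for target in range(S):
        h = np.zeros(S)                  # h[s] ~ min_pi E[T(target | M, pi, s)]
        for _ in range(iters):
            q = 1.0 + P @ h              # one step plus expected remaining time, per action
            h_new = q.min(axis=1)        # best action for reaching the target quickly
            h_new[target] = 0.0          # the target itself needs 0 further steps
            if np.max(np.abs(h_new - h)) < tol:
                h = h_new
                break
            h = h_new
        D = max(D, h.max())              # max over start states s and target states s'
    return D

# Example: a two-state MDP whose single action flips the state with probability 0.1
# has diameter 1/0.1 = 10 (cf. the construction in Section 4).
delta = 0.1
P = np.array([[[1 - delta, delta]],
              [[delta, 1 - delta]]])     # shape (2, 1, 2)
print(diameter(P))                       # approximately 10
```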
For MDPs with finite diameter (which usually are called communicating, see e.g. [4]) the optimal average reward $\rho^*$ does not depend on the initial state (cf. [4], Section 8.3.3), and we set
$$\rho^*(M) := \rho^*(M, s) := \max_{\pi} \rho(M, \pi, s).$$
The optimal average reward is the natural benchmark for a learning algorithm $\mathfrak{A}$, and we define the total regret of $\mathfrak{A}$ after $T$ steps as^1
$$\Delta(M, \mathfrak{A}, s, T) := T\rho^*(M) - R(M, \mathfrak{A}, s, T).$$

In the following, we present our reinforcement learning algorithm UCRL2 (a variation of the UCRL algorithm of [5]) which uses upper confidence bounds to choose an optimistic policy. We show that the total regret of UCRL2 after $T$ steps is $\tilde{O}(D|\mathcal{S}|\sqrt{|\mathcal{A}|T})$. A corresponding lower bound of $\Omega(\sqrt{D|\mathcal{S}||\mathcal{A}|T})$ on the total regret of any learning algorithm is given as well. These results establish the diameter as an elemental parameter of an MDP. Further, the diameter seems to be more natural than other parameters that have been proposed for various PAC and regret bounds, such as the mixing time [3, 6] or the hitting time of an optimal policy [7] (cf. the discussion below). In particular, the diameter does not refer to an optimal policy but only to the transition structure of the MDP.

Footnote 1: It can be shown that $\max_{\mathfrak{A}} \mathbb{E}[R(M, \mathfrak{A}, s, T)] = T\rho^*(M) + O(D(M))$ and $\max_{\mathfrak{A}} R(M, \mathfrak{A}, s, T) = T\rho^*(M) + \tilde{O}(\sqrt{T})$ with high probability.

1.1 Relation to Previous Work

We first compare our results to the well-known PAC bounds for the algorithms E3 of Kearns and Singh [3], and R-Max of Brafman and Tennenholtz [6] (see also Kakade [8]). These algorithms achieve $\varepsilon$-optimal average reward with probability $1-\delta$ after time polynomial in $\tfrac{1}{\varepsilon}$, $\tfrac{1}{\delta}$, $S$, $A$, and the mixing time $T^{\varepsilon}_{\mathrm{mix}}$ (see below). As this polynomial dependence on $\varepsilon$ is of order $1/\varepsilon^3$, the PAC bounds translate into $T^{2/3}$ regret bounds at the best. Moreover, both algorithms need the $\varepsilon$-return mixing time $T^{\varepsilon}_{\mathrm{mix}}$ of an optimal policy $\pi^*$ as input parameter.^2 This parameter $T^{\varepsilon}_{\mathrm{mix}}$ is the number of steps until the average reward of $\pi^*$ over these $T^{\varepsilon}_{\mathrm{mix}}$ steps is $\varepsilon$-close to the optimal average reward $\rho^*$. It is easy to construct MDPs of diameter $D$ with $T^{\varepsilon}_{\mathrm{mix}} \approx D/\varepsilon$. This additional dependency on $\varepsilon$ further increases the above mentioned regret bounds for E3 and R-Max. Also, the exponents of the parameters $|\mathcal{S}|$, $|\mathcal{A}|$, and $T^{\varepsilon}_{\mathrm{mix}}$ in the PAC bounds of [3, 6] are substantially larger than in our bound.

The MBIE algorithm of Strehl and Littman [9, 10] applies, similarly to our approach, confidence bounds to compute an optimistic policy. However, Strehl and Littman consider only a discounted reward setting, which seems to be less natural when dealing with regret. Their definition of regret measures the difference between the rewards^3 of an optimal policy and the rewards of the learning algorithm along the trajectory taken by the learning algorithm. In contrast, we are interested in the regret of the learning algorithm with respect to the rewards of the optimal policy along the trajectory of the optimal policy.^4

Footnote 2: The knowledge of this parameter could be eliminated by guessing $T^{\varepsilon}_{\mathrm{mix}}$ to be $1, 2, \ldots$, so that sooner or later the correct $T^{\varepsilon}_{\mathrm{mix}}$ will be reached (cf. [3, 6]). However, since there is no condition on when to stop increasing $T^{\varepsilon}_{\mathrm{mix}}$, the assumed mixing time eventually becomes arbitrarily large, so that the PAC bounds become exponential in the true $T^{\varepsilon}_{\mathrm{mix}}$ (cf. [6]).
Footnote 3: Actually, the state values.
Footnote 4: Indeed, one can construct MDPs for which these two notions of regret differ significantly. E.g., set the discount factor $\gamma = 0$. Then any policy which maximizes immediate rewards achieves 0 regret in the notion of Strehl and Littman. But such a policy may not move to states where the optimal reward is obtained.
Tewari and Bartlett [7] propose a generalization of the index policies of Burnetas and Katehakis [11]. These index policies choose actions optimistically by using confidence bounds only for the estimates in the current state. The regret bounds for the index policies of [11] and the OLP algorithm of [7] are asymptotically logarithmic in $T$. However, unlike our bounds, those bounds also depend on the gap between the "quality" of the best and the second best action, and hide a term exponential in the number of states. Actually, it is possible to prove a corresponding gap-dependent logarithmic bound for our UCRL2 algorithm as well (cf. Remark 4 below). This bound holds uniformly over time and under weaker assumptions: While [7] and [11] consider only ergodic MDPs in which any policy will reach every state after a sufficient number of steps, we make only the more natural assumption of a finite diameter.

1.2 Results

We summarize the results achieved for our algorithm UCRL2 (see Section 2) and also state a corresponding lower bound. We assume an unknown MDP $M$ to be learned, with finite diameter $D := D(M)$, $S := |\mathcal{S}|$ states, and $A := |\mathcal{A}|$ actions. Only $S$ and $A$ are known to the learner.

Theorem 2. For any initial state $s \in \mathcal{S}$ and any $T \ge 1$, with probability at least $1-\delta$ the regret of UCRL2 is bounded by
$$\Delta(M, \mathrm{UCRL2}, s, T) \;\le\; c_1 \cdot D S \sqrt{A T \log\tfrac{T}{\delta}}\,,$$
for a constant $c_1$ which is independent of $M$.

It is straightforward to obtain from Theorem 2 the following sample complexity bound (a short derivation is sketched below, after Theorem 5).

Corollary 3. With probability $1-\delta$ the average per-step regret of UCRL2 is at most $\varepsilon$ after
$$T \;\ge\; c_1^2\, \frac{D^2 S^2 A \log\tfrac{T}{\delta}}{\varepsilon^2}$$
steps, where $c_1$ is a constant independent of $M$.

Remark 4. The proof method of Theorem 2 can be modified to give for each initial state $s$ an alternative upper bound of
$$\mathbb{E}\bigl[\Delta(M, \mathrm{UCRL2}, s, T)\bigr] = O\Bigl(\frac{D^2 S^2 A \log T}{g}\Bigr),$$
where $g := \rho^*(M) - \max_{\pi, s}\{\rho(M, \pi, s) : \rho(M, \pi, s) < \rho^*(M)\}$ is the gap between the optimal average reward and the second best average reward achievable in $M$. This bound can be derived by considering the number $L$ of suboptimal steps taken by UCRL2. Analogously to Theorem 2, an upper bound of $O(DS\sqrt{LA\log T})$ on the accumulated regret can be shown. As the loss in each suboptimal step is at least $g$, one has $gL = O(DS\sqrt{LA\log T})$, which gives
$$L = O\Bigl(\frac{D^2 S^2 A \log T}{g^2}\Bigr).$$
A refined analysis of the regret in each suboptimal step improves the exponent of $g$ and yields the claimed bound, as the regret of each suboptimal step is bounded by 1. See Appendix C of [12].

These new bounds are improvements over the bounds that have been achieved in [5] for the original UCRL algorithm in various respects: the exponents of the relevant parameters have been decreased considerably, the parameter $D$ we use here is substantially smaller than the corresponding mixing time in [5], and finally, the ergodicity assumption is replaced by the much weaker and more natural assumption that the MDP has finite diameter.

The following is an accompanying lower bound on the expected regret.

Theorem 5. For any algorithm $\mathfrak{A}$, any natural numbers $T$, $S$, $A > 1$, and any $D \ge \log_A S$, there is an MDP $M$ with $S$ states, $A$ actions, and diameter^5 $D$, such that for any initial state $s \in \mathcal{S}$ the expected regret of $\mathfrak{A}$ after $T$ steps is
$$\mathbb{E}\bigl[\Delta(M, \mathfrak{A}, s, T)\bigr] = \Omega\bigl(\sqrt{DSAT}\bigr).$$

Footnote 5: The diameter of any MDP with $S$ states and $A$ actions is at least $\log_A S$.
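To make the step from Theorem 2 to Corollary 3 explicit, the following short calculation (added here for illustration, using the form of Theorem 2 as stated above) requires the average per-step regret to be at most $\varepsilon$ and solves for $T$:

```latex
\[
  \frac{\Delta(M,\mathrm{UCRL2},s,T)}{T}
  \;\le\; \frac{c_1\, D S \sqrt{A T \log(T/\delta)}}{T}
  \;=\; c_1\, D S \sqrt{\frac{A \log(T/\delta)}{T}}
  \;\le\; \varepsilon
  \quad\Longleftrightarrow\quad
  T \;\ge\; c_1^2\, \frac{D^2 S^2 A \log(T/\delta)}{\varepsilon^2}.
\]
```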
In a different setting, a modification of UCRL2 is also able to deal with changing MDPs.

Remark 6. Assume that the MDP (i.e. its transition probabilities and reward distributions) is allowed to change $k$ times up to step $T$, such that the diameter is always at most $D$. Restarting UCRL2 at steps $i^3/3$ for $i = 1, 2, 3, \ldots$, the respective regret is upper bounded by (for constants $c_1$, $c_2$)
$$\Bigl(c_1 \cdot k + c_2 \cdot D S \sqrt{A \log\tfrac{T}{\delta}}\,\Bigr) \cdot T^{2/3}$$
with probability $1-\delta$. For details see Appendix D of [12].

MDPs with changing rewards have already been considered in [13], where (best possible) upper bounds of $O(\sqrt{T})$ are derived. Unlike in the setting considered in Remark 6, in [13] the learner is assumed to know the transition probabilities, while the rewards are allowed to change at every step.

2 The UCRL2 Algorithm

Our algorithm is a variation of the UCRL algorithm of [5]. As its predecessor it implements the paradigm of "optimism in the face of uncertainty". As such, it defines a set $\mathcal{M}$ of plausible MDPs (according to suitable confidence regions). Among these it chooses an optimistic MDP $\tilde{M}$ (with respect to the average reward) and executes a policy which is (nearly) optimal on $\tilde{M}$.

More precisely, UCRL2 (Figure 1) proceeds in episodes and computes a new policy $\tilde{\pi}_k$ only at the beginning of each episode $k$. The lengths of the episodes are not fixed a priori, but depend on the observations made (see below). In Steps 2-3, UCRL2 computes estimates $\hat{p}_k(s'|s,a)$ and $\hat{r}_k(s,a)$ for the transition probabilities and rewards from the observations made up to episode $k$. In Step 4, a set $\mathcal{M}_k$ of plausible MDPs is defined in terms of confidence regions around the estimated rewards $\hat{r}_k(s,a)$ and transition probabilities $\hat{p}_k(s'|s,a)$. This guarantees that with high probability the true MDP $M$ is in $\mathcal{M}_k$. In Step 5, extended value iteration (cf. Section 2.1) is used to choose a near-optimal policy $\tilde{\pi}_k$ on an optimistic MDP $\tilde{M}_k \in \mathcal{M}_k$. This policy $\tilde{\pi}_k$ is executed throughout episode $k$ (Step 6). Episode $k$ ends when a state $s$ is visited in which the action $a = \tilde{\pi}_k(s)$ induced by the current policy has been chosen in episode $k$ equally often as before episode $k$. Thus the total number of visits to any state-action pair at most doubles during an episode. The count $v_k(s,a)$ keeps track of the visits to $(s,a)$ in episode $k$.^6

2.1 Extended Value Iteration

The standard value iteration algorithm (see e.g. [4]) identifies a near-optimal policy by finding the optimal action in each state for a given set of state values, where the state values are updated until a convergence criterion is satisfied. In our setting, in addition to finding a maximizing action in each state, for each action we also need to determine the optimal choice of probability and reward distributions from the associated confidence regions (which corresponds to the choice of $\tilde{M}_k$). This can be done by an extended variant of the standard value iteration algorithm. Unlike in standard value iteration, in addition to maximizing over the actions, for each action one has to maximize over the admissible probability distributions. The rewards in $\tilde{M}_k$ are simply set to the maximal admissible value according to (1). Since the maximization in value iteration is a linear problem, only a finite set of probability distributions has to be considered for each action. Thus only little computational overhead compared to standard value iteration is needed by our extended variant. It can be shown that this procedure indeed identifies a near-optimal MDP-policy pair $(\tilde{M}_k, \tilde{\pi}_k)$. Further, convergence to a policy $\tilde{\pi}_k$ with state-independent average reward $\tilde{\rho}_k$ can be guaranteed. For details see Appendix A of [12].
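The inner maximization over admissible transition probabilities described above is a linear program over an $\ell_1$-ball intersected with the probability simplex; its optimum puts as much probability mass as the confidence region allows on the state with the largest current value. The following sketch shows one standard way to implement this step (an added illustration, not the authors' code; the function and variable names are chosen for this example):

```python
import numpy as np

def optimistic_transition(p_hat, u, beta):
    """Distribution within L1-distance beta of the estimate p_hat that maximizes p @ u."""
    p = p_hat.astype(float).copy()             # p_hat is assumed to be a probability vector
    best = int(np.argmax(u))
    p[best] = min(1.0, p[best] + beta / 2.0)   # add as much mass as allowed to the best state
    for s in np.argsort(u):                    # ... and take it away from the worst states first
        if s == best:
            continue
        excess = p.sum() - 1.0
        if excess <= 0:
            break
        p[s] -= min(p[s], excess)
    return p

# Usage: u is the current value vector of extended value iteration,
# beta the confidence radius of inequality (2).
p_hat = np.array([0.5, 0.3, 0.2])
u = np.array([0.0, 1.0, 3.0])
print(optimistic_transition(p_hat, u, beta=0.4))   # -> [0.3, 0.3, 0.4]
```

In the example, the returned distribution differs from `p_hat` by exactly the allowed $\ell_1$-radius `beta`, shifting mass from the lowest-value state to the highest-value one.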
Input: A confidence parameter $\delta \in (0, 1]$.
Initialization: Set $t := 1$, and observe the initial state $s_1$.
For episodes $k = 1, 2, \ldots$ do

Initialize episode $k$:
1. Set the start time of episode $k$, $t_k := t$.
2. For all $(s,a)$ in $\mathcal{S} \times \mathcal{A}$ initialize the state-action counts for episode $k$, $v_k(s,a) := 0$. Further, set the state-action counts prior to episode $k$,
   $$N_k(s,a) := \#\bigl\{\tau < t_k : s_\tau = s,\, a_\tau = a\bigr\}.$$
3. For $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$ set the observed accumulated rewards and the transition counts prior to episode $k$,
   $$R_k(s,a) := \sum_{\tau=1}^{t_k-1} r_\tau \mathbf{1}_{s_\tau=s,\, a_\tau=a}, \qquad P_k(s,a,s') := \#\bigl\{\tau < t_k : s_\tau = s,\, a_\tau = a,\, s_{\tau+1} = s'\bigr\},$$
   and compute estimates $\hat{r}_k(s,a) := \frac{R_k(s,a)}{\max\{1, N_k(s,a)\}}$, $\hat{p}_k(s'|s,a) := \frac{P_k(s,a,s')}{\max\{1, N_k(s,a)\}}$.

Compute policy $\tilde{\pi}_k$:
4. Let $\mathcal{M}_k$ be the set of all MDPs with states and actions as in $M$, and with transition probabilities $\tilde{p}(\cdot|s,a)$ close to $\hat{p}_k(\cdot|s,a)$, and rewards $\tilde{r}(s,a) \in [0,1]$ close to $\hat{r}_k(s,a)$, that is,
   $$\bigl|\tilde{r}(s,a) - \hat{r}_k(s,a)\bigr| \;\le\; \sqrt{\frac{3.5\,\log(2SAt_k/\delta)}{\max\{1, N_k(s,a)\}}} \qquad (1)$$
   and
   $$\bigl\|\tilde{p}(\cdot|s,a) - \hat{p}_k(\cdot|s,a)\bigr\|_1 \;\le\; \sqrt{\frac{2S\,\log(2At_k/\delta)}{\max\{1, N_k(s,a)\}}}\,. \qquad (2)$$
5. Set $\varepsilon_k := 1/\sqrt{t_k}$ and use extended value iteration (see Section 2.1) to find a policy $\tilde{\pi}_k$ and an optimistic MDP $\tilde{M}_k \in \mathcal{M}_k$ such that $\rho(\tilde{M}_k, \tilde{\pi}_k, s) \ge \max_{M' \in \mathcal{M}_k,\, \pi,\, s'} \rho(M', \pi, s') - \varepsilon_k$ for all $s \in \mathcal{S}$.

Execute policy $\tilde{\pi}_k$:
6. While $v_k\bigl(s_t, \tilde{\pi}_k(s_t)\bigr) < \max\bigl\{1, N_k\bigl(s_t, \tilde{\pi}_k(s_t)\bigr)\bigr\}$ do
   (a) Choose action $a_t := \tilde{\pi}_k(s_t)$, obtain reward $r_t$, and observe the next state $s_{t+1}$.
   (b) Update $v_k(s_t, a_t) := v_k(s_t, a_t) + 1$.
   (c) Set $t := t + 1$.

Figure 1: The UCRL2 algorithm.

Footnote 6: Since the policy $\tilde{\pi}_k$ is fixed for episode $k$, $v_k(s,a) \neq 0$ only for $a = \tilde{\pi}_k(s)$. Nevertheless, we find it convenient to use a notation which explicitly includes the action $a$ in $v_k(s,a)$.
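The following compact sketch translates the pseudocode of Figure 1 into runnable form. It is an added illustration under simplifying assumptions, not the authors' reference implementation: the environment interface `env_step`, all function and variable names, the uniform placeholder estimate for unvisited pairs, and the fixed number of value-iteration sweeps are choices made for this example; the confidence radii follow (1) and (2) as reconstructed above, and the inner maximization repeats the idea sketched after Section 2.1.

```python
import numpy as np

def inner_max(p_hat, u, beta):
    """Optimistic transition distribution within L1-distance beta of p_hat (cf. Section 2.1)."""
    p = p_hat.astype(float).copy()
    best = int(np.argmax(u))
    p[best] = min(1.0, p[best] + beta / 2.0)       # put extra mass on the best state
    for s in np.argsort(u):                        # remove mass from low-value states first
        p[s] = max(0.0, p[s] - max(0.0, p.sum() - 1.0))
        if p.sum() <= 1.0:
            break
    return p

def extended_vi(p_hat, r_up, beta_p, iters=200):
    """Approximate extended value iteration for the optimistic policy (Step 5)."""
    S, A = r_up.shape
    u = np.zeros(S)
    for _ in range(iters):                         # fixed sweeps instead of a stopping criterion
        q = np.array([[r_up[s, a] + inner_max(p_hat[s, a], u, beta_p[s, a]) @ u
                       for a in range(A)] for s in range(S)])
        u = q.max(axis=1)
        u -= u.min()                               # keep the value vector bounded
    return q.argmax(axis=1)

def ucrl2(env_step, s0, S, A, T, delta=0.1):
    """env_step(s, a) -> (reward, next_state) is an assumed environment interface."""
    N = np.zeros((S, A)); Rsum = np.zeros((S, A)); P = np.zeros((S, A, S))
    t, s, history = 1, s0, []
    while t <= T:
        tk, vk = t, np.zeros((S, A))               # Steps 1-2
        r_hat = Rsum / np.maximum(1, N)            # Step 3: empirical estimates
        p_hat = np.where(N[:, :, None] > 0,
                         P / np.maximum(1, N)[:, :, None],
                         1.0 / S)                  # uniform placeholder for unvisited pairs
        # Step 4: confidence radii as in (1) and (2).
        beta_r = np.sqrt(3.5 * np.log(2 * S * A * tk / delta) / np.maximum(1, N))
        beta_p = np.sqrt(2 * S * np.log(2 * A * tk / delta) / np.maximum(1, N))
        # Step 5: optimistic planning, with rewards set to their upper confidence values.
        pi = extended_vi(p_hat, np.minimum(1.0, r_hat + beta_r), beta_p)
        # Step 6: play pi until some in-episode count reaches the count prior to the episode.
        while t <= T and vk[s, pi[s]] < max(1, N[s, pi[s]]):
            a = pi[s]
            r, s_next = env_step(s, a)
            vk[s, a] += 1; Rsum[s, a] += r; P[s, a, s_next] += 1
            history.append((s, a, r))
            s, t = s_next, t + 1
        N += vk
    return history
```

A call such as `ucrl2(env_step, s0=0, S=..., A=..., T=10_000)` would then return the trajectory of state-action-reward triples, from which the total regret against $T\rho^*$ can be computed.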
3 Intuition behind the Analysis

Due to space constraints, in the following we provide just some intuition about the main steps of the proof. Details and the formal proof are given in [12].

3.1 The Regret of a Single Episode

The main source of regret is the amount by which the average reward of the optimistically chosen policy is overestimated. To see this, consider a single episode $k$ with its optimistic policy $\tilde{\pi}_k$ which is (almost) optimal for the optimistically assumed MDP $\tilde{M}_k$. Let $\tilde{\rho}_k$ be the average reward of $\tilde{\pi}_k$ in $\tilde{M}_k$. Then by the choice of $\tilde{\pi}_k$ and $\tilde{M}_k$ in Step 5 of the algorithm, and assuming that all confidence intervals hold so that the true MDP $M \in \mathcal{M}_k$,^7
$$\tilde{\rho}_k \;\ge\; \rho^* - \varepsilon_k.$$
Thus the regret $\Delta_k$ of our algorithm during episode $k$ is bounded as
$$\Delta_k := \sum_{t=t_k}^{t_{k+1}-1} (\rho^* - r_t) \;\le\; \sum_{t=t_k}^{t_{k+1}-1} (\tilde{\rho}_k - r_t) + \sum_{t=t_k}^{t_{k+1}-1} \varepsilon_k.$$

Footnote 7: It can be shown that this holds with high probability, so that the regret due to episodes where $M \notin \mathcal{M}_k$ is upper bounded by $\sqrt{T}$ with high probability. See Appendix B.1 of [12] for details.

The sum of the $\varepsilon_k$'s is small and will not be considered any further in this proof sketch.^8 Thus we need to bound the difference between the optimistic average reward $\tilde{\rho}_k$ and the actual rewards $r_t$. Ignoring the random fluctuation of the rewards (which is also relatively small and can be bounded easily^9), the rewards obtained can be written as the sum over the state-action pairs observed during this episode, multiplied by their average rewards,
$$\sum_{t=t_k}^{t_{k+1}-1} r_t \;\approx\; \sum_{(s,a)} v_k(s,a)\, \bar{r}(s,a).$$
Replacing the true mean rewards $\bar{r}(s,a)$ by their optimistic estimates $\tilde{r}_k(s,a)$, we get
$$\sum_{t=t_k}^{t_{k+1}-1} (\tilde{\rho}_k - r_t) \;\approx\; \sum_{(s,a)} v_k(s,a)\bigl(\tilde{\rho}_k - \tilde{r}_k(s,a)\bigr) + \sum_{(s,a)} v_k(s,a)\bigl(\tilde{r}_k(s,a) - \bar{r}(s,a)\bigr). \qquad (3)$$

The second term in (3) is bounded by the widths of the confidence intervals,^10
$$\sum_{(s,a)} v_k(s,a)\bigl(\tilde{r}_k(s,a) - \bar{r}(s,a)\bigr) = \sum_{(s,a)} v_k(s,a)\bigl(\tilde{r}_k(s,a) - \hat{r}_k(s,a)\bigr) + \sum_{(s,a)} v_k(s,a)\bigl(\hat{r}_k(s,a) - \bar{r}(s,a)\bigr) \;\le\; \sum_{(s,a)} v_k(s,a) \cdot 2\sqrt{\frac{3.5\,\log(2SAT/\delta)}{\max\{1, N_k(s,a)\}}}\,. \qquad (4)$$

The first term on the right hand side in (3) would be small if the observed state-action pairs were indeed generated by the assumed optimistic MDP $\tilde{M}_k$ (then the counts $v_k(s,a)$ would be approximately proportional to the stationary distribution^11 of the policy $\tilde{\pi}_k$). But since the observations are generated by $\tilde{\pi}_k$ on the true MDP, bounding this first term is the crucial part of the proof.

Footnote 8: Indeed, the sum over all $\varepsilon_k$ over all $T$ steps is upper bounded by $\sum_k \sum_{s,a} v_k(s,a)/\sqrt{\max\{1, N_k(s,a)\}}$, a term that needs to be analyzed at the end of Section 3.3 further below (which is done in detail in Appendix B.6 of [12]).
Footnote 9: Chernoff bounds show that the respective error is at most of order $\sqrt{T}$ with high probability. See Appendix B.2 of [12] for details.
Footnote 10: Here and in the following, we sloppily do not mention that statements only hold with high probability.
Footnote 11: The stationary distribution of a policy $\pi$ in an MDP $M$ basically gives for each state $s$ the probability of being in state $s$ after an infinite number of steps, when playing $\pi$.

3.2 The Poisson Equation

We proceed by using the Poisson equation (a result from MDP theory, see e.g. [4]) for the optimistic MDP $\tilde{M}_k$ and the policy $\tilde{\pi}_k$. Let $\tilde{P}_k := \bigl(\tilde{p}_k(s'|s, \tilde{\pi}_k(s))\bigr)_{s,s'}$ be the transition matrix of $\tilde{\pi}_k$ on $\tilde{M}_k$, and let $\tilde{r}_k := \bigl(\tilde{r}_k(s, \tilde{\pi}_k(s))\bigr)_s$ be the (column) vector of the optimistic mean rewards. Then the Poisson equation
$$\tilde{h}_k = \tilde{r}_k - \tilde{\rho}_k \mathbf{1} + \tilde{P}_k \tilde{h}_k \qquad (5)$$
defines the bias $\tilde{h}_k(s)$ of state $s$: Intuitively, the difference $\tilde{h}_k(s) - \tilde{h}_k(s')$ measures the advantage (or disadvantage) in the long run of starting in state $s$ instead of state $s'$. While in the long run the average per-step rewards are equal for starting in any state, the advantage measures the difference of the total sums of rewards. Let $v_k := \bigl(v_k(s, \tilde{\pi}_k(s))\bigr)_s$ be the (row) vector of visit counts for each state and the corresponding action chosen by $\tilde{\pi}_k$. Plugging the Poisson equation into the first term on the right hand side of (3) we get
$$\sum_{(s,a)} v_k(s,a)\bigl(\tilde{\rho}_k - \tilde{r}_k(s,a)\bigr) = v_k\bigl(\tilde{P}_k - I\bigr)\tilde{h}_k. \qquad (6)$$

As a main fact, we use in our analysis that any advantage is bounded by the diameter,
$$\tilde{h}_k(s) - \tilde{h}_k(s') \;\le\; D$$
for the chosen policy $\tilde{\pi}_k$ on $\tilde{M}_k$. This follows from the observation that no state can be worse than any other state by more than $D$: Otherwise, if $\tilde{h}_k(s) - D > \tilde{h}_k(s')$, an improved policy that moves from $s'$ to $s$ in at most $D$ steps can be shown to earn more than the average reward of $\tilde{\pi}_k$, which contradicts the optimality of $\tilde{\pi}_k$. Actually, the Poisson equation (5) and hence (6) hold not only for the bias vector $\tilde{h}_k$ but for all vectors of the form $z = \tilde{h}_k + c\mathbf{1}$. Hence, for a suitable vector $z_k$ with $\|z_k\|_\infty \le D/2$ we have^12
$$\sum_{(s,a)} v_k(s,a)\bigl(\tilde{\rho}_k - \tilde{r}_k(s,a)\bigr) = v_k\bigl(\tilde{P}_k - I\bigr) z_k. \qquad (7)$$
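To make the Poisson equation (5) concrete, the following added sketch (an illustration, not part of the original analysis; the helper name and the two-state example are chosen here) computes the gain $\rho$ and a bias vector $h$ for a fixed stationary policy by solving the linear system (5) with the normalization $h(0) = 0$, and compares the resulting advantage with the diameter:

```python
import numpy as np

def gain_and_bias(P, r):
    """Solve the Poisson equation h = r - rho*1 + P h on a unichain MDP, with h[0] = 0."""
    S = len(r)
    lhs = np.zeros((S, S))
    lhs[:, 0] = 1.0                        # column of the unknown rho
    lhs[:, 1:] = (np.eye(S) - P)[:, 1:]    # columns of the unknowns h[1], ..., h[S-1]
    x = np.linalg.solve(lhs, r)            # rows encode (I - P) h + rho * 1 = r
    return x[0], np.concatenate(([0.0], x[1:]))

# Two-state example: the policy flips the state with probability 0.1;
# rewards are 0 in state 0 and 1 in state 1, and the diameter of this chain is 10.
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])
r = np.array([0.0, 1.0])
rho, h = gain_and_bias(P, r)
print(rho)                               # 0.5
print(h[1] - h[0])                       # 5.0: the advantage of starting in state 1
assert np.allclose(h, r - rho + P @ h)   # equation (5) holds
```

In this example the advantage $h(1) - h(0) = 5$ is indeed bounded by the diameter $D = 1/0.1 = 10$, in line with the argument above.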
3.3 Completing the Proof

Now we expand
$$v_k\bigl(\tilde{P}_k - I\bigr) z_k = v_k\bigl(\tilde{P}_k - P_k + P_k - I\bigr) z_k = v_k\bigl(\tilde{P}_k - P_k\bigr) z_k + v_k\bigl(P_k - I\bigr) z_k,$$
where $P_k$ is the transition matrix of the policy $\tilde{\pi}_k$ in the true MDP. The first term can be bounded using the confidence bounds of Step 4 (similar to (4)):
$$v_k\bigl(\tilde{P}_k - P_k\bigr) z_k \;\le\; \bigl\|v_k\bigl(\tilde{P}_k - P_k\bigr)\bigr\|_1 \cdot \|z_k\|_\infty \;\le\; \sum_{s} v_k\bigl(s, \tilde{\pi}_k(s)\bigr) \cdot 2\sqrt{\frac{2S\,\log(2AT/\delta)}{\max\bigl\{1, N_k\bigl(s, \tilde{\pi}_k(s)\bigr)\bigr\}}} \cdot \frac{D}{2}. \qquad (8)$$

The intuition for the second term is that we may interpret $v_k$ as a distribution over the states. Then $v_k P_k$ corresponds to the distribution after making one step according to the transition matrix $P_k$, when starting in the distribution corresponding to $v_k$. As $v_k$ is the vector of state visits under transition matrix $P_k$, this one step can be shown not to make too much difference, i.e. $v_k P_k - v_k$ is "small". Using that, as we have seen above, entries in $z_k$ differ by at most $D$, this yields an upper bound on $v_k(P_k - I)z_k$. To turn these intuitions into a concrete bound, one has to sum up over all episodes to find that (see Appendix B.4 of [12] for details)
$$\sum_k v_k\bigl(P_k - I\bigr) z_k \;\le\; D\sqrt{2T\log T} + D \cdot \#(\text{episodes}).$$
As the number of episodes is logarithmic in $T$ (see Appendix B.5 of [12]), this term is negligible compared to the main term.

Thus, summing up (4), (7), and (8) we find that
$$\Delta_k \;\le\; \mathrm{const} \cdot D\sqrt{S\log(AT/\delta)} \sum_{(s,a)} \frac{v_k(s,a)}{\sqrt{\max\{1, N_k(s,a)\}}}.$$
Finally, summing over the episodes gives (setting $N(s,a) := \sum_k v_k(s,a)$)
$$\sum_k \Delta_k \;\le\; \mathrm{const} \cdot D\sqrt{S\log(AT/\delta)} \cdot \sum_{(s,a)} \sum_k \frac{v_k(s,a)}{\sqrt{\max\{1, N_k(s,a)\}}} \;\le\; \mathrm{const} \cdot D\sqrt{S\log(AT/\delta)} \cdot \sum_{(s,a)} \sqrt{N(s,a)} \;\le\; \mathrm{const} \cdot D\sqrt{S\log(AT/\delta)} \cdot \sqrt{SAT}, \qquad (9)$$
where details for the last two steps can be found in Appendix B.6 of [12]. Noting that Theorem 2 (with $c_1 \ge 1$) holds trivially for $T \le SA$, this completes the proof.

Footnote 12: More precisely, $z_k$ is not defined via the bias vector $\tilde{h}_k$ of $\tilde{\pi}_k$ but via the state value vector of value iteration. However, when value iteration converges, $z_k$ basically converges to the bias vector $\tilde{h}_k$. For details see Lemma 7 in Appendix B.3 of [12].

4 The Lower Bound (Proof Sketch for Theorem 5)

We first consider an MDP with two states, $\mathcal{S} = \{0, 1\}$, and $A$ actions. For each $a \in \mathcal{A}$, let $r(0,a) = 0$, $r(1,a) = 1$, and $p(0|1,a) = \delta$. For all but a single action $a^* \in \mathcal{A}$ let $p(1|0,a) = \delta$, whereas $p(1|0,a^*) = \delta + \varepsilon$ with $\varepsilon < \delta$. The diameter of this MDP is $D = 1/\delta$. The average reward of a policy which chooses action $a^*$ in state 0 is $\frac{\delta+\varepsilon}{2\delta+\varepsilon} > \frac{1}{2}$, while the average reward of any other policy is $\frac{1}{2}$. Thus the regret of a suboptimal policy after $T$ steps is $\Omega(\varepsilon T/\delta)$. To detect the better action $a^*$ reliably, any action in state 0 needs to be probed $\Omega(\delta/\varepsilon^2)$ times.

If we consider $k := S/2$ copies of this MDP, where only one copy has such a better action $a^*$, then in each 0-state all actions need to be probed $\Omega(\delta/\varepsilon^2)$ times. Setting $\varepsilon = \sqrt{\delta k A/T}$, this takes $\Omega(T)$ time and yields $\Omega\bigl(\sqrt{kAT/\delta}\bigr) = \Omega\bigl(\sqrt{DSAT}\bigr)$ regret (for details see Appendix E of [12]). Finally, connecting the 0-states of the $k$ copies by $A + 1$ additional deterministic actions per state gives an MDP with diameter $D + 2\log_A S$ (e.g. by using an $A$-ary tree), and the theorem follows.
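As an added numerical check of the two-state construction above (the helper names and the concrete values of $\delta$ and $\varepsilon$ are illustrative choices, not from the paper), the following sketch builds the MDP and confirms the two average rewards used in the argument:

```python
import numpy as np

def lower_bound_mdp(A, delta, eps, a_star=0):
    """Two-state MDP of the lower-bound construction; state 0 pays 0, state 1 pays 1."""
    P = np.zeros((2, A, 2)); R = np.zeros((2, A))
    R[1, :] = 1.0
    P[1, :, 0] = delta;   P[1, :, 1] = 1.0 - delta   # leave state 1 with probability delta
    P[0, :, 1] = delta;   P[0, :, 0] = 1.0 - delta   # ordinary actions in state 0
    P[0, a_star, 1] = delta + eps                    # the single slightly better action
    P[0, a_star, 0] = 1.0 - delta - eps
    return P, R

def average_reward(P, a):
    """Average reward of the policy choosing action a in state 0 (any action in state 1)."""
    up, down = P[0, a, 1], P[1, a, 0]
    return up / (up + down)          # stationary probability of state 1 = average reward

P, R = lower_bound_mdp(A=3, delta=0.1, eps=0.02)
print(average_reward(P, 0))          # (delta + eps) / (2*delta + eps) = 0.5454...
print(average_reward(P, 1))          # 1/2 for every other action
```

Choosing $a^*$ in state 0 yields $(\delta+\varepsilon)/(2\delta+\varepsilon)$, while every other choice yields $1/2$, matching the values used in the regret argument above.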
5 Open Problems

There is still a gap between the upper bound on the regret of Theorem 2 and the lower bound of Theorem 5. We conjecture that the lower bound gives the right exponents for the parameters $S$ and $D$.

Acknowledgments

This work was supported in part by the Austrian Science Fund FWF (S9104-N13 SP4). The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreements no. 216886 and no. 216529. This publication only reflects the authors' views.

References

[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[2] Michael J. Kearns and Satinder P. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Proc. 11th NIPS. MIT Press, 1999.
[3] Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Mach. Learn., 49:209-232, 2002.
[4] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.
[5] Peter Auer and Ronald Ortner. Logarithmic online regret bounds for reinforcement learning. In Proc. 19th NIPS, pages 49-56. MIT Press, 2006.
[6] Ronen I. Brafman and Moshe Tennenholtz. R-max: a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213-231, 2002.
[7] Ambuj Tewari and Peter Bartlett. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1505-1512. MIT Press, Cambridge, MA, 2008.
[8] Sham M. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.
[9] Alexander L. Strehl and Michael L. Littman. A theoretical analysis of model-based interval estimation. In Proc. 22nd ICML, pages 857-864, 2005.
[10] Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Preprint, 2006.
[11] Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for Markov decision processes. Math. Oper. Res., 22(1):222-255, 1997.
[12] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. http://institute.unileoben.ac.at/infotech/publications/ucrl2.pdf.
[13] Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Experts in a Markov decision process. In Proc. 17th NIPS, pages 401-408. MIT Press, 2004.