The Many Faces of Optimism: a Unifying Approach

István Szita, András Lőrincz
Eötvös Loránd University, 1117 Pázmány P. sétány 1/C, Budapest, Hungary
szityu@gmail.com, andras.lorincz@elte.hu

Abstract

The exploration-exploitation dilemma has been an intriguing and unsolved problem within the framework of reinforcement learning. "Optimism in the face of uncertainty" and model building play central roles in advanced exploration methods. Here, we integrate several concepts and obtain a fast and simple algorithm. We show that the proposed algorithm finds a near-optimal policy in polynomial time, and give experimental evidence that it is robust and efficient compared to its ascendants.

1. Introduction

Reinforcement learning (RL) is the art of maximizing long-term rewards in a stochastic, unknown environment. In the construction of RL algorithms, the choice of exploration strategy is of central significance. We shall examine the problem of exploration in the Markov decision process (MDP) framework. While simple methods like ε-greedy and Boltzmann exploration are commonly used, it is known that their behavior can be extremely poor (Koenig & Simmons, 1993). Recently, a number of efficient exploration algorithms have been published, and for some of them, formal proofs of efficiency also exist. We review these methods in Section 2. By combining ideas from several sources, we construct a new algorithm for efficient exploration. The new algorithm, optimistic initial model (OIM), is described in Section 3. In Section 4, we show that many of the advanced algorithms, including ours, can be treated in a unified way. We use this fact to sketch a proof that OIM finds a near-optimal policy in polynomial time with high probability. Section 5 provides an experimental comparison between OIM and a number of other methods on some benchmark problems. Our results are summarized in Section 6. In the rest of this section, we review the necessary preliminaries: Markov decision processes and the exploration task.

1.1. Markov Decision Processes (MDPs)

Markov decision processes are the standard framework for RL and the basis of numerous extensions (like continuous MDPs, partially observable MDPs or factored MDPs). An MDP is characterized by a quintuple (X, A, R, P, γ), where X is a finite set of states; A is a finite set of possible actions; R is the reward distribution over X × A × X, with R(x, a, y) also denoting its mean value for the transition (x, a, y); P : X × A × X → [0, 1] is the transition function; and finally, γ ∈ [0, 1) is the discount rate on future rewards. We shall assume that all rewards are nonnegative and bounded from above by R⁰max.

A (stationary) policy of the agent is a mapping π : X × A → [0, 1]. For any x_0 ∈ X, the policy of the agent and the parameters of the MDP determine a stochastic process experienced by the agent through the instantiation x_0, a_0, r_0, x_1, a_1, r_1, ..., x_t, a_t, r_t, ... The goal is to find a policy that maximizes the expected value of the discounted total reward. Let us define the state-action value function (value function for short) of π as

  Q^π(x, a) := E[ Σ_{t=0}^∞ γ^t r_t | x_0 = x, a_0 = a ]

and the optimal value function as

  Q*(x, a) := max_π Q^π(x, a)

for each (x, a) ∈ X × A. Let the greedy action at x w.r.t. a value function Q be a_x^Q := argmax_a Q(x, a). The greedy policy of Q deterministically takes the greedy action in each state.
It is well known that the greedy policy of Q* is an optimal policy and that Q* satisfies the Bellman equations:

  Q*(x, a) = Σ_y P(x, a, y) [ R(x, a, y) + γ Q*(y, a_y^{Q*}) ].

1.2. The Exploration Problem

In the classical reinforcement learning setting, it is assumed that the environment can be modelled as an MDP, but its parameters (that is, P and R) are unknown to the agent, and she has to collect information by interacting with the environment. If too little time is spent on exploring the environment, the agent will get stuck with a suboptimal policy, without knowing that there exists a better one. On the other hand, the agent should not spend too much time visiting areas with low rewards and/or accurately known parameters. What is the optimal balance between exploring and exploiting the acquired knowledge, and how could the agent concentrate her exploration efforts? These questions are central for RL. It is known that the optimal exploration policy in an MDP is non-Markovian, and can be computed only for very simple tasks like k-armed bandit problems.

2. Related Literature

Here we give a short review of some of the most important exploration methods and their properties.

2.1. ε-greedy and Boltzmann Exploration

The most popular exploration method is ε-greedy action selection. The method works without a model; only an approximation of the action value function Q(x, a) is needed. The agent in state x selects the greedy action a_x^Q or an explorative move with a random action, with probabilities 1 − ε and ε, respectively. Sooner or later, all paths with nonzero probability will have been visited many times, so a suitable learning algorithm can learn to choose the optimal path. It is known, for example, that Q-learning with nonzero exploration converges to the optimal value function with probability 1 (Littman & Szepesvári, 1996), and so does SARSA (Singh et al., 2000), if the exploration rate diminishes according to an appropriate schedule. Boltzmann exploration selects actions as follows: the probability of choosing action a in state x is exp(Q(x, a)/T) / Σ_{a'∈A} exp(Q(x, a')/T), where the `temperature' T (> 0) regulates the amount of explorative actions. Convergence results of the ε-greedy method carry through to this case. Unfortunately, for both the ε-greedy and the Boltzmann method, exploration time may scale exponentially with the number of states (Koenig & Simmons, 1993).
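As an illustration, the following minimal Python sketch implements both action-selection rules for a tabular value function Q(x, a); the Q-table and the parameter values are arbitrary and serve only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, x, epsilon, rng=rng):
    """Greedy action for state x with prob. 1 - epsilon, a uniformly random action otherwise."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[x]))

def boltzmann(Q, x, temperature, rng=rng):
    """Sample an action with probability proportional to exp(Q(x, a) / T)."""
    prefs = Q[x] / temperature
    prefs = prefs - prefs.max()        # numerical stability; does not change the distribution
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Tiny usage example on an arbitrary 3-state, 2-action Q-table.
Q = np.array([[1.0, 0.5], [0.2, 0.9], [0.0, 0.0]])
a1 = epsilon_greedy(Q, x=0, epsilon=0.1)
a2 = boltzmann(Q, x=1, temperature=0.5)
```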
2.2. Optimistic Initial Values (OIV)

One may boost exploration with a simple trick: the initial value of each state-action pair can be set to some overwhelmingly high number. If a state x is visited often, then its estimated value will become more exact, and therefore lower. Thus, the agent will try to reach the more rarely visited areas, where the estimated state values are still high. This method, called `exploring starts' or `optimistic initial values', is a popular exploration heuristic (Sutton & Barto, 1998), sometimes combined with others, e.g., the ε-greedy exploration method. Recently, Even-Dar and Mansour (2001) gave theoretical justification for the method: they proved that if the optimistic initial values are sufficiently high, Q-learning converges to a near-optimal solution. One apparent disadvantage of OIV is that if the initial estimates are too high, then it takes a long time to correct them.

2.3. Bayesian Methods

We may assume that the MDP (with the unknown values of P and R) is drawn from a parameterized distribution M_0. From the collected experience and the prior distribution M_0, we can calculate successive posterior distributions M_t, t = 1, 2, ..., by Bayes' rule. Furthermore, we can calculate (at least in principle) the policy that minimizes the uncertainty of the parameters (Strens, 2000). Dearden (2000) approximates the distribution of state values directly. Exact computation of the optimal exploration policy is infeasible, and Bayesian methods are computationally demanding even with simplifying assumptions about the distributions, e.g., the independence of certain parameters.

2.4. Confidence Interval Estimation

Confidence interval estimation lies between Bayesian exploration and OIV. The method assumes that each state value is drawn from an independent Gaussian distribution and computes confidence intervals for the state values. The agent chooses the action with the highest upper confidence bound. Initially, all confidence intervals are very wide, and they shrink gradually towards the true state values. Therefore, the behavior of the technique is similar to OIV. The IEQL+ method of Meuleau and Bourgine (1999) directly estimates confidence intervals of Q-values, while Wiering and Schmidhuber (1998) calculate confidence intervals for P and R, and obtain Q-value bounds indirectly. Strehl and Littman (2006) improve the method and prove a polynomial-time convergence bound. Both algorithms are called model-based interval estimation; to avoid confusion, we will refer to them as MBIE(WS) and MBIE(SL). Auer and Ortner (2006) give a confidence interval-based algorithm for which the online regret is only logarithmic in the number of steps taken.

2.5. Exploration Bonus Methods

The agent can be directed towards less-known parts of the state space by artificially increasing the value of `interesting' states with bonuses. States can be interesting because of their frequency, recency, error, etc. (Meuleau & Bourgine, 1999; Wiering & Schmidhuber, 1998). The balance of exploration and exploitation is usually set by a scaling factor κ, so that the total immediate reward of the agent at time t is r_t + κ·b_t(x_t, a_t, x_{t+1}), where b_t is one of the above listed bonuses. The bonuses are calculated by the agent and act as intrinsic motivating forces. Exploration bonuses for a state can vary swiftly, and model-based algorithms (like prioritized sweeping or Dyna) are used for spreading the changes effectively. Alas, the weight of exploration needs to be annealed according to a suitable schedule.

Alternatively, the agent may learn two value functions separately: a regular one, Q^r_t, which is based on the rewards r_t received from the environment, and an exploration value function Q^e_t, which is based on the exploration bonuses. The agent's policy will be greedy with respect to their combination Q^r_t + κ Q^e_t. The exploration mechanism may then remain the same, but several advantages appear. First of all, changes in κ take effect immediately; as an example, we can immediately switch off exploration by setting κ to 0. Furthermore, Q^r_t may converge even if Q^e_t does not. Confidence interval estimation can be phrased as an exploration bonus method: see IEQL+ (Meuleau & Bourgine, 1999) or MBIE-EB (Strehl & Littman, 2006). Even-Dar and Mansour (2001) have shown that ε-greedy and Boltzmann exploration can also be formulated as exploration bonus methods, although their rewards are not propagated through the Bellman equations.
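A minimal sketch of the two-value-function variant described above: the action is greedy w.r.t. Q^r + κQ^e, and setting κ = 0 switches exploration off immediately. The tables and the value of κ (here `kappa`) are illustrative.

```python
import numpy as np

def combined_greedy_action(Q_r, Q_e, x, kappa):
    """Greedy action w.r.t. Q_r(x, .) + kappa * Q_e(x, .).

    Q_r holds values learned from external rewards, Q_e holds exploration
    values learned from the bonuses; kappa scales the weight of exploration.
    """
    return int(np.argmax(Q_r[x] + kappa * Q_e[x]))

# Hypothetical tables for a 2-state, 2-action problem.
Q_r = np.array([[1.0, 0.2], [0.3, 0.4]])
Q_e = np.array([[0.0, 5.0], [4.0, 0.0]])
print(combined_greedy_action(Q_r, Q_e, x=0, kappa=1.0))  # explores (action 1)
print(combined_greedy_action(Q_r, Q_e, x=0, kappa=0.0))  # exploits (action 0)
```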
2.6. E³ and R-max

The Explicit Explore or Exploit (E³) algorithm of Kearns and Singh (1998) and its successor, R-max (Brafman & Tennenholtz, 2001), were the first algorithms with polynomial time bounds for finding near-optimal policies. R-max collects statistics about transitions and rewards. When the visits to a state enable high-precision estimates of the true transition probabilities and rewards, that state is declared known. R-max also maintains an approximate model of the environment. Initially, the model assumes that all actions in all states lead to a (hypothetical) maximum-reward absorbing state. The model is updated each time a state becomes known. The optimal policy of the model either is a near-optimal policy in the real environment, or it enters a not-yet-known state and collects new information.

3. Construction of the Algorithm

Our agent starts with a simple but overly optimistic model. By collecting new experience, she updates her model, which becomes more realistic. The value function is computed over the approximate model with (asynchronous) dynamic programming. The agent always chooses her action greedily w.r.t. her value function. Exploration is induced by the optimism of the model: unknown areas are believed to yield large rewards. The algorithmic components are detailed below.

Separate exploration values. Similarly to the approach of Meuleau and Bourgine (1999), we shall separate the `true' state values from the exploration values. Formally, the value function has the form Q(x, a) = Q^r(x, a) + Q^e(x, a) for all (x, a) ∈ X × A, where Q^r and Q^e will summarize the external and the exploration rewards, respectively.

`Garden of Eden' state. Similarly to R-max, we introduce a new hypothetical `garden of Eden' state x_E and work with the extended state space X ∪ {x_E}. Once there, according to the agent's model, the agent remains in x_E indefinitely and receives Rmax reward for every step, which may exceed R⁰max := max_{x,a,y} R(x, a, y), the maximal reward of the original environment.

Model approximation. The agent builds an approximate model of the environment. For each x, y ∈ X and a ∈ A, let N_t(x, a), N_t(x, a, y), and C_t(x, a, y) denote the number of times a was selected in x up to step t, the number of times the transition from x to y under a was experienced, and the sum of external rewards over those transitions, respectively. With these notations, the approximate model parameters are

  P̂_t(x, a, y) = N_t(x, a, y) / N_t(x, a)  and  R̂_t(x, a, y) = C_t(x, a, y) / N_t(x, a, y).

Suitable initialization of N_t(x, a), N_t(x, a, y) and C_t(x, a, y) will ensure that the ratios are well defined everywhere. The exploration rewards are defined as

  R^e(x, a, y) := Rmax, if y = x_E;  R^e(x, a, y) := 0, if y ≠ x_E,

for each x, y ∈ X ∪ {x_E}, a ∈ A, and they are not modified during the course of learning.

Optimistic initial model. The initial model assumes that x_E has been reached once from each state-action pair: for each x ∈ X ∪ {x_E}, y ∈ X and a ∈ A,

  N_0(x, a) = 1, N_0(x, a, y) = 0, C_0(x, a, y) = 0, N_0(x, a, x_E) = 1, C_0(x, a, x_E) = 0.

Then the optimal initial value function equals Q_0(x, a) = Q^r_0(x, a) + Q^e_0(x, a) = 0 + Rmax/(1 − γ) = Vmax for each (x, a) ∈ X × A, analogously to OIV, where Vmax := Rmax/(1 − γ).
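A minimal Python sketch of the counters and model estimates above (array names are ours; the last state index plays the role of x_E):

```python
import numpy as np

def init_optimistic_model(n_states, n_actions):
    """Counters of the optimistic model; state index n_states stands for x_E.

    Every (x, a) starts with one fictitious visit that leads to x_E, so the
    initial model sends all state-action pairs to the max-reward absorbing state.
    """
    n_ext = n_states + 1                          # extended state space X ∪ {x_E}
    N_xa = np.ones((n_ext, n_actions))            # N(x, a) = 1
    N_xay = np.zeros((n_ext, n_actions, n_ext))
    N_xay[:, :, n_states] = 1.0                   # N(x, a, x_E) = 1
    C_xay = np.zeros((n_ext, n_actions, n_ext))   # summed external rewards C(x, a, y) = 0
    return N_xa, N_xay, C_xay

def p_hat(N_xa, N_xay, x, a, y):
    """Empirical transition probability P̂(x, a, y) = N(x, a, y) / N(x, a)."""
    return N_xay[x, a, y] / N_xa[x, a]

def r_hat(N_xay, C_xay, x, a, y):
    """Empirical mean reward R̂(x, a, y) = C(x, a, y) / N(x, a, y) (0 if unseen)."""
    return C_xay[x, a, y] / N_xay[x, a, y] if N_xay[x, a, y] > 0 else 0.0

def record_transition(N_xa, N_xay, C_xay, x, a, r, y):
    """Update the counters after observing transition (x, a, r, y)."""
    N_xa[x, a] += 1
    N_xay[x, a, y] += 1
    C_xay[x, a, y] += r

# Usage: initially all probability mass is on x_E.
N_xa, N_xay, C_xay = init_optimistic_model(n_states=2, n_actions=2)
record_transition(N_xa, N_xay, C_xay, x=0, a=1, r=0.5, y=1)
print(p_hat(N_xa, N_xay, 0, 1, 1), p_hat(N_xa, N_xay, 0, 1, 2))  # 0.5 to y=1, 0.5 to x_E
```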
Dynamic programming. Both value functions can be updated using the approximate model. For each x ∈ X, let a_x be the greedy action according to the combined value function, i.e.,

  a_x := argmax_{a∈A} [ Q^r(x, a) + Q^e(x, a) ].

The dynamic programming equations for the value function components are

  Q^r_{t+1}(x, a) := Σ_{y∈X} P̂_t(x, a, y) [ R̂_t(x, a, y) + γ Q^r_t(y, a_y) ],
  Q^e_{t+1}(x, a) := Σ_{y∈X} P̂_t(x, a, y) γ Q^e_t(y, a_y) + P̂_t(x, a, x_E) Vmax.

Episodic tasks can be handled in the usual way: we introduce an absorbing final state with 0 external reward.

Asynchronous update. The algorithm can be online if, instead of full update sweeps over the state space, updates are limited to a state set L_t in the `neighborhood' of the agent's current state. The neighborhood is restricted by computation time constraints; any asynchronous dynamic programming algorithm suffices. It is implicitly assumed that the current state is always updated, i.e., x_t ∈ L_t. In this paper, we used the improved prioritized sweeping algorithm of Wiering and Schmidhuber (1998).

Putting it all together. The method is summarized as Algorithm 1.

Algorithm 1  The optimistic initial model (OIM) algorithm
  Input: initial state x_0 ∈ X, required precision ε > 0, optimism parameter Rmax
  Model initialization: t := 0; for all x, y ∈ X, a ∈ A:
    N(x, a, y) := 0, N(x, a, x_E) := 1, N(x, a) := 1, C(x, a, y) := 0,
    Q^r(x, a) := 0, Q^e(x, a) := Rmax/(1 − γ)
  repeat
    a_t := greedy action w.r.t. Q^r + Q^e
    apply a_t and observe r_t and x_{t+1}
    C(x_t, a_t, x_{t+1}) := C(x_t, a_t, x_{t+1}) + r_t
    N(x_t, a_t, x_{t+1}) := N(x_t, a_t, x_{t+1}) + 1
    N(x_t, a_t) := N(x_t, a_t) + 1
    L_t := list of states to be updated
    for each x ∈ L_t and a ∈ A do
      Q^r_{t+1}(x, a) := Σ_{y∈X} P̂(x, a, y) [ R̂(x, a, y) + γ Q^r_t(y, a_y) ]
      Q^e_{t+1}(x, a) := P̂(x, a, x_E) Rmax/(1 − γ) + Σ_{y∈X} P̂(x, a, y) γ Q^e_t(y, a_y)
    end for
    t := t + 1
  until the Bellman error drops below ε
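A self-contained Python sketch of the value-update step of Algorithm 1, written as a synchronous sweep over a given state set rather than prioritized sweeping; variable names are ours, and the counter layout is the one from the earlier sketch, with the last state index playing the role of x_E.

```python
import numpy as np

def oim_dp_sweep(N_xa, N_xay, C_xay, Q_r, Q_e, gamma, R_max, states_to_update):
    """One sweep of the OIM value updates over `states_to_update`.

    N_xa, N_xay, C_xay are the visit/reward counters of the optimistic model
    (last state index = the garden-of-Eden state x_E); Q_r and Q_e are the
    external-reward and exploration value tables over the ordinary states.
    """
    n_states, n_actions = Q_r.shape
    E = n_states                                    # index of x_E in the counters
    V_max = R_max / (1.0 - gamma)
    a_greedy = np.argmax(Q_r + Q_e, axis=1)         # greedy actions w.r.t. Q_r + Q_e
    Q_r_next, Q_e_next = Q_r.copy(), Q_e.copy()
    for x in states_to_update:
        for a in range(n_actions):
            p = N_xay[x, a, :n_states] / N_xa[x, a]            # P̂(x, a, y), y in X
            with np.errstate(invalid="ignore", divide="ignore"):
                r = np.where(N_xay[x, a, :n_states] > 0,
                             C_xay[x, a, :n_states] / N_xay[x, a, :n_states], 0.0)
            v_r = Q_r[np.arange(n_states), a_greedy]           # Q_r(y, a_y)
            v_e = Q_e[np.arange(n_states), a_greedy]           # Q_e(y, a_y)
            Q_r_next[x, a] = np.sum(p * (r + gamma * v_r))
            Q_e_next[x, a] = (N_xay[x, a, E] / N_xa[x, a]) * V_max + gamma * np.sum(p * v_e)
    return Q_r_next, Q_e_next

# Example: a 2-state, 2-action problem right after initialization (all mass still on x_E),
# with R_max = 10 and gamma = 0.9; the sweep leaves Q_e at Rmax/(1 - gamma) as expected.
N_xa = np.ones((3, 2))
N_xay = np.zeros((3, 2, 3)); N_xay[:, :, 2] = 1.0
C_xay = np.zeros((3, 2, 3))
Q_r = np.zeros((2, 2))
Q_e = np.full((2, 2), 10.0 / (1 - 0.9))
Q_r, Q_e = oim_dp_sweep(N_xa, N_xay, C_xay, Q_r, Q_e, gamma=0.9, R_max=10.0,
                        states_to_update=range(2))
```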
4. Analysis

In the first part of this section, we analyze the similarities and differences between various exploration methods, with an emphasis on OIM. Based on this analysis, we sketch the proof that OIM finds a near-optimal policy in polynomial time. Details of the proof can be found in (Szita & Lőrincz, 2008).

4.1. Relationship to Other Methods

`Optimism in the face of uncertainty' is a common point of exploration methods: the agent believes that she can obtain extra rewards by reaching the unexplored parts of the state space. Note that, as far as the combined value function Q is concerned, OIM is an asynchronous dynamic programming method augmented with model approximation.

Optimistic initial values. Apparently, OIM is the model-based extension of the OIV heuristic. Note, however, that optimistic initialization of Q-values is not effective together with a model: the more updates are made, the less effect the initialization has, and its effect fully diminishes if value iteration is run until convergence. Therefore, a naive combination of OIV and model construction is contradictory: the number of DP updates should be kept low in order to save the initial boost, but it should be as high as possible in order to propagate the real rewards quickly. OIM resolves this paradox by moving the optimism into the model. The optimal value function of the initial model is Q_0 ≡ Vmax, corresponding to OIV. However, DP updates cannot lower the exploration boost; only model updates may do so. Note that we can set the initial model value as high as we like, but we do not have to wait until the initial boost diminishes, because Q^r and Q^e are separated.

R-max. The `garden of Eden' state x_E of OIM is identical to the fictitious max-reward absorbing state of R-max (and E³). In both cases, the agent's model tells her that all unexplored (x, a) pairs lead to x_E. R-max, however, updates the model only when the transition probabilities and rewards are known with high precision, which happens only after many visits to (x, a). In contrast, OIM updates the model after each single visit, employing each bit of experience as soon as it is obtained. As a result, the approximate model can be used long before it becomes accurate.

Exploration bonus methods. The extra reward offered by the garden of Eden state can be understood as an exploration bonus: for each visit of the pair (x, a), the agent gets the bonus b_t(x, a) = (Vmax − Q_t(x, a)) / N_t(x, a). It is insightful to contrast this formula with those of the other methods, like the frequency-based bonus b_t = −κ·N_t(x, a) or the error-based bonus b_t = κ·(Q_{t+1}(x, a) − Q_t(x, a)).

Model-based interval exploration. The exploration bonus form of the MBIE method of Strehl and Littman (2005), MBIE-EB, sets b_t = β/√(N_t(x, a)). MBIE-EB is not an ad-hoc method: the form of the bonus comes from confidence interval estimation. The comparison to MBIE-EB will be especially valuable, as it converges in polynomial time and the proof can be transported to OIM with slight modifications.

4.2. Polynomial-time Convergence

Theorem 4.1 There exists a constant C so that for any ε > 0 and δ > 0, with

  ε_1 := ε/4,  ε_2 := ε_1(1 − γ)²/6,  H := (1/(1 − γ)) ln( R⁰max / (ε_1(1 − γ)) ),  K := C |X|⁵ |A| H⁴ / ε_1⁴ · ln(1/δ),

OIM converges almost surely to a near-optimal policy in polynomial time if started with

  Rmax ≥ 3/(ε_2(1 − γ)) · (R⁰max)² · ln(K R⁰max);

that is, with probability 1 − δ, the number of timesteps where Q^{π_OIM}(x_t, a_t) > Q*(x_t, a_t) − ε does not hold (π_OIM denoting the policy followed by OIM) is

  O( (|X| H / ε_1)⁵ |A| (R⁰max)⁷ / (1 − γ)⁵ · ln²(1/δ) ).

For the sketch of the proof, we shall follow the technique of Kearns and Singh (2002) and Strehl and Littman (2006), and will use the shorthands [KS] and [SL] for referring to them. See (Szita & Lőrincz, 2008) for the detailed proof with a slightly better polynomial bound. A pair (x, a) is declared known if it has been visited at least m = C (|X| H / ε_1)⁴ (R⁰max)⁶ ln(1/δ) times, with a suitable constant C. OIM preserves the optimism of the value function:

Lemma 4.2 Let Q_t be the sequence of Q-functions generated by OIM. Then it holds with probability 1 − δ/2 that for any t, Q_t(x, a) ≥ Q*(x, a) − ε_1.

Proof: According to [SL], with probability 1 − δ/2,

  Σ_y P̂_t(x, a, y) [ R̂_t(x, a, y) + γ V*(y) ] ≥ Q*(x, a) − β/√(N_t(x, a)),   (1)

where β := R⁰max/(1 − γ) · √( ln(2 |X| |A| m / δ) / 2 ). We will show that

  Rmax / (N_t(x, a)(1 − γ)) + ε_2 ≥ β/√(N_t(x, a)).   (2)

For N_t(x, a) ≤ ( Rmax / ((1 − γ) β) )², the first term of the left-hand side already dominates the right-hand side, so we can omit the second term (and prove the stricter inequality). If the relation is reversed, then the first term can be omitted. In both cases, we arrive at the requirement Rmax ≥ 3/(ε_2(1 − γ)) · (R⁰max)² · ln(K R⁰max), which holds by assumption.

At step t, a number of DP updates are carried out. We proceed by induction on the number of DP updates. Initially, Q^(0)(x, a) ≥ Q*(x, a) − ε_1; then

  Q^(i+1)(x, a) = Σ_y P̂_t(x, a, y) [ R̂_t(x, a, y) + γ V^(i)(y) ] + Vmax/N_t(x, a)
    ≥ Σ_y P̂_t(x, a, y) [ R̂_t(x, a, y) + γ (V*(y) − ε_1) ] + Vmax/N_t(x, a)
    ≥ Q*(x, a) − β/√(N_t(x, a)) − γ ε_1 + Vmax/N_t(x, a)
    ≥ Q*(x, a) − γ ε_1 − ε_2
    ≥ Q*(x, a) − ε_1,

where we applied (1), (2), the induction assumption and the definition of ε_2.

Let M denote the true (and unknown) MDP, let M̂ be the approximate model of OIM, and define M̄ so that it is identical to M for the known pairs and equals M̂ for the unknown pairs.
The parameters of M̄ and M̂ are nearly identical: if (x, a) becomes known, then the local values of P̂ and R̂ are O( ε_1/(|X| H R⁰max)² )-approximations of P and R with probability 1 − δ/2 (Lemma 5 of [KS]). Therefore, by Lemma 4 of [KS],

  | Q*_{M̂}(x, a) − Q*_{M̄}(x, a) | < ε_1.   (3)

Define the H-step truncated value function of policy π as Q^π(x, a, H) := E[ Σ_{t=0}^{H} γ^t r_t | x_0 = x, a_0 = a ]. According to Lemma 2 of [KS], H = (1/(1 − γ)) ln( R⁰max/(ε_1(1 − γ)) ) is an ε_1-horizon time, i.e., |Q^π_M(x, a, H) − Q^π_M(x, a)| < ε_1 for any (x, a), any policy π, and any MDP M with discount factor γ.

Consider a state-action pair (x_1, a_1) and an H-step long trajectory generated by π. Let A_M be the event that an unknown pair (x, a) is encountered along the trajectory. Then, by Lemma 3 of [SL],

  Q^π_M(x_1, a_1) ≥ Q^π_{M̄}(x_1, a_1) − Vmax Pr(A_M).   (4)

To conclude the proof, we separate two cases (following the line of thought of Theorem 1 in [SL]). Let V⁰max := R⁰max/(1 − γ). In the first case, an exploration step will occur with high probability: suppose that Pr(A_M) > ε_1/V⁰max, that is, an unknown pair (x, a) is visited within H steps with high probability. This can happen at most m |X| |A| times, so by the Hoeffding-Azuma bound, with probability 1 − δ/2, all (x, a) will become known after O( m |X| |A| H V⁰max/ε_1 · ln(1/δ) ) exploration steps. On the other hand, if Pr(A_M) ≤ ε_1/V⁰max, then the policy is near-optimal with probability 1 − δ:

  Q^{π_OIM}_M(x_1, a_1) ≥ Q^{π_OIM}_M(x_1, a_1, H)
    ≥ Q^{π_OIM}_{M̄}(x_1, a_1, H) − V⁰max Pr(A_M)
    ≥ Q^{π_OIM}_{M̄}(x_1, a_1, H) − ε_1
    ≥ Q^{π_OIM}_{M̄}(x_1, a_1) − 2ε_1
    ≥ Q*_{M̂}(x_1, a_1) − 3ε_1
    ≥ Q*_M(x_1, a_1) − 4ε_1
    = Q*(x_1, a_1) − ε,

where we applied (in this order) the property that truncation decreases the value function; Eq. (4); our assumption; the ε_1-horizon property of H; Eq. (3); Lemma 4.2; and the definition of ε_1.

5. Experiments

To assess the practical utility of OIM, we compared its performance to other exploration methods. Experiments were run on several small benchmark tasks that challenge exploration algorithms. For fair comparison, the benchmark problems were taken from the literature without changes; nor did we change the experimental settings or the presentation of the experimental data. This also means that the presentation format varies for the different benchmarks.

5.1. RiverSwim and SixArms

The first two benchmark problems, RiverSwim and SixArms, were taken from Strehl and Littman (2006). The RiverSwim MDP has 6 states, representing the position of the agent in a river. The agent has two possible actions: she can swim either upstream or downstream. Swimming down is always successful, but swimming up succeeds only with a 30% chance, and there is a 10% chance of slipping down. The lowermost position yields +5 reward per step, while the uppermost position yields +10000. The SixArms MDP consists of a central state and six `payoff states'. In the central state, the agent can play 6 one-armed bandits. If she pulls arm k and wins, she is transferred to payoff state k. Here, she can collect a reward in each step, if she chooses the appropriate action. The winning probabilities range from 1 to 0.01, while the rewards range from 50 to 6000 (for the exact values, see Strehl & Littman, 2006).

Data for E³, R-max, MBIE(SL) and MBIE-EB are taken from Strehl and Littman (2006). Parameters of all four algorithms were chosen optimally. Following a coarse search in parameter space, the Rmax parameter of OIM was set to 2000 for RiverSwim and to 10000 for SixArms. The state spaces are small, so full value iteration (instead of prioritized sweeping) was completed in each step. On both problems, each algorithm ran for 5000 time steps and the undiscounted total reward was recorded. The averages and 95% confidence intervals, calculated over 1000 test runs, are shown in Tables 1 and 2.

Table 1. Results on the RiverSwim task.
  Method     Cumulative reward
  E³         3.020·10⁶ ± 0.027·10⁶
  R-max      3.014·10⁶ ± 0.039·10⁶
  MBIE(SL)   3.168·10⁶ ± 0.023·10⁶
  MBIE-EB    3.093·10⁶ ± 0.023·10⁶
  OIM        3.201·10⁶ ± 0.016·10⁶

Table 2. Results on the SixArms task.
  Method     Cumulative reward
  E³          1.623·10⁶ ± 0.244·10⁶
  R-max       2.819·10⁶ ± 0.256·10⁶
  MBIE(SL)    9.205·10⁶ ± 0.559·10⁶
  MBIE-EB     9.486·10⁶ ± 0.587·10⁶
  OIM        10.007·10⁶ ± 0.654·10⁶
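The following sketch builds a RiverSwim-like environment from the verbal description above. The exact transition and reward parameterization of Strehl and Littman (2006) differs in details (there, rewards are tied to particular state-action pairs), so this is only an illustration, not a re-implementation of the benchmark.

```python
import numpy as np

def riverswim_step(x, a, rng, n_states=6):
    """One step of a RiverSwim-like MDP, following the verbal description in the text.

    Actions: 0 = swim downstream (always succeeds), 1 = swim upstream
    (succeeds with prob. 0.3, slips one state down with prob. 0.1, stays otherwise).
    State 0 pays +5 per step, state n_states - 1 pays +10000.
    """
    if a == 0:                                  # downstream
        y = max(x - 1, 0)
    else:                                       # upstream
        u = rng.random()
        if u < 0.3:
            y = min(x + 1, n_states - 1)
        elif u < 0.4:
            y = max(x - 1, 0)
        else:
            y = x
    if y == 0:
        r = 5.0
    elif y == n_states - 1:
        r = 10000.0
    else:
        r = 0.0
    return y, r

# A short random-walk rollout from the lowermost state.
rng = np.random.default_rng(0)
x = 0
for _ in range(10):
    x, r = riverswim_step(x, int(rng.integers(2)), rng)
```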
5.2. 50 × 50 Maze with Subgoals

Another benchmark problem, MazeWithSubgoals, was suggested by Wiering and Schmidhuber (1998). The agent has to navigate in a 50 × 50 maze from the start position at (2, 2) to the goal (with +1000 reward) at the opposite corner (49, 49). There are suboptimal goals (with +500 reward) at the other two corners. The maze has blocked places and punishing states (−10 reward), set randomly in 20-20% of the squares. The agent can move in four directions, but with a 10% chance its action is replaced by a random one. If the agent tries to move to a blocked state, it gets a reward of −2. Reaching any of the goals resets the agent to the start state. In all other cases, the agent gets a −1 reward for each step.

Each algorithm was run on 20 different mazes for 100,000 steps. After every 1000 steps, we tested the learned value functions by averaging 20 test runs, in each one following the greedy policy for 10,000 steps, and averaging the cumulated (undiscounted) rewards. We measured whether the algorithms learn to collect 95%, 99% and 99.8% of the maximum possible reward within 100,000 steps, and the number of steps this takes on average when they meet the challenge. The algorithms compared were the recency-based and frequency-based exploration bonus methods, two versions of ε-greedy exploration, MBIE(WS) and OIM. All exploration rules applied the improved prioritized sweeping of Wiering and Schmidhuber (1998). OIM's Rmax was set to 1000. The results are summarized in Table 3.

Table 3. Results on the MazeWithSubgoals task: the number of steps required to learn p-optimal policies (p = 0.95, 0.99, 0.998) on the 50 × 50 maze task with suboptimal goals. In parentheses: how many runs out of 20 have found the goal. `k' stands for 1000.
  Method              95%        99%        99.8%
  ε-greedy, ε = 0.2   – (0)      – (0)      – (0)
  ε-greedy, ε = 0.4   43k (4)    52k (4)    68k (4)
  Recency-bonus       27k (19)   55k (18)   69k (9)
  Freq.-bonus         24k (20)   50k (16)   66k (10)
  MBIE(WS)            25k (20)   42k (19)   66k (18)
  OIM                 19k (20)   29k (20)   31k (20)
5.3. Chain, Loop and FlagMaze

The next three benchmark MDPs, the Chain, Loop and FlagMaze tasks, were investigated, e.g., by Meuleau and Bourgine (1999), Strens (2000) and Dearden (2000). In the Chain task, 5 states are lined up along a chain. The agent gets +2 reward for being in state 1 and +10 for being in state 5. One of the actions advances one state ahead, the other one resets the agent to state 1. The Loop task has 9 states in two loops (arranged in an 8-shape). Completing the first loop (using any combination of the two actions) yields +1 reward, while the second loop yields +2, but one of the actions resets the agent to the start. The FlagMaze task consists of a 6 × 7 maze with several walls, a start state, a goal state and 3 flags. Whenever the agent reaches the goal, her reward is the number of flags collected.

The following algorithms were compared: Q-learning with variance-based and with TD error-based exploration bonus (model-free variants), ε-greedy exploration, Boltzmann exploration, IEQL+, Bayesian Q-learning, Bayesian DP and OIM. Data were taken from Meuleau and Bourgine (1999), Strens (2000) and Dearden (2000). According to the sources, parameters for all algorithms were set optimally. OIM's Rmax parameter was set to 0.5, 10 and 0.005 for the three tasks, respectively. Each algorithm ran for 8 learning phases, and the total cumulated reward over each learning phase was measured. One phase lasted for 1000 steps for the first two tasks and 20,000 steps for the FlagMaze task. We carried out 256 parallel runs for the first two tasks and 20 for the third one. The results are summarized in Tables 4, 5 and 6.

Table 4. Average accumulated rewards on the Chain task. The optimal policy gathers 3677.
  Method           Phase 1   Phase 2   Phase 8
  QL + var.-bonus  –         2570      –
  QL + err.-bonus  –         2530¹     –
  QL ε-greedy      1519      1611      1602
  QL Boltzmann     1606      1623      –
  IEQL+            2344      2557      –
  Bayesian QL      1697      2417      –
  Bayesian DP²     3158      3611      3643
  OIM              3510      3628      3643

Table 5. Average accumulated rewards on the Loop task. The optimal policy gathers 400.
  Method           Phase 1   Phase 2   Phase 8
  QL + var.-bonus  –         179       –
  QL + err.-bonus  –         179¹      –
  QL ε-greedy      337       392       399
  QL Boltzmann     186       200       –
  IEQL+            264       293       –
  Bayesian QL      326       340       –
  Bayesian DP²     377       397       399
  OIM              393       400       400

Table 6. Average accumulated rewards on the FlagMaze task. The optimal policy gathers approximately 1890.
  Method           Phase 1   Phase 2   Phase 8
  QL ε-greedy      655       1135      1147
  QL Boltzmann     195       1024      –
  IEQL+            269       253       –
  Bayesian QL      818       1100      –
  Bayesian DP²     750       1763      1864
  OIM              1133      1169      1171

  ¹ Results for Phase 5.
  ² Augmented with a limited amount of pre-wired knowledge (the list of successor states).

6. Summary of the Results

We proposed a new algorithm for exploration and reinforcement learning in Markov decision processes. The algorithm integrates concepts from other advanced exploration methods. The key component of our algorithm is an optimistic initial model. The optimal policy according to the agent's model either explores new information that helps to make the model more accurate, or follows a near-optimal path. The extent of optimism regulates the amount of exploration. We have shown that with a suitably optimistic initialization, our algorithm finds a near-optimal policy in polynomial time. Experiments were conducted on a number of benchmark MDPs. According to the experimental results, our novel method is robust and compares favorably to other methods.
Acknowledgments

We are grateful to one of the reviewers for his helpful comments. This research has been supported by the EC FET `New Ties' Grant FP6-502386 and NEST `PERCEPT' Grant FP6-043261. Opinions and errors in this manuscript are the authors' responsibility; they do not necessarily reflect those of the EC or other project members.

References

Auer, P., & Ortner, R. (2006). Logarithmic online regret bounds for undiscounted reinforcement learning algorithms. NIPS (pp. 49-56).

Brafman, R. I., & Tennenholtz, M. (2001). R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Proc. IJCAI (pp. 953-958).

Dearden, R. W. (2000). Learning and planning in structured worlds. Doctoral dissertation, University of British Columbia.

Even-Dar, E., & Mansour, Y. (2001). Convergence of optimistic and incremental Q-learning. NIPS (pp. 1499-1506).

Kearns, M., & Singh, S. (1998). Near-optimal reinforcement learning in polynomial time. Proc. ICML (pp. 260-268).

Kearns, M., & Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49, 209-232.

Koenig, S., & Simmons, R. G. (1993). Complexity analysis of real-time reinforcement learning. Proc. AAAI (pp. 99-105).

Littman, M. L., & Szepesvári, C. (1996). A generalized reinforcement-learning model: Convergence and applications. Proc. ICML (pp. 310-318). Morgan Kaufmann.

Meuleau, N., & Bourgine, P. (1999). Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 35, 117-154.

Singh, S. P., Jaakkola, T., Littman, M. L., & Szepesvári, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38, 287-308.

Strehl, A. L., & Littman, M. L. (2005). A theoretical analysis of model-based interval estimation. Proc. ICML (pp. 856-863).

Strehl, A. L., & Littman, M. L. (2006). An analysis of model-based interval estimation for Markov decision processes. Submitted.

Strens, M. (2000). A Bayesian framework for reinforcement learning. Proc. ICML (pp. 943-950). Morgan Kaufmann, San Francisco, CA.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge.

Szita, I., & Lőrincz, A. (2008). The many faces of optimism - extended version (Technical Report). Eötvös Loránd University, Hungary.

Wiering, M. A., & Schmidhuber, J. (1998). Efficient model-based exploration. Proc. SAB: From Animals to Animats (pp. 223-228).