Adaptive Aggregation for Reinforcement Learning with Efficient Exploration: Deterministic Domains

Andrey Bernstein, Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, Israel, andreyb@tx.technion.ac.il
Nahum Shimkin, Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, Israel, shimkin@ee.technion.ac.il

Abstract

We propose a model-based learning algorithm, the Adaptive Aggregation Algorithm (AAA), that aims to solve the online, continuous state space reinforcement learning problem in a deterministic domain. The proposed algorithm uses an adaptive state aggregation approach, going from coarse to fine grids over the state space, which enables the use of finer resolution in the "important" areas of the state space, and coarser resolution elsewhere. We consider an on-line learning approach, in which we discover these important areas on-line, using an uncertainty intervals exploration technique. Polynomial learning rates in terms of mistake bound (in a PAC framework) are established for this algorithm, under appropriate continuity assumptions.

1 Introduction

Markov Decision Processes (MDPs) provide a standard framework for handling control and sequential decision making tasks under uncertainty ([4, 17]). Solid theory and a variety of algorithms enable the efficient computation of optimal control policies in MDPs when the state and action spaces are finite. However, an exact solution becomes intractable when the number of states is large or infinite. In this case, some approximation scheme is required; see [4] and [6] for a thorough discussion, as well as recent works such as [1, 11, 14]. One natural approximation approach is state aggregation, in which the state space is discretized into a (relatively small) finite collection of cells. Each cell is said to aggregate the states that fall in it. Once the aggregation is performed, the new problem is a planning problem over a reduced state space, which can be solved by standard techniques. The main question that arises here is how to perform the aggregation so as to obtain, on the one hand, a "good" approximation of an optimal policy and, on the other hand, to minimize the problem complexity. This question was addressed in many works, such as [21, 9], which provide formal answers under suitable continuity assumptions on the model parameters.
An extra difficulty is added when dealing with learning problems, namely situations where the model of the MDP is initially unknown. Reinforcement Learning (RL) encompasses a wide range of techniques for solving this problem by interacting with the environment. An important part of an RL algorithm is the exploration scheme. The role of exploration is to gain new information through appropriate action selection, which directs the agent towards unknown states of the MDP. Recently, efficient learning algorithms were presented and proved to learn nearly optimal behavior (with high probability) within a time or error bound that is polynomial in the problem size. These include the E3 [13], R-MAX [7], MBIE [18], GCB [8], UCRL [2] and OLP [20] algorithms. These algorithms use efficient exploration techniques, often based on the so-called "optimism in the face of uncertainty" principle. However, they are infeasible in cases where the state and/or action spaces are very large or infinite, since their time and space complexity is typically polynomial in the size of these spaces. On the other hand, most of the existing algorithms for solving "large" problems that rely on state aggregation are heuristic in nature, without formal guarantees. These include works on adaptive aggregation such as [15], [16] and [5]. An exception is the algorithm proposed by Diuk et al. [10], which uses R-MAX as its basis. However, this algorithm requires a specific structure of the problem; otherwise, its total mistake bound is polynomial in the size of the state space, which can be very large or infinite.

In this paper we focus on the online reinforcement learning problem in MDPs with a very large or infinite state space, a finite action space, the discounted return criterion, and deterministic dynamics and rewards. For concreteness we will focus on the continuous state case; however, our schemes and results also apply to the discrete case, where the number of states is very large or countably infinite. The proposed algorithms use an adaptive state aggregation approach, going from coarse to fine grids over the state space, which enables the use of finer resolution in the "important" areas of the state space, and coarser resolution elsewhere. We consider an on-line learning approach, in which we discover these important areas on-line, using an uncertainty intervals exploration technique. (Our uncertainty intervals are the analogue of the confidence intervals used in the stochastic case [18]; however, the origin of uncertainty in our model is the deterministic aggregation error, rather than stochastic sampling error.) Certain continuity assumptions on the basic model parameters will be imposed. Such assumptions are essential for generalization in a continuous state space, especially when using the state aggregation approach.

The principle that governs our basic scheme is simply to split frequently visited cells. The idea behind this principle is as follows: as time progresses, we will visit cells that are "close" to the optimal trajectory, and on the optimal trajectory we need high resolution. Perhaps surprisingly, this principle is not sufficient to obtain theoretical results. Consequently, we will propose an improved variant of the basic algorithm, for which learning rates in terms of a total mistake bound (see below) will be established. In this variant, in addition to splitting the visited cells, we also split cells that the algorithm "could have" visited (according to the uncertainty in the model of the MDP that was learned so far).

We will use the total mistake bound as a performance metric for our algorithms. This metric counts the total number of time-steps in which the algorithm's implemented policy is strictly suboptimal from the current state. It has been used in a number of recent works on on-line learning in discounted MDP problems [12, 18, 19], where it is referred to as the sample complexity of exploration. In our case we will establish two types of mistake bounds, which we call the prior bound and the posterior bound. The first type ensures that our algorithm is not worse than a non-adaptive algorithm that uses a single, uniformly dense grid; in this case, our mistake bound is polynomial in the number of cells in this grid. The second type ensures that the mistake bound is polynomial in the number of cells in the actually used grid. In our analysis, we need to distinguish between two cases: the "contractive" case, characterized by $\gamma\beta < 1$, where $\gamma$ is the MDP discount factor and $\beta$ is a (Lipschitz) continuity parameter of the transition function (cf. equation (6b)); and the "expansive" case, where $\gamma\beta > 1$.
Due to space constraints, we treat the former in detail, while the results for the latter are presented without proofs; these can be found in [3].

The paper is structured as follows. In Section 2 we present the model and the notation. In Section 3 we introduce some further definitions and assumptions. In Section 4 we propose a basic version of our AAA algorithm, while Section 5 presents an improved variant of the algorithm that is required for its convergence. In Section 6 polynomial bounds on the total mistake count of this improved algorithm are presented for the "contractive" case, while Section 7 proves these bounds. In Section 8 we provide results for the "expansive" case, without proof. Finally, conclusions and future work are presented in Section 9.

2 Model and Performance Metrics

We denote a deterministic MDP by the 4-tuple $M = (X, A, f, r)$, where $X$ is a state space, $A$ is an action space, $f(x,a)$ is the transition function, which specifies the next state $x' \in X$ given the previous state $x \in X$ and action $a \in A$, and $r(x,a) \in [r_{\min}, r_{\max}]$ is the immediate reward function, which specifies the reward of performing action $a \in A$ in state $x \in X$. Let $d : X \times X \to \mathbb{R}$ be a fixed metric on $X$. We assume the following regarding the state and action spaces.

Assumption 1
1. The action space $A$ is a finite set.
2. The state space $X$ is a bounded subset of $\mathbb{R}^n$. That is, there exists a constant $\Delta_{\max} < \infty$ such that $d(x, x') \le \Delta_{\max}$ for all $x, x' \in X$.

The MDP $M$ is used to model an environment, or a dynamic system, with which a learning agent interacts. The interaction proceeds as follows: at time $t$ the agent observes the state $x_t \in X$, chooses an action $a_t \in A$, receives a reward $r_t = r(x_t, a_t)$, and the process moves to state $x_{t+1} = f(x_t, a_t)$. Let $h_t \triangleq \{x_0, a_0, x_1, a_1, \ldots, x_{t-1}, a_{t-1}, x_t\}$ denote the history of observed states and actions that is available to the agent at time $t$ to make its choice of $a_t$. Also, let $H_t \triangleq (X \times A)^t \times X$ denote the space of all possible histories up to time $t$. Then, at each time $t$, the agent makes its decision according to some decision rule $\pi_t : H_t \to A$, so that $a_t = \pi_t(h_t)$, $t \ge 0$. The collection $\pi = \{\pi_t\}_{t=0}^{\infty}$ is the control policy. A policy is stationary if the decision rule does not change over time and depends only on the last state observed. We shall slightly abuse notation and identify a stationary policy with the map $\pi : X \to A$, so that at each time $t$, $a_t = \pi(x_t)$.

In this paper we focus on the discounted return criterion. For a given initial state $x_0 = x$, we denote the infinite-horizon discounted return of state $x$, for a given policy $\pi$ in MDP $M$, by
$$J^{\pi}_{M}(x) \triangleq \sum_{t=0}^{\infty} \gamma^{t}\, r\big(x_t, \pi_t(h_t)\big),$$
where $0 < \gamma < 1$ is the discount factor. The optimal return is denoted by $V_M(x) \triangleq \sup_{\pi} J^{\pi}_{M}(x)$, which is also called the optimal value function. We often drop $M$ from the notation above if it does not cause confusion. A policy $\pi$ is optimal if $J^{\pi}(x) = V(x)$ holds for all $x \in X$. For any $\epsilon > 0$, a policy $\pi$ is $\epsilon$-optimal if $J^{\pi}(x) \ge V(x) - \epsilon$ holds for all $x \in X$. It is well known ([17]) that the optimal value function satisfies Bellman's equation
$$V(x) = \max_{a \in A} \{ r(x,a) + \gamma V(f(x,a)) \}, \quad x \in X, \qquad (1)$$
and any stationary deterministic policy which satisfies $\pi(x) \in \arg\max_{a \in A} \{ r(x,a) + \gamma V(f(x,a)) \}$, $x \in X$, is an optimal policy.
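As a concrete illustration of Bellman's equation (1) and the associated greedy policy, the following minimal Python sketch solves a finite deterministic MDP (for instance, one obtained after aggregation) by value iteration. The function and argument names are illustrative assumptions, not part of the paper.

```python
def value_iteration(states, actions, f, r, gamma, tol=1e-8):
    """Solve Bellman's equation (1) for a finite deterministic MDP.

    f[(x, a)] -> next state, r[(x, a)] -> immediate reward.
    Returns the optimal value function V and a greedy (hence optimal) policy.
    """
    V = {x: 0.0 for x in states}
    while True:
        V_new = {x: max(r[(x, a)] + gamma * V[f[(x, a)]] for a in actions)
                 for x in states}
        if max(abs(V_new[x] - V[x]) for x in states) < tol:
            V = V_new
            break
        V = V_new
    # A greedy policy with respect to V is optimal (Section 2).
    policy = {x: max(actions, key=lambda a: r[(x, a)] + gamma * V[f[(x, a)]])
              for x in states}
    return V, policy
```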
Let $Q(x,a) \triangleq r(x,a) + \gamma V(f(x,a))$ denote the action-value function, or Q-function, which gives the return of choosing action $a$ in state $x$ and then following an optimal policy. Also, we let $V_{\max} \triangleq \frac{r_{\max}}{1-\gamma}$. Note that $V_{\max}$ is the maximal possible discounted return of any policy.

Our main performance metric will be the mistake bound (or policy-mistake bound), introduced for RL in [12]. It counts the number of time steps $t$ in which the algorithm executes a policy that is not $\epsilon$-optimal from the current state $x_t$. Specifically, let $\pi_t$ be the decision rule that the algorithm uses at time $t$ to choose its action. Then, given $h_t$, $A_t \triangleq \{\pi_k\}_{k=t}^{\infty}$ is the (non-stationary) policy that the algorithm implements at time $t$, and $J^{A_t}(x_t) \triangleq \sum_{k=t}^{\infty} \gamma^{k-t} r_k$ can be interpreted as the return of this policy from time $t$ onward, where $r_k = r(x_k, \pi_k(h_k))$ and $x_{k+1} = f(x_k, \pi_k(h_k))$. Now, the policy-mistake count is defined as
$$\mathrm{PM}(\epsilon) \triangleq \sum_{t=0}^{\infty} \mathbb{I}\left\{ J^{A_t}(x_t) < V(x_t) - \epsilon \right\}. \qquad (2)$$

For deterministic domains with finite state space, we have the following near-optimality criterion.

The Policy-Mistake Bound Criterion. A learning algorithm is PAC (Probably Approximately Correct) if there exists a polynomial $B = B\!\left(|X|, |A|, \tfrac{1}{\epsilon}, \tfrac{1}{1-\gamma}\right)$ such that, for all $\epsilon > 0$, $\mathrm{PM}(\epsilon) < B$.

Note that, while the "probably" aspect is absent in our deterministic case, we find it convenient to keep the PAC terminology here.

A possible alternative to the policy-mistake count is the action-mistake count, defined as
$$\mathrm{AM}(\epsilon) \triangleq \sum_{t=0}^{\infty} \mathbb{I}\left\{ Q(x_t, a_t) < V(x_t) - \epsilon \right\}. \qquad (3)$$
This criterion counts the number of sub-optimal actions, that is, the number of times that the algorithm executed an action whose action-value is $\epsilon$-inferior to the optimal value. It is easily verified (see Corollary 1 in [3]) that the policy-mistake count is the stronger criterion, in the sense that $\mathrm{AM}(\epsilon) \le \mathrm{PM}(\epsilon)$. Hence we focus here on the former.

In the above definition, the bound $B$ depends on the number of states $|X|$. In case $|X|$ is infinite, some other measure of $X$ must be considered. As already mentioned, in our case we will replace $|X|$ by the number of cells in a sufficiently dense grid over the state space.

3 Preliminaries

a. Grid-Cell Notation. A grid $S$ over the state space $X$ is a partition of $X$ into disjoint elements that covers the whole of $X$. We call any $s \in S$ a cell. We say that a grid $S_2$ is a refinement of a grid $S_1$ if for every cell $s \in S_2$ there exists $s' \in S_1$ such that $s \subseteq s'$. We denote this relation by $S_2 \preceq S_1$. For any sets $A$ and $B$ of cells, we define the intersection operator between these two sets as
$$A \sqcap B \triangleq \{ s_A \cap s_B : s_A \in A,\ s_B \in B \} \setminus \{\emptyset\}. \qquad (4)$$

For a given cell $s \in S$, let $\Delta(s) \triangleq \sup_{x, x' \in s} d(x, x')$ denote the cell size (or diameter) in the given metric $d$. For two given cells $s, s' \in S$, we define the biased distance between these cells as
$$d_b(s, s') \triangleq \begin{cases} \inf_{x \in s,\, x' \in s'} d(x, x'), & s \ne s', \\ -\Delta(s), & s = s'. \end{cases} \qquad (5)$$
This definition will be justified in Section 5 (see Definition 3 and the remark after that definition). Of course, $d_b(s, s')$ is not a distance in the regular sense, since it can be negative. Also, for a given state $x \in X$, let $s_x \in S$ be the cell that includes $x$ ($x \in s_x$). Then, given a cell $s \in S$, we define the biased state-to-cell distance $d_b(s, x) \triangleq d_b(s, s_x)$.

b. Feasible Splitting Schemes and Grids. Given a source grid $S_1$ and a candidate $s \in S_1$ to split, a splitting scheme tells us how to split $s$ into $c_{\mathrm{split}}$ cells $s_1, \ldots, s_{c_{\mathrm{split}}}$, with $s_i \cap s_j = \emptyset$ and $\bigcup_i s_i = s$, to form a refined (target) grid $S_2 \preceq S_1$. We are interested in splitting schemes that decrease the size $\Delta(s)$ of a cell $s$. More formally, we require the following condition on the splitting scheme.

Definition 1 (Feasible Splitting Scheme) A splitting scheme with splitting coefficient $c_{\mathrm{split}}$ is feasible if there exists $0 < \theta < 1$ (independent of $s$) such that $\Delta(s_i) \le \theta \Delta(s)$ for $i = 1, \ldots, c_{\mathrm{split}}$.

Now, given a fixed feasible splitting scheme and an initial grid $S_0$ over the state space $X$, we define the set of feasible grids as the set of all grids that can be obtained by using this scheme, starting from $S_0$.

c. Continuity Assumption. The following continuity assumption will be imposed on the basic model parameters.

Assumption 2 There exist constants $\alpha > 0$ and $\beta > 0$ such that, for all $x_1, x_2 \in X$ and $a \in A$, it holds that
$$|r(x_1, a) - r(x_2, a)| \le \alpha \, d(x_1, x_2), \qquad (6a)$$
$$d\big(f(x_1, a), f(x_2, a)\big) \le \beta \, d(x_1, x_2). \qquad (6b)$$

A continuity assumption of some kind is obviously essential for generalization in continuous state spaces. Assumptions of a similar nature to the one above were used in various works on state aggregation, such as [21, 9]. However, we note that the specific assumptions used in these papers refer to continuity of probability densities; consequently, they are too strong for the continuous deterministic case, as they imply that all states in a cell are mapped to the same target state. In this paper we will treat in detail the case $\gamma\beta < 1$ (where $\gamma$ is the discount factor), in which there is some "contraction" effect in the system dynamics. Results for the complementary case of $\gamma\beta > 1$ are presented without proof in Section 8. We assume that both $\alpha$ and $\beta$ are known for the purpose of learning.
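The following sketch illustrates, for axis-aligned box cells in R^n, one way to realize the cell diameter Delta(s), the biased distance of equation (5), and a feasible splitting scheme in the sense of Definition 1 (halving every axis gives c_split = 2^n and theta = 1/2). This particular cell representation is an illustrative assumption, not something prescribed by the paper.

```python
import itertools
import numpy as np

# Illustrative cell representation: an axis-aligned box (lo, hi) in R^n.

def diameter(lo, hi):
    """Cell size Delta(s): the Euclidean diameter of the box."""
    return float(np.linalg.norm(np.asarray(hi, float) - np.asarray(lo, float)))

def biased_distance(cell_a, cell_b):
    """Biased distance d_b of equation (5): infimum distance between the boxes,
    or -Delta(s) when the two cells coincide."""
    (lo_a, hi_a), (lo_b, hi_b) = cell_a, cell_b
    if np.allclose(lo_a, lo_b) and np.allclose(hi_a, hi_b):
        return -diameter(lo_a, hi_a)
    # Per-axis gap between the two boxes (0 if they overlap on that axis).
    gap = np.maximum(0.0, np.maximum(np.asarray(lo_b, float) - np.asarray(hi_a, float),
                                     np.asarray(lo_a, float) - np.asarray(hi_b, float)))
    return float(np.linalg.norm(gap))

def split_cell(lo, hi):
    """A feasible splitting scheme (Definition 1): halve every axis, giving
    c_split = 2^n sub-cells with Delta(s_i) = Delta(s)/2, i.e. theta = 1/2."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    mid = (lo + hi) / 2.0
    children = []
    for corner in itertools.product((0, 1), repeat=len(lo)):
        c = np.asarray(corner)
        children.append((np.where(c == 0, lo, mid), np.where(c == 0, mid, hi)))
    return children
```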
4 The Basic AAA Algorithm

In this section we present the basic variant of the AAA algorithm, which is directly based on the principle of splitting frequently visited cells. As it turns out, this algorithm may fail in some cases, and therefore no theoretical guarantees will be presented. Instead, we will provide an example showing the source of the problem; this will provide the motivation for the improved scheme in the next section. In the following subsections we present the different parts of this algorithm in detail. An outline of the complete algorithm is presented as Algorithm 1.

Algorithm 1 Basic Adaptive Aggregation Algorithm (outline)
Input parameters: Maximal reward $r_{\max}$, Lipschitz continuity parameters $\alpha$ and $\beta$, count threshold $N$, cell size threshold $\Delta$.
Initialization:
1. Initialize the grids to some initial grid $S_0$: $S_0(a) = S_0$ for all $a \in A$, and the cell counts $N(s,a) = 0$ for all $a \in A$, $s \in S_0$;
2. For all $s \in S_0(a)$ and $a \in A$, initialize the reward upper bound and the transition uncertainty set: $\tilde r(s,a) = r_{\max}$, $CI_f(s,a) = S_0(a)$.
For times $t = 0, 1, 2, \ldots$ do:
1. Policy Computation: Algorithm 2.
2. Policy Execution: Algorithm 3.
3. Cell Splitting: Algorithm 4.

a. Action-Grids and Common Grid. In our algorithm, we use a separate grid for every action. This allows the use of a different resolution for each action. We denote by $S_t(a)$ the grid that is used by the algorithm at time $t$ for action $a$. We denote by $S_t$ the coarsest grid which is a refinement of all $S_t(a)$ at time $t$; that is, $S_t \triangleq \sqcap_{a \in A} S_t(a)$, where the intersection operator is defined in (4). We call this grid the common grid (at time $t$). This grid will be used to compute the value function, while the action-grids are used for empirical model estimation.

b. Empirical Model. We use a single sample to estimate the reward and transition empirically. Specifically, suppose that we choose action $a$ in cell $s$. We thus obtain the sample $(x, a, r = r(x,a), x' = f(x,a))$, with $x \in s$ and $x' \in s'$. We define the empirical model based on this single sample:
$$\hat r(s,a) = r, \qquad (7)$$
$$\hat f(s,a) = s'. \qquad (8)$$

c. Uncertainty Intervals and Upper Value Function. In the AAA algorithm we use an uncertainty intervals exploration technique, adapted to deterministic systems in which the uncertainty arises from aggregation. Below we define the uncertainty intervals in the case of a continuous state space, and explain how we use them in the algorithm. (We drop the time index from most of our notation for ease of exposition.)

At any time $t$, and for every $a \in A$ and $s \in S_t(a)$, we define the reward uncertainty interval around the empirical reward (7) as
$$CI_r(s,a) \triangleq [\hat r(s,a) - \alpha\Delta(s),\ \hat r(s,a) + \alpha\Delta(s)],$$
whose upper endpoint we denote by $\tilde r(s,a)$, if the pair $(s,a)$ was sampled till time $t$; otherwise, the reward uncertainty interval for this pair is inherited from the parent cell. (Once the sample from $(s,a)$ is obtained, the model remains unchanged for this pair, until the cell is split.) By the continuity Assumption 2, this uncertainty interval satisfies $r(x,a) \in CI_r(s,a)$ for all $x \in s$. Also, the transition uncertainty set is defined as
$$CI_f(s,a) \triangleq \left\{ s' \in S_t : d_b\big(\hat f(s,a), s'\big) \le \beta\Delta(s) \right\},$$
where $d_b$ is the biased distance defined in (5). (If the pair $(s,a)$ was not sampled till time $t$, the uncertainty set is inherited from the parent cell, as in the reward case.) Again, by the continuity assumption, this uncertainty set satisfies $f(x,a) \in CI_f(s,a)$ for all $x \in s$. Using this notation, we define the following dynamic programming operator.

Definition 2 The upper DP operator at time $t$, for any given function $g : S_t \to \mathbb{R}$, is
$$(T_1 g)(s) = \max_{a \in A} \Big\{ \tilde r(s,a) + \gamma \max_{s' \in CI_f(s,a)} g(s') \Big\}.$$

Now, using this operator, we define the upper value function (UVF) as the solution of the fixed point equation $V_t(s) = (T_1 V_t)(s)$, $s \in S_t$. It can be shown that this equation has a unique solution, which can be found using Value Iteration or linear programming. Moreover, we can show that this solution is indeed an upper bound on the optimal value function $V$; in addition, on a dense enough grid, this solution is also very close to $V$. We do not provide proofs of these claims here; they follow easily from our analysis of the improved algorithm in Section 7.

The policy used in the algorithm is the optimal (or greedy) policy with respect to $V_t(s)$:
$$\pi_t(s) = \arg\max_{a \in A} \Big\{ \tilde r(s,a) + \gamma \max_{s' \in CI_f(s,a)} V_t(s') \Big\}, \quad s \in S_t.$$
This policy is recalculated only when a cell-action pair is visited for the first time, or when some cell is split. We summarize the UVF and policy computation in Algorithm 2.

Algorithm 2 Policy Computation
If the model has changed (that is, some cell-action pair has been visited for the first time, or some cell has been split):
1. Compute the UVF over $S_t = \sqcap_{a \in A} S_t(a)$ by solving
$$V_t(s) = (T_1 V_t)(s), \quad s \in S_t, \qquad (9)$$
where $T_1$ is defined in Definition 2.
2. Compute the corresponding optimal policy
$$\pi_t(s) = \arg\max_{a \in A} \Big\{ \tilde r(s,a) + \gamma \max_{s' \in CI_f(s,a)} V_t(s') \Big\}. \qquad (10)$$
If more than one action achieves the maximum, choose the first one in lexicographic order.
Otherwise, use the previously computed value and policy: $V_t = V_{t-1}$ and $\pi_t = \pi_{t-1}$.

d. Policy Execution. As we have seen, the decision rule $\pi_t$ that is used at each time $t$ is determined by the UVF, as presented in Algorithm 2 (equation (10)). In addition, in each execution of the decision rule, a new sample is obtained, and the empirical model and the uncertainty intervals of the corresponding cell-action pairs are updated. This process is summarized in Algorithm 3.

Algorithm 3 Policy Execution
(i) Execute the action $a_t = \pi_t(s_t)$, with $s_t \in S_t$ being the current common-grid cell. For the visited cell-action pair $(s_t, a_t) = (s_t, a)$, let $s \in S_t(a)$ be the cell in the action-grid that contains $s_t$.
(ii) Update the counter: $N(s,a) := N(s,a) + 1$.
(iii) If $(s,a)$ is visited for the first time, compute the model of this pair. Namely,
(a) Compute the empirical reward and transition according to equations (7) and (8).
(b) Compute the upper reward value
$$\tilde r(s,a) := \hat r(s,a) + \alpha\Delta(s). \qquad (11)$$
(c) Compute the transition uncertainty set
$$CI_f(s,a) := \left\{ s' \in S_t : d_b\big(\hat f(s,a), s'\big) \le \beta\Delta(s) \right\}. \qquad (12)$$
(d) Save the basic sample $(x, a, x')$ obtained for this $(s,a)$, with $x = x_t$ and $x' = x_{t+1}$.

e. Splitting Method. Assume that a fixed feasible splitting scheme is used throughout (cf. Definition 1). Define a count threshold $N$. We will split a cell if the number of visits to it exceeds $N$. In addition to this splitting criterion, we also employ a "stop-splitting" rule, based on the size of the cell. Let $\Delta$ be a (small) cell size threshold parameter. Then, if a cell $s$ satisfies $\Delta(s) \le \Delta$, it will not be split anymore. Since the number of times that the algorithm encounters a pair with $\Delta(s) > \Delta$ can be bounded, it follows that the number of different (stationary) policies that the algorithm uses can also be bounded. This will eventually enable us to prove a bound on the policy-mistake count, in Section 7.

Now, under a fixed feasible splitting scheme, we denote by $S_\Delta$ the coarsest feasible grid with $\Delta(s) \le \Delta$ for all $s \in S_\Delta$. We call this grid the $\Delta$-optimal grid. The number of cells in $S_\Delta$ can be bounded as follows (see Lemma 6 in Section 7):
$$|S_\Delta| \le |S_0|\, c_{\mathrm{split}} \left( \frac{\Delta_{\max}}{\Delta} \right)^{\log_{1/\theta}(c_{\mathrm{split}})}, \qquad (13)$$
where $S_0$ is the initial grid, $\theta$ and $c_{\mathrm{split}}$ are the parameters of the splitting scheme (Definition 1), and $\Delta_{\max}$ is the diameter of the state space (see Assumption 1). We summarize the splitting process in Algorithm 4. Recall that the complete AAA algorithm is outlined in Algorithm 1.

Algorithm 4 Splitting Algorithm
1. Initialize $S_{t+1}(a) = S_t(a)$, for all $a \in A$.
2. For each cell-action pair $(s,a)$, with $s \in S_t(a)$, which satisfies $N(s,a) \ge N$ and $\Delta(s) > \Delta$, perform the following:
(a) Split this cell-action pair according to the given (feasible) splitting scheme. Let $s_1, \ldots, s_{c_{\mathrm{split}}} \in S_{t+1}(a)$ be the resulting sub-cells after this split. Let $s_k$ be the cell that contains the sample of the parent cell $s$.
(b) Initialize the reward upper bounds of the new cells: $\tilde r(s_j, a) = \tilde r(s,a)$ for $j \ne k$, and $\tilde r(s_k, a) = \hat r(s,a) + \alpha\Delta(s_k)$.
(c) Initialize the transition uncertainty sets of the new cells: $CI_f(s_j, a) = CI_f(s,a)$ for $j \ne k$, and $CI_f(s_k, a) = \{ s' \in S_t : d_b(\hat f(s,a), s') \le \beta\Delta(s_k) \}$.
(d) Update the counts of the new cells as follows: $N(s_j, a) = 0$ for $j \ne k$, and $N(s_k, a) = 1$.

f. Why the Basic AAA Scheme Might Fail. To see the problem, consider some cell $s$. The value of the function $V$ at that cell is computed based on the optimistic next-step cell $s_1$:
$$s_1 \triangleq \arg\max_{s' \in CI_f(s,a)} V_t(s') \qquad (14)$$
(cf. equation (9) and Definition 2). However, it may happen that the actual process never visits $s_1$, but rather some other cell in $CI_f(s,a)$. This may happen irrespective of how small $s$ is, or how many times it is visited. This might be the case, for example, if some points (states) in $s$ map under $f(x,a)$ to the border between $s_1$ and some adjacent cell $s_2$, and all visits to $s$ are to the part that maps to $s_2$. In that case, cell $s_1$, which is not visited, will remain large, hence with a potentially large error in its empirical estimates. This may lead to a large error in the estimated value function at $s$, and consequently to an error in the computed policy. We propose a solution to this problem in the next section.
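To make the upper value function of equations (9)-(10) concrete, here is a minimal value-iteration sketch for the upper DP operator T1 of Definition 2, over a finite set of common-grid cells. The dictionary names (r_tilde, CI_f) and the calling convention are illustrative assumptions.

```python
def compute_uvf(cells, actions, r_tilde, CI_f, gamma, tol=1e-8):
    """Upper value function of equation (9): fixed point of the upper DP
    operator T1 (Definition 2), found by value iteration over the common-grid
    cells. r_tilde[(s, a)] is the reward upper bound and CI_f[(s, a)] is the
    transition uncertainty set (an iterable of cells)."""
    V = {s: 0.0 for s in cells}
    while True:
        V_new = {s: max(r_tilde[(s, a)] + gamma * max(V[s2] for s2 in CI_f[(s, a)])
                        for a in actions)
                 for s in cells}
        if max(abs(V_new[s] - V[s]) for s in cells) < tol:
            break
        V = V_new
    # Greedy policy of equation (10); ties are broken by the order of `actions`
    # (pass them sorted to get the lexicographic rule of Algorithm 2).
    policy = {s: max(actions,
                     key=lambda a: r_tilde[(s, a)] + gamma * max(V_new[s2] for s2 in CI_f[(s, a)]))
              for s in cells}
    return V_new, policy
```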
5 The AAA Algorithm

To overcome the potential pitfalls of the basic AAA algorithm above, we need to modify the definition of the upper value function so that it more closely approximates the optimal one. Two modifications will be introduced. First, we propose splitting the optimistic next-step cells (recall the cell $s_1$ defined in (14)), in addition to the actually visited ones. Those cells, which we call "virtually visited" cells, will be defined formally in Definition 5 below. However, splitting those cells is not enough to fix the problem, since it may happen that no actual samples are obtained for the created cells. Thus, we introduce a smoothing operator in the DP equations. This operator, specified in Definition 3 below, allows us to improve the accuracy of the upper value function in (small) cells even if they are not actually visited (hence not actually sampled), based on the values of their geometric neighbors.

In what follows we will focus on the case $\gamma\beta < 1$. As can be seen in the proof of Lemma 1 (Section 7), in this case we have some sort of a contraction effect; thus, the results are technically much simpler than for $\gamma\beta > 1$. The latter case is briefly discussed in Section 8, and is treated in detail in [3].

Definition 3 Let the continuity function of the optimal value be defined as
$$\omega(\delta) \triangleq \frac{\alpha\,\delta}{1-\gamma\beta}, \quad \delta \ge 0.$$
The smoothing operator at time $t$, for any given function $g : S_t \to \mathbb{R}$, is
$$(T_2 g)(s) \triangleq \min_{s' \in S_t} \left\{ g(s') + \omega\big(\Delta(s) + d_b(s, s')\big) \right\}.$$

Remark. Note that by the definition of the biased distance $d_b$ in (5), the minimized set above also includes $g(s)$, since for $s' = s$, $g(s) + \omega(\Delta(s) + d_b(s,s)) = g(s) + \omega(0) = g(s)$. Therefore, $(T_2 g)(s) \le g(s)$ for all $s \in S_t$.

It is shown in Lemma 1 in Section 7 that the continuity function $\omega$ is in fact a bound on the modulus of continuity of the optimal value function $V$. The definition of the smoothing operator is then formally justified in Lemma 3, which states that if $g$ is an upper value function, that is, $g(s) \ge V(x)$ for all $s$ and $x \in s$, then so is $T_2 g$. Thus, the smoothing operator $T_2$ tightens the upper value function $g$ based on the values in adjacent cells.

Now, using the smoothing operator, we modify the definition of the upper DP operator (Definition 2).

Definition 4 The smoothed upper DP operator at time $t$ is defined by $T \triangleq T_1 \circ T_2$. That is, for a given function $g : S_t \to \mathbb{R}$,
$$(T g)(s) = \max_{a \in A} \Big\{ \tilde r(s,a) + \gamma \max_{s' \in CI_f(s,a)} (T_2 g)(s') \Big\}.$$
This new operator smoothes $g$ before applying the DP operation to it. As before, we define the UVF as the solution of the fixed point equation $V_t(s) = (T V_t)(s)$, $s \in S_t$.

Our second modification involves the splitting of "virtually visited" cells. We next define the required notation.

Definition 5 (Virtually Visited Cells) For any period $[t_0, t_1]$ of the algorithm's execution:
1. Let $\{s_t\}_{t=t_0}^{t_1}$ be the actually visited cells $s_t \in S_t$ that are visited during this period, and $\{a_t\}_{t=t_0}^{t_1}$ be the corresponding actions, with $a_t = \pi_t(s_t)$.
2. Let $\{(\bar s_t, a_t)\}_{t=t_0}^{t_1}$ be the actually visited cell-action pairs during this period, with $\bar s_t \in S_t(a_t)$, such that $s_t \subseteq \bar s_t$.
3. Denote the virtually visited cells during this period by $\{s'_t\}_{t=t_0+1}^{t_1+1}$, where $s'_t \in S_t$ is the argument of the maximization
$$s'_{t+1} \triangleq \arg\max_{s' \in CI_f(\bar s_t, a_t)} (T_2 V_t)(s').$$
4. For each virtually visited common-grid cell $s'_t \in S_t$, let $\{s^a\}_{a \in A}$ be the action-grid cells ($s^a \in S_t(a)$) which contain $s'_t$. We define the virtually visited cell-action pair as the pair $(\tilde s_t, \tilde a_t) = (s^a, a)$ with the cell $s^a$ having the smallest diameter among those action-grid cells.

In the course of the algorithm, both actually and virtually visited cell-action pairs will be split (using a count threshold, as before). We note that the splitting of virtually visited cells is needed in the common grid $S_t$, to avoid the problem presented in Section 4. This splitting could be done directly in $S_t$; however, we have chosen to keep the relation $S_t = \sqcap_{a \in A} S_t(a)$. Thus, we will split the smallest cell $\tilde s \in S_t(a)$ which contains the virtually visited cell $s' \in S_t$ that is a candidate for splitting. In this way, the split is inherited by $S_t$.

To summarize, the following modifications are introduced in the AAA algorithm. In Algorithm 2, equation (9) is replaced by
$$V_t(s) = (T V_t)(s), \quad s \in S_t, \qquad (15)$$
where $T$ is defined in Definition 4. Also, equation (10) is replaced by
$$\pi_t(s) = \arg\max_{a \in A} \Big\{ \tilde r(s,a) + \gamma \max_{s' \in CI_f(s,a)} (T_2 V_t)(s') \Big\}. \qquad (16)$$
Finally, in Algorithm 3, step (ii), we update the counters of both the actually and the virtually visited cell-action pairs:
$$N(\bar s_t, a_t) := N(\bar s_t, a_t) + 1, \qquad N(\tilde s_t, \tilde a_t) := N(\tilde s_t, \tilde a_t) + 1.$$
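The following sketch shows one way to implement the smoothing operator T2 of Definition 3 and the smoothed upper DP operator T = T1 o T2 of Definition 4. The helper names (diameter, biased_distance, omega) are assumptions carried over from the earlier illustrative sketches.

```python
def smooth(V, cells, diameter, biased_distance, omega):
    """Smoothing operator T2 (Definition 3): tighten the upper value of each
    cell using neighboring cells plus the continuity allowance
    omega(Delta(s) + d_b(s, s'))."""
    return {s: min(V[s2] + omega(diameter(s) + biased_distance(s, s2)) for s2 in cells)
            for s in cells}

def smoothed_upper_dp(V, cells, actions, r_tilde, CI_f, gamma,
                      diameter, biased_distance, omega):
    """One application of the smoothed upper DP operator T = T1 o T2
    (Definition 4); iterating this map converges to the UVF of (15),
    since T is a gamma-contraction (Lemma 2)."""
    V2 = smooth(V, cells, diameter, biased_distance, omega)
    return {s: max(r_tilde[(s, a)] + gamma * max(V2[s2] for s2 in CI_f[(s, a)])
                   for a in actions)
            for s in cells}

# Example continuity function, assuming alpha, gamma, beta are known:
# omega = lambda delta: alpha * delta / (1.0 - gamma * beta)
```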
6 Main Results ($\gamma\beta < 1$)

In this section we summarize the main results regarding the AAA algorithm. Proofs are deferred to the next section. Recall the definition of the mistake count (2) and the corresponding near-optimality criterion. Also, recall that $S_\Delta$ is the coarsest (feasible) grid with $\Delta(s) \le \Delta$, which satisfies (13). First we present the main theorem, which provides a mistake bound for the modified AAA scheme in terms of the number of cells in $S_\Delta$.

Theorem 1 Let $\epsilon > 0$ and assume that the AAA algorithm receives as input
$$\Delta = \frac{\epsilon\,(1-\gamma)(1-\gamma\beta)}{2\alpha(\gamma+2)}.$$
Then, the policy-mistake count of the algorithm is bounded by
$$\mathrm{PM}(\epsilon) \le |S_\Delta|\,|A|\,(2N+1)\,\frac{1}{1-\gamma}\ln\frac{2(r_{\max}-r_{\min})}{(1-\gamma)\,\epsilon}.$$

In addition to the above theorem, we can obtain a possibly tighter mistake bound in terms of the posterior number of cells actually used in the course of the algorithm. In fact, the purpose of adaptive aggregation is that, as time progresses, the algorithm splits cells only in the vicinity of the optimal trajectory. Therefore, the actual number of grid cells "at infinity" will be much smaller than $|S_\Delta|$. We make this more formal below.

Definition 6 Let $x_0$ be the initial state and let $N_\infty(x_0, a)$ be the number of cells in the grid $\lim_{t\to\infty} S_t(a)$, that is,
$$N_\infty(x_0, a) \triangleq \lim_{t\to\infty} |S_t(a)|. \qquad (17)$$
Also, let $N_\infty(x_0) \triangleq \sum_{a \in A} N_\infty(x_0, a)$.

We note that the limit in (17) exists and is finite, since $|S_t(a)|$ increases in $t$, while $|S_t(a)| \le |S_\Delta|$ due to the enforced "stop-splitting" rule. For the same reason, we have the trivial bound $N_\infty(x_0) \le |S_\Delta|\,|A|$.

Theorem 2 Let $\epsilon > 0$ and assume that the AAA algorithm receives an input $\Delta$ as in Theorem 1. Then, it holds that
$$\mathrm{PM}(\epsilon) \le 4\, N_\infty(x_0)\, N\, \frac{1}{1-\gamma}\ln\frac{2(r_{\max}-r_{\min})}{(1-\gamma)\,\epsilon}.$$
Note that the bound of Theorem 2 becomes better than the bound of Theorem 1 if $N_\infty(x_0) \le \frac{1}{2}|S_\Delta|\,|A|$.

Remark. Since the action-mistake count satisfies $\mathrm{AM}(\epsilon) \le \mathrm{PM}(\epsilon)$, the policy-mistake bounds of Theorems 1 and 2 apply also to the action-mistake count.

Discussion. Theorem 1 implies that the mistake bound is linear in $|S_\Delta|$. Therefore, using equation (13), we obtain the following explicit dependence on $\epsilon$ (ignoring the log factor):
$$\mathrm{PM}(\epsilon) \le C\,(1/\epsilon)^n\,|A|\,(2N+1), \qquad (18)$$
where the constant $C$ is polynomial in $\alpha$, $\beta$, $1/(1-\gamma)$ and in $1/(1-\gamma\beta)$. Note, however, the exponential dependence on the dimension $n$ of the state space, which is an obvious artifact of the dense aggregation approach.

In the context of the posterior bound (Theorem 2), it should be noted that there is a trade-off between the choice of the count threshold $N$ and the number of cells at infinity, $N_\infty(x_0)$. If we choose $N$ too small, the algorithm will perform many splits, and consequently $N_\infty(x_0)$ will be large. In this case it may happen that the algorithm will produce redundant cells, which are not actually needed for near-optimal performance. On the other hand, if we choose a large $N$, the algorithm will perform fewer splits, resulting in a smaller $N_\infty(x_0)$; this, however, may lead to slower convergence to the optimal trajectory.

Two comparisons that may be of interest follow. First, consider our algorithm for the "flat" model, which uses a sufficiently fine grid (namely, $S_\Delta$) over the state space. It can be shown that the mistake bound in such a case will be as in Theorem 1, with $2N+1$ replaced by 1. Clearly, however, such an algorithm is not feasible if $|S_\Delta|$ is large. Moreover, consider a naive approach, where the "flat" model is treated as a finite-state MDP, and an efficient exploration technique is used on this MDP (such as the R-MAX algorithm [7]). In the ideal case, when the MDP assumption happens to hold true, such an algorithm will again have the mistake bound as in Theorem 1, with $2N+1$ replaced by 1 (see for instance [12], Theorem 8.3.5). However, as this assumption generally is not satisfied, the computed value function might underestimate the optimal one, resulting in the algorithm's failure (and, in fact, in an infinite mistake count).
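As a numerical companion to Theorem 1, the sketch below computes the prescribed cell-size threshold, the horizon time of Definition 8, and the resulting prior mistake bound. It follows the reconstruction of the theorem's constants given above and takes the cell count of equation (13) as an input; all names are illustrative.

```python
import math

def aaa_prior_mistake_bound(eps, gamma, alpha, beta, r_max, r_min, S_Delta, A, N):
    """Quantities of Theorem 1 for the contractive case (gamma * beta < 1):
    the input threshold Delta, the eps/2-horizon time of Definition 8, and
    the bound N_Delta * |A| * (2N + 1) * T_{eps/2} from the proof."""
    assert gamma * beta < 1, "Theorem 1 assumes the contractive case"
    Delta = eps * (1 - gamma) * (1 - gamma * beta) / (2 * alpha * (gamma + 2))
    V_b = (r_max - r_min) / (1 - gamma)
    T = math.ceil(math.log(2 * V_b / eps) / math.log(1 / gamma))
    return Delta, T, S_Delta * A * (2 * N + 1) * T
```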
7 Analysis of the AAA Scheme

Below is the outline of the analysis. First we show that the optimal value function $V$ possesses a continuity property, which justifies the use of the smoothing operator $T_2$. Then, we show that there exists a unique solution to equation (15), and that this solution upper bounds the optimal value $V$. Finally, we prove that, under certain conditions on the grid, the optimal policy with respect to the UVF (equation (16)) is an $\epsilon$-optimal policy, which enables us to prove a polynomial bound on the policy-mistake count of the algorithm. For ease of exposition, throughout the analysis we write $CI$ for the transition uncertainty set (instead of $CI_f$), and denote by
$$V_b \triangleq \frac{1}{1-\gamma}(r_{\max} - r_{\min})$$
the maximal difference between the returns of any two policies. Also, recall that the proofs presented below are limited to the $\gamma\beta < 1$ case.

a. Continuity of the Optimal Value Function. In this subsection we show that, under the continuity Assumption 2, the optimal value function is also Lipschitz continuous. (In the case $\gamma\beta > 1$ it is Hölder continuous; see [3] for details.)

Lemma 1 For any given $x_1, x_2 \in X$, we have that
$$|V(x_1) - V(x_2)| \le \frac{\alpha\, d(x_1, x_2)}{1-\gamma\beta} = \omega\big(d(x_1, x_2)\big).$$

Proof. Fix $x_1, x_2 \in X$. From the optimality equation (1), we have that
$$|V(x_1) - V(x_2)| \le \max_{a} |r(x_1,a) - r(x_2,a)| + \gamma \max_{a} |V(f(x_1,a)) - V(f(x_2,a))| \le \alpha\, d(x_1,x_2) + \gamma \max_{a} |V(f(x_1,a)) - V(f(x_2,a))|,$$
where the second inequality follows by Assumption 2. Also by this assumption, we have that $d(f(x_1,a), f(x_2,a)) \le \beta\, d(x_1,x_2)$ for any $a$. Applying the above inequalities iteratively, for any integer $H > 0$ we obtain the following bound:
$$|V(x_1) - V(x_2)| \le \alpha\, d(x_1,x_2) \sum_{k=0}^{H-1} (\gamma\beta)^k + \gamma^H V_b.$$
Now, since $\gamma\beta < 1$, we can take $H = \infty$ in the above bound and obtain the desired result.
b. The Upper Value Function. First we prove the contraction property of the upper DP operator used in the fixed point equation (15).

Lemma 2 The operator $T$ is a contraction mapping in the $\ell_\infty$ norm, with contraction factor $\gamma$. Thus, there exists a unique solution to equation (15).

Proof. Given two functions $g_1$ and $g_2$, we have the following sequence of inequalities:
$$|(T g_1)(s) - (T g_2)(s)| \le \gamma \max_{a \in A} \Big| \max_{s' \in CI(s,a)} (T_2 g_1)(s') - \max_{s' \in CI(s,a)} (T_2 g_2)(s') \Big| \le \gamma \max_{a \in A} \max_{s' \in CI(s,a)} |(T_2 g_1)(s') - (T_2 g_2)(s')| \le \gamma \max_{s' \in S_t} |(T_2 g_1)(s') - (T_2 g_2)(s')|.$$
Now, since
$$|(T_2 g_1)(s') - (T_2 g_2)(s')| = \Big| \min_{s'' \in S_t}\{g_1(s'') + \omega(\Delta(s') + d_b(s', s''))\} - \min_{s'' \in S_t}\{g_2(s'') + \omega(\Delta(s') + d_b(s', s''))\} \Big| \le \max_{s'' \in S_t} |g_1(s'') - g_2(s'')| = \|g_1 - g_2\|_\infty,$$
it follows that $|(T g_1)(s) - (T g_2)(s)| \le \gamma \|g_1 - g_2\|_\infty$ for all $s \in S_t$. Hence, $\|T g_1 - T g_2\|_\infty \le \gamma \|g_1 - g_2\|_\infty$, which proves the result.

We will need the following property of the smoothing operator $T_2$.

Lemma 3 If $g_1 : S_t \to \mathbb{R}$ is an upper bound on the value function (that is, $g_1(s) \ge V(x)$ for all $s \in S_t$ and $x \in s$), then so is $g_2 \triangleq T_2 g_1$.

Proof. For a given $s \in S_t$, let $s'$ be the cell that achieves the minimum in the smoothing operator $T_2$:
$$s' = \arg\min_{s'' \in S_t} \left\{ g_1(s'') + \omega\big(\Delta(s) + d_b(s, s'')\big) \right\}.$$
If $s' = s$, then by the definition of the biased distance (5) we have that $d_b(s, s') = -\Delta(s)$, implying that $\omega(\Delta(s) + d_b(s, s')) = \omega(0) = 0$; thus, $g_2(s) = g_1(s) \ge V(x)$ for all $x \in s$. Otherwise, let $x_{\min} \in \bar s$ and $x'_{\min} \in \bar s'$ be such that $d_b(s, s') = d(x_{\min}, x'_{\min})$ (here $\bar s$ denotes the closure of a set $s$). We have that
$$g_2(s) = g_1(s') + \omega\big(\Delta(s) + d_b(s, s')\big) \ge V(x'_{\min}) + \omega\big(\Delta(s) + d_b(s, s')\big) \ge V(x),$$
where the first inequality follows by the hypothesis for the state $x'_{\min} \in \bar s'$, and the second inequality holds for every $x \in s$ by Lemma 1, since $d(x, x'_{\min}) \le d(x_{\min}, x'_{\min}) + d(x, x_{\min}) \le d_b(s, s') + \Delta(s)$.

Lemma 4 The UVF $V_t$ is indeed an upper bound on the optimal value function. That is, at every time $t$, we have that $V_t(s) \ge V(x)$ for all $s \in S_t$ and $x \in s$.

Proof. Since, by Lemma 2, $T$ is a contraction operator, we can prove the claim by induction on the steps of value iteration. For the base case, let $V^0(s) \equiv V_{\max} \ge V(x)$, $x \in X$. Now assume that the claim holds for the $n$-th iterate. For the $(n+1)$-th iterate we have, by the Lipschitz continuity of the reward (Assumption 2) and by the definition of $\tilde r(s,a)$, that for all $s \in S_t$ and $x \in s$,
$$\tilde r(s,a) = r(x_s, a) + \alpha\Delta(s) \ge r(x,a),$$
where $x_s$ is the sample point in $s$. Also, by Assumption 2 and by the definition of $CI(s,a)$, it follows for any $x \in s$ that $f(x,a) \in s'$ with $s' \in CI(s,a)$. Thus,
$$\max_{s' \in CI(s,a)} (T_2 V^n)(s') \ge (T_2 V^n)\big(s' : f(x,a) \in s'\big) \ge V(f(x,a)),$$
where the last inequality follows by the induction assumption and Lemma 3. Therefore, we have
$$V^{n+1}(s) = \max_{a \in A}\Big\{ \tilde r(s,a) + \gamma \max_{s' \in CI(s,a)} (T_2 V^n)(s') \Big\} \ge \max_{a \in A}\{ r(x,a) + \gamma V(f(x,a)) \} = V(x),$$
which completes the induction step. Since $V^n \to V_t$, the result follows.
c. Near-Optimality of the UVF Optimal Policy. In this section we provide a sufficient condition on the grid which ensures that the return obtained by the policy $A_t = \{\pi_\tau\}_{\tau=t}^{\infty}$ that the algorithm implements at time $t$ is $\epsilon$-close to the UVF: $V_t(s) - J^{A_t}_M(x) \le \epsilon$, for a given $s \in S_t$ and all $x \in s$. This will imply that $V(x) - J^{A_t}_M(x) \le \epsilon$, since $V_t$ is an upper bound on the optimal value; namely, this will imply that $A_t$ is an $\epsilon$-optimal policy. To proceed, we introduce the definitions of known cell-action pairs and the escape event.

Definition 7 (Known Pairs) At any time $t$, define the set of actually known cell-action pairs:
$$AK_t \triangleq \{ (s,a) \in S_t(a) \times A : \Delta(s) \le \Delta,\ (s,a) \text{ was sampled} \}.$$
Also, define the set of virtually known cell-action pairs:
$$VK_t \triangleq \{ (s,a) \in S_t(a) \times A : \Delta(s) \le \Delta \}.$$

The following is a standard definition, which specifies a mixing time of any stationary policy in discounted MDPs.

Definition 8 ($\epsilon/2$-Horizon Time) In an MDP $M$, the $\epsilon/2$-horizon time is defined to be
$$T_{\epsilon/2} \triangleq \left\lceil \log_{1/\gamma} \frac{2 V_b}{\epsilon} \right\rceil.$$

Definition 9 (Escape Event) At any time $t$, define the actual escape event from a given starting cell $s \in S_t$:
$$AE_t(s) \triangleq \{\text{in a path starting from cell } s \text{ and following } A_t \text{ for } T_{\epsilon/2} \text{ steps in } M, \text{ an actually visited pair } (\bar s_\tau, a_\tau) \text{ not in } AK_t \text{ is encountered}\}.$$
Also, define the virtual escape event from a given starting cell $s \in S_t$:
$$VE_t(s) \triangleq \{\text{in a path starting from cell } s \text{ and following } A_t \text{ for } T_{\epsilon/2} \text{ steps in } M, \text{ a virtually visited pair } (\tilde s_\tau, \tilde a_\tau) \text{ not in } VK_t \text{ is encountered}\}.$$
Finally, the escape event is defined as $E_t(s) = AE_t(s) \cup VE_t(s)$.

Definition 10 (Episode) An episode is a maximal period of time $[t_0, t_1]$ in which:
(i) all actually and virtually visited cell-action pairs till time $t_1 - 1$ satisfy $\Delta(\bar s_t) \le \Delta$ and $\Delta(\tilde s_t) \le \Delta$, and, in addition, $\{(\bar s_t, a_t)\}_{t=t_0}^{t_1-1}$ were previously sampled (that is, were previously visited);
(ii) at time $t = t_1$, the algorithm encounters a pair for which the condition in (i) is not true.

Note that during each episode a fixed stationary policy is used by the algorithm, and the policy is (potentially) changed only at the beginning of each episode. Also, observe that the above definitions depend on the cell size threshold parameter $\Delta$. The next lemma formulates the condition on $\Delta$ which implies that the execution of the algorithm's implemented policy $A_t$ from time $t$ obtains a return which is $\epsilon$-close to the UVF.

Lemma 5 Let $\epsilon > 0$ be given and assume that
$$\Delta \le \frac{\epsilon\,(1-\gamma)(1-\gamma\beta)}{2\alpha(\gamma+2)}. \qquad (19)$$
Then,
$$V_t(s) - J^{A_t}_M(x) \le \epsilon + \mathbb{I}\{E_t(s)\}\, V_b$$
holds for all $t$, $s \in S_t$ and $x \in s$.

Proof. At a given time $t_0$, we consider the execution of the (non-stationary) policy $A_{t_0}$ for $T_{\epsilon/2}$ time steps in $M$. We will use the notation of visited grid cells specified in Definition 5, with $t_1 = t_0 + T_{\epsilon/2} - 1$. Now, we have two mutually exclusive cases: (a) for all $(\bar s_t, a_t)$ it holds that $\Delta(\bar s_t) \le \Delta$ and $(\bar s_t, a_t)$ was sampled, and for all $s'_t$ it holds that $\Delta(s'_t) \le \Delta$; (b) there exists at least one $t \in [t_0, t_1]$ such that the above condition regarding either $\bar s_t$ or $s'_t$ does not hold.

Case (b) is easy: if it happens, we have that either a pair $(\bar s_t, a_t)$ not in $AK_t$ is encountered, or a virtually visited cell $s'_t$ with $\Delta(s'_t) > \Delta$ is encountered. In the latter case, since the corresponding virtually visited cell-action pair $(\tilde s_t, \tilde a_t)$ satisfies $s'_t \subseteq \tilde s_t$, we have that this pair is not in $VK_t$. Thus, the escape event $E_{t_0}(s)$ occurred during the execution of $A_{t_0}$ for $T_{\epsilon/2}$ time-steps, which is expressed by the $\mathbb{I}\{E_{t_0}(s)\} V_b$ term in the bound.

Now, if (a) is the case, during the execution of $A_{t_0}$ for $T_{\epsilon/2}$ time steps we stay in the same episode, and thus the algorithm's policy remains unchanged and is the stationary policy $\pi_{t_0}$. For simplicity, assume $t_0 = 0$, write $\pi$ for $\pi_0$ and $V$ for $V_0$, and recall that $\pi$ is the greedy policy with respect to $V$ (equation (16)). Thus,
$$V(s_0) = \tilde r(s_0, a_0) + \gamma\, (T_2 V)(s'_1).$$
Also, by Bellman's equation,
$$J^{\pi}_M(x_0) = r(x_0, a_0) + \gamma\, J^{\pi}_M(x_1).$$
Now, we have that
$$V(s_0) - J^{\pi}_M(x_0) \le 2\alpha\Delta(s_0) + \gamma\left[ (T_2 V)(s'_1) - J^{\pi}_M(x_1) \right] \le 2\alpha\Delta(s_0) + \gamma\left[ V(s_1) + \omega\big(\Delta(s'_1) + d_b(s'_1, s_1)\big) - J^{\pi}_M(x_1) \right] \le 2\alpha\Delta(s_0) + \gamma\left[ \omega\big(\Delta(s'_1) + 2\beta\Delta(s_0)\big) + V(s_1) - J^{\pi}_M(x_1) \right].$$
The first inequality follows by the definition of virtually visited cells and by the continuity assumption on the reward, since
$$\tilde r(s_0, a_0) - r(x_0, a_0) = r(\bar x, a_0) + \alpha\Delta(s_0) - r(x_0, a_0) \le 2\alpha\Delta(s_0),$$
where $\bar x$ is the sample that was received for this cell-action pair. The second inequality follows by the definition of the smoothing operator (Definition 3); note that $s'_1 \in S_1$ by its definition (cf. Definition 5). Finally, the third inequality follows since both $s_1$ and $s'_1$ are in $CI(s_0, a_0)$, having
$$d_b(s'_1, s_1) \le d_b\big(s'_1, \hat f(s_0, a_0)\big) + d_b\big(\hat f(s_0, a_0), s_1\big) \le \beta\Delta(s_0) + \beta\Delta(s_0).$$
Thus, proceeding iteratively, we obtain the following bound:
$$V(s_0) - J^{\pi}_M(x_0) \le \sum_{t=0}^{T_{\epsilon/2}-1} \gamma^t \left[ 2\alpha\Delta(s_t) + \gamma\,\omega\big(\Delta(s'_{t+1}) + 2\beta\Delta(s_t)\big) \right] + \gamma^{T_{\epsilon/2}} V_b.$$
By the definition of $T_{\epsilon/2}$ (Definition 8), we have $\gamma^{T_{\epsilon/2}} V_b \le \epsilon/2$. Now, we have to check that condition (19) of the lemma regarding $\Delta$ implies that
$$\sum_{t=0}^{T_{\epsilon/2}-1} \gamma^t \left[ 2\alpha\Delta(s_t) + \gamma\,\omega\big(\Delta(s'_{t+1}) + 2\beta\Delta(s_t)\big) \right] \le \frac{\epsilon}{2}.$$
Indeed,
$$\sum_{t=0}^{T_{\epsilon/2}-1} \gamma^t \left[ 2\alpha\Delta(s_t) + \gamma\,\omega\big(\Delta(s'_{t+1}) + 2\beta\Delta(s_t)\big) \right] \le \sum_{t=0}^{\infty} \gamma^t \left[ 2\alpha\Delta + \gamma\,\omega\big((1+2\beta)\Delta\big) \right] = \frac{2\alpha\Delta}{1-\gamma}\left( 1 + \frac{\gamma(1+2\beta)}{2(1-\gamma\beta)} \right) = \frac{\alpha\Delta(\gamma+2)}{(1-\gamma)(1-\gamma\beta)} \le \frac{\epsilon}{2},$$
where the first inequality follows since we are addressing case (a), in which all actually and virtually visited cells have diameter at most $\Delta$, and by taking the infinite sum instead of the finite one; the first equality follows by the definition of $\omega$ (see Definition 3); and the last inequality follows by hypothesis (19) of the lemma. This completes the proof of the lemma.

d. Number of Cells in the $\Delta$-Optimal Grid. Before proving the mistake bounds, we provide an upper bound on the number of cells $N_\Delta = |S_\Delta|$.

Lemma 6 For a fixed feasible splitting scheme, with parameters $c_{\mathrm{split}}$ and $\theta$, and a single initial grid $S_0$, we have that
$$N_\Delta \le |S_0|\, c_{\mathrm{split}} \left( \frac{\Delta_{\max}}{\Delta} \right)^{\log_{1/\theta}(c_{\mathrm{split}})}.$$

Proof. For every $s \in S_0$, consider performing $k(s)$ splits iteratively, such that at each iteration we obtain $c_{\mathrm{split}}$ new cells instead of the original one. It follows that after $k(s)$ such splits, the size of a split cell $s' \subseteq s$ satisfies $\Delta(s') \le \theta^{k(s)} \Delta(s)$. In addition, the number of cells in the grid that contains all such cells $s'$ is
$$N_\Delta = \sum_{s \in S_0} c_{\mathrm{split}}^{\,k(s)}. \qquad (20)$$
Thus, for each $s \in S_0$, we need to find the minimal $k(s)$ such that
$$\theta^{k(s)} \Delta(s) \le \Delta. \qquad (21)$$
From (21), it follows that this minimal $k(s) = k^*(s)$ satisfies
$$\log_{1/\theta} \frac{\Delta(s)}{\Delta} \le k^*(s) < \log_{1/\theta} \frac{\Delta(s)}{\Delta} + 1.$$
Substituting the last inequality in (20) yields
$$N_\Delta = \sum_{s \in S_0} c_{\mathrm{split}}^{\,k^*(s)} \le c_{\mathrm{split}} \sum_{s \in S_0} c_{\mathrm{split}}^{\,\log_{1/\theta}(\Delta(s)/\Delta)} = c_{\mathrm{split}} \sum_{s \in S_0} \left(\frac{\Delta(s)}{\Delta}\right)^{\log_{1/\theta}(c_{\mathrm{split}})} \le |S_0|\, c_{\mathrm{split}} \left(\frac{\Delta_{\max}}{\Delta}\right)^{\log_{1/\theta}(c_{\mathrm{split}})},$$
which completes the proof.

Remark. We note that Lemma 6 shows an exponential dependence of $N_\Delta$ on the state space dimension $n$, since in most cases $\log_{1/\theta}(c_{\mathrm{split}})$ is of the order of $n$.
e. Proof of Theorem 1. First, we note that the escape event $E_t(s)$ (Definition 9) can be viewed as an exploration event. If it occurs at some time $t \ge 0$, the algorithm will encounter (in an execution of length $T_{\epsilon/2}$) a cell-action pair $(s,a)$ (either actually or virtually) with $s$ that is not in $S_\Delta$. In addition, in the case of an actual escape event (see Definition 9), this pair was not sampled. This fact can be interpreted as a "discovery" of new information, since every such occurrence of an "unknown" pair will lead to an increase of the count of that pair and, eventually, to a split of that pair. The next two lemmas show that the number of times that "actual" and "virtual" escape events can occur is bounded.

Lemma 7 The number of times that $AE_t(s)$ can occur is bounded by $N_\Delta |A| (N+1) T_{\epsilon/2}$.

Proof. Note that any cell $s \in S_t(a)$, for any $a$ and $t$, can be visited no more than $N$ times; after this number of visits, the cell is split. Now think of the grid representation of the state space as a tree, with cells as leaves. The internal nodes in such a tree represent the larger aggregations that were used in previous episodes. Now, the number of such internal nodes is less than or equal to the number of leaves, since the splitting coefficient is greater than or equal to 2. Using this tree representation, a visit to an "unknown" pair can be interpreted as a visit to an internal node of $S_\Delta$. Since the counter of this pair is incremented in this visit, by a simple counting argument (a.k.a. the Pigeonhole Principle), the number of times that the algorithm can encounter an internal node of $S_\Delta$ is bounded by
$$(\text{number of internal nodes of } S_\Delta) \cdot N \cdot |A| \le N_\Delta N |A|.$$
Finally, when the algorithm encounters a leaf of $S_\Delta$, only one such occurrence is sufficient in order for the desired cell to become sampled. Again, by a simple counting argument, the number of times this can occur is bounded by $(\text{number of leaves of } S_\Delta) \cdot |A| = N_\Delta |A|$. To conclude, the number of times that an "unknown" cell-action pair can be encountered is bounded by $N_\Delta N |A| + N_\Delta |A| = N_\Delta (N+1)|A|$.

At each time $t$, consider the execution of a (non-stationary) policy $A_t$ for $T_{\epsilon/2}$ time steps in $M$. We have two mutually exclusive cases: (a) if, starting at time $t$, we execute the policy $A_t$ for $T_{\epsilon/2}$ time steps without encountering an "unknown" pair (that is, a pair not in $AK_t$), there is no occurrence of the escape event $AE_t(s)$; (b) if, starting at time $t$, we execute the policy $A_t$ for $T_{\epsilon/2}$ time steps and encounter at least one unknown pair at some time $\bar t$, $t \le \bar t \le t + T_{\epsilon/2}$, the escape event $AE_t(s)$ occurs. We then wish to bound the number of time steps for which (b) is the case. In the worst case we will encounter an unknown pair at the end of the execution period of length $T_{\epsilon/2}$. In this case, all the succeeding executions, started at times $t'$ with $t < t' \le t + T_{\epsilon/2}$, will also encounter this unknown pair. That is, if $AE_t(x_t)$ occurs at some time $t$, then $AE_{t'}(x_{t'})$ for $t < t' \le t + T_{\epsilon/2}$ will also occur, in the worst case. Since after $N_\Delta (N+1)|A|$ visits to unknown pairs all the pairs become known, $AE_t(s)$ can occur at most $N_\Delta (N+1)|A|\, T_{\epsilon/2}$ times.

Lemma 8 The number of times that $VE_t(s)$ can occur is bounded by $N_\Delta |A| N T_{\epsilon/2}$.

Proof. The proof is similar to the proof of Lemma 7, with the difference that the virtual escape event cannot occur on the leaves of $S_\Delta$.

Lemma 9 The number of times that the escape event $E_t(s)$ can occur is bounded by $N_\Delta |A| (2N+1) T_{\epsilon/2}$.

Proof. Follows by Lemmas 7 and 8 and Definition 9.

Finally, we prove the main theorem regarding the mistake bound of the AAA algorithm.

Proof of Theorem 1. For each time $t$, we consider the execution of policy $A_t$ for $T_{\epsilon/2}$ time-steps in $M$, with the initial state in each such execution $x_t$ ($x_t \in s_t$). We then have that
$$J^{A_t}_M(x_t) \ge V_t(s_t) - \epsilon - \mathbb{I}\{E_t(s_t)\} V_b \ge V(x_t) - \epsilon - \mathbb{I}\{E_t(s_t)\} V_b,$$
where the first inequality holds by Lemma 5, and the second inequality holds by Lemma 4. However, by Lemma 9, the number of times the event $E_t(s_t)$ can occur is bounded by $N_\Delta (2N+1) |A| T_{\epsilon/2}$, implying that
$$\sum_{t=0}^{\infty} \mathbb{I}\left\{ J^{A_t}_M(x_t) < V(x_t) - \epsilon \right\} \le N_\Delta |A| (2N+1) T_{\epsilon/2},$$
which completes the proof of the theorem, using the definition of the $\epsilon/2$-horizon time and the fact that $\log_{1/\gamma} C \le \frac{1}{1-\gamma}\ln C$ for any $C$.

f. Proof of Theorem 2. Recall the definition of the posterior number $N_\infty(x_0)$ of actually used cells in the course of the algorithm (Definition 6). We only need to prove the analogue of Lemma 9 in this case; the rest of the proof is exactly the same as that of Theorem 1. Thus, we need to bound the number of times that an escape event occurs. Here we consider the trees that represent the grids "at infinity", namely $S_\infty(a) = \lim_{t\to\infty} S_t(a)$, $a \in A$, instead of the $\Delta$-optimal grid $S_\Delta$.

First, consider the actual escape event. As previously, this event can occur on internal nodes of $S_\infty(a)$ no more than $(\text{number of internal nodes of } S_\infty(a)) \cdot N \le N_\infty(x_0, a) N$ times. The leaves of the tree (which are the cells of $S_\infty(a)$) can be classified into two groups: (a) "small" leaves, with $\Delta(s) \le \Delta$, and (b) "large" leaves, with $\Delta(s) > \Delta$. On "small" leaves, only one occurrence of the escape event is possible, since such a cell becomes known (Definition 7) after one sample. On "large" leaves, there cannot be more than $N$ occurrences of the escape event; otherwise these cells would have been split. Thus, the number of times the escape event can occur on leaves is bounded by $(\text{number of leaves of } S_\infty(a)) \cdot N = N_\infty(x_0, a) N$. To summarize, the number of times that the (actual) escape event can occur on all nodes, for all actions $a \in A$, is bounded by $\sum_{a \in A} 2 N_\infty(x_0, a) N$. Similarly, the virtual escape event cannot occur more than $\sum_{a \in A} 2 N_\infty(x_0, a) N$ times. Note that there is no difference in the bounds on the number of occurrences of actual and virtual escape events, since the cells of $S_\infty(a)$ can be "large" ones, that is, having $\Delta(s) > \Delta$; thus, the virtual escape event can also happen on leaves. By the same arguments as in the proof of Theorem 1, the sum of the above two bounds, times the $\epsilon/2$-horizon time $T_{\epsilon/2}$, is the mistake bound of the algorithm.
8 The Expansive Case ($\gamma\beta > 1$)

Our analysis above focused on the case $\gamma\beta < 1$. When $\gamma\beta > 1$, the analysis becomes more involved. This can be observed, for example, from the bound on the distance between the optimal values of two states, presented in the proof of Lemma 1:
$$|V(x_1) - V(x_2)| \le \alpha\, d(x_1, x_2) \sum_{k=0}^{H-1} (\gamma\beta)^k + \gamma^H V_b. \qquad (22)$$
If $\gamma\beta > 1$, instead of bounding the infinite sum of distances between future rewards, we have to employ a "cut-off" tactic. Specifically, we have to balance the first term in (22), which grows exponentially in $H$, against the second term, which decays exponentially in $H$. A detailed treatment of this case can be found in [3] and is omitted here due to space limitations. Using the approach outlined above, it is shown that, to obtain the mistake bounds of Theorems 1 and 2, the cell size threshold should be taken as $\Delta = K\epsilon^{\lambda}$, where $\lambda \triangleq \log_{1/\gamma}\beta$, and $K$ is polynomial in $\alpha$, $\beta$, $1/(1-\gamma)$ and exponential in $\lambda$; note that $\lambda > 1$, and compare this condition to (19) in the case $\gamma\beta < 1$. As a result, in the expansive case we obtain a worse explicit dependence of the mistake bound on $\epsilon$, as follows:
$$\mathrm{PM}(\epsilon) \le C\,(1/\epsilon)^{\lambda n}\,|A|\,(2N+1),$$
where $C$ is polynomial in $\alpha$, $\beta$, $1/(1-\gamma)$, and is exponential in $\lambda$; compare this bound to (18).
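To illustrate the cut-off argument sketched above (the full treatment is in [3]), the following heuristic calculation balances the two terms of (22); it is only meant to motivate the exponent $\lambda$ and is not a substitute for the proof.

```latex
% Heuristic balancing of the two terms in (22), assuming gamma*beta > 1.
\[
  \alpha\, d(x_1,x_2) \sum_{k=0}^{H-1} (\gamma\beta)^k
  \;\approx\; \alpha\, d(x_1,x_2)\, \frac{(\gamma\beta)^{H}}{\gamma\beta - 1},
  \qquad\text{versus}\qquad
  \gamma^{H} V_b .
\]
% Equating the growth of beta^H * d with V_b suggests H^* ~ log_beta(V_b / (alpha d)),
% for which gamma^{H^*} = (alpha d / V_b)^{log_beta(1/gamma)} = (alpha d / V_b)^{1/lambda},
% with lambda := log_{1/gamma} beta > 1. Hence V is Holder continuous with exponent
% 1/lambda, which motivates the choice Delta = K * epsilon^lambda stated above.
```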
9 Conclusion

We presented a model-based learning algorithm that aims to solve the online, continuous state space reinforcement learning problem in a deterministic domain, under continuity assumptions on the model parameters. We note that we did not address the issue of computational complexity at all; the goal of the analysis was to show feasibility in the sense of sample efficiency.

Some ideas for improvement of the proposed algorithm and its analysis follow. First, it would be interesting, from a computational perspective, to formulate an on-line asynchronous variant that performs only one backup of Value Iteration at each time-step, instead of an exact calculation, and to analyze its performance. Also, the explicit dependence of the posterior number of cells on the number of cells needed on the optimal trajectory remains an open and difficult question which requires new tools for its analysis. In addition, more elaborate splitting rules and possible merging schemes should be considered. Finally, evaluation of the algorithm using simulations would be interesting from the practical point of view.

The extension of similar ideas to stochastic domains seems possible, under a different continuity assumption (namely, continuity of the transition density, as in [9]). Preliminary results for this case can be found in [3]. A possible future direction here is to formulate an algorithm that works for both the stochastic and deterministic cases, under a unified continuity assumption. Another possible extension is to the case of a continuous action space A, which can be approached using aggregation, similarly to the state space. Finally, other reward criteria should be considered: average reward (with associated loss bounds) and shortest path problems (total reward). In particular, the shortest path formulation is more natural in deterministic problems such as navigation in a maze.

We would like to thank the reviewers for their useful comments on this paper.

References

[1] A. Antos, R. Munos, and C. Szepesvari. Fitted Q-iteration in continuous action-space MDPs. In Proceedings of Neural Information Processing Systems Conference (NIPS), 2007.
[2] P. Auer and R. Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Proceedings of Neural Information Processing Systems Conference (NIPS), 2006.
[3] A. Bernstein. Adaptive state aggregation for reinforcement learning. Master's thesis, Technion - Israel Institute of Technology, 2007. URL: http://tx.technion.ac.il/andreyb/MSc Thesis final.pdf.
[4] D. P. Bertsekas. Dynamic Programming and Optimal Control, vol. 2. Athena Scientific, Belmont, MA, third edition, 2007.
[5] A. Bonarini, A. Lazaric, and M. Restelli. LEAP: an adaptive multi-resolution reinforcement learning algorithm. To appear.
[6] C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1-94, 1999.
[7] R. I. Brafman and M. Tennenholtz. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2002.
[8] H. Chapman. Global confidence bound algorithms for the exploration-exploitation tradeoff in reinforcement learning. Master's thesis, Technion - Israel Institute of Technology, 2007.
[9] C.-S. Chow and J. N. Tsitsiklis. An optimal one-way multigrid algorithm for discrete-time stochastic control. IEEE Transactions on Automatic Control, 36(8):898-914, 1991.
[10] C. Diuk, A. L. Strehl, and M. L. Littman. A hierarchical approach to efficient reinforcement learning in deterministic domains. In Proceedings of the 5th International Joint Conference on Autonomous Agents and Multiagent Systems, pages 313-319, 2006.
[11] G. J. Gordon. Reinforcement learning with function approximation converges to a region. In Advances in Neural Information Processing Systems (NIPS) 12, pages 1040-1046, 2000.
[12] S. M. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, UK, 2003.
[13] M. Kearns and S. P. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49:209-232, 2002.
[14] I. Menache, S. Mannor, and N. Shimkin. Q-Cut - dynamic discovery of sub-goals in reinforcement learning. In Proceedings of the 13th European Conference on Machine Learning (ECML 2002), pages 187-195, 2002.
[15] A. W. Moore and C. G. Atkeson. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning, 21:199-233, 1995.
[16] R. Munos and A. W. Moore. Variable resolution discretization in optimal control. Machine Learning, 49:291-323, 2002.
[17] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.
[18] A. L. Strehl and M. L. Littman. A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd International Conference on Machine Learning, pages 857-864, 2005.
[19] A. L. Strehl, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881-888, 2006.
[20] A. Tewari and P. L. Bartlett. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In Proceedings of Neural Information Processing Systems Conference (NIPS), 2007.
[21] W. Whitt. Approximations of dynamic programs, I. Mathematics of Operations Research, 3(3):231-243, 1978.