Predictive Representations for Policy Gradient in POMDPs

Abdeslam Boularias  boularias@damas.ift.ulaval.ca
Brahim Chaib-draa  chaib@damas.ift.ulaval.ca
Department of Computer Science and Software Engineering, Laval University, Quebec, Canada, G1K 7P4

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We consider the problem of estimating the policy gradient in Partially Observable Markov Decision Processes (POMDPs) with a special class of policies that are based on Predictive State Representations (PSRs). We compare PSR policies to Finite-State Controllers (FSCs), which are considered the standard model for policy gradient methods in POMDPs. We present a general Actor-Critic algorithm for learning both FSCs and PSR policies. The critic part computes a value function whose variables are the parameters of the policy; these parameters are gradually updated to maximize the value function. We show that the value function is polynomial in the parameters for both FSCs and PSR policies, with a potentially smaller degree in the case of PSR policies. Therefore, the value function of a PSR policy can have fewer local optima than that of the equivalent FSC, and consequently the gradient algorithm is more likely to converge to a globally optimal solution.

1. Introduction

Learning an optimal policy in a partially observable environment is still considered one of the most difficult challenges in Reinforcement Learning (RL).
To achieve this, one must consider two main issues: (1) mapping histories of interaction with the environment into an internal state representation, and (2) finding the optimal decisions associated to each internal state. Partially Observable Markov Decision Processes (POMDPs) provide a rich mathematical framework for studying such problems. In POMDPs, the environment is represented by a finite set of states, and the effects of actions on the environment are represented by probabilistic transition functions. After every executed action, the agent perceives a numerical reward and a partial, possibly noisy, observation. Instead of keeping track of ever-growing histories, Finite-State Controllers (FSCs) provide an efficient graphical model for mapping histories into equivalence classes called internal states (I-states). Most policy gradient methods for POMDPs are based on this family of policies (Meuleau et al., 1999; Baxter & Bartlett, 2000; Shelton, 2001). An FSC is akin to a probabilistic automaton where the observations are the inputs and the actions are the outputs, with stochastic transitions between the I-states. The main drawback of FSCs is that I-states do not directly affect the state of the environment. In fact, only the sequence of executed actions determines the returned outcome, and the same sequence of actions may be generated by different sequences of I-states. Thus, considering the sequence of I-states within the learning data slows down the convergence of RL algorithms (Aberdeen & Baxter, 2002).

The principal contribution of this paper is a new policy gradient method for finite-horizon POMDPs based on a Predictive State Representation (PSR) policy. Aberdeen et al. (2007) have already proposed to use PSRs for reinforcement learning in partially observable domains. However, they used PSRs for learning the dynamics of the environment, and mapped the belief states into a distribution over actions through a linear approximator. In our approach, we use PSRs to represent policies directly, without learning a model of the environment. The parameters of PSR policies are updated according to the gradient of a value function. Our second contribution is a comparison of PSR policies to FSCs. For this purpose, we use the same basic method for learning both FSCs and PSR policies, and show that the value function of a PSR policy is potentially less complex than the value function of an FSC policy. This result is due to the fact that the parameters of a PSR policy are based only on observable data. Finally, we empirically demonstrate the effectiveness of PSR policies on several standard benchmarks.

2. Background

We review the POMDP, PSR and FSC models, and show how PSRs can be adapted to represent policies.

2.1. POMDPs

Formally, a POMDP is defined by the following components: a finite set of hidden states S; a finite set of actions A; a finite set of observations O; a set of transition functions {T^a}, where T^a(s, s') is the probability that the agent ends up in state s' after taking action a in state s; a set of observation functions {O^{a,o}}, where O^{a,o}(s') gives the probability that the agent receives observation o after taking action a and arriving in state s'; and a reward function r, such that r(s, a) is the immediate reward received when the agent executes action a in state s. Additionally, there can be a discount factor γ ∈ [0, 1], which is used to give less weight to rewards received further in the future. The horizon H indicates the maximum number of actions that an agent can execute while performing the given task.

Instead of memorizing a complete history of actions and observations h_t = a_1 o_1 ... a_t o_t, a probability distribution over the hidden states, called the belief state, is sufficient to make predictions about the future observations and rewards. The belief state is a vector b_t, where b_t(s) = Pr(s_t = s | h_t), ∀s ∈ S. The expected reward of an action a given a history h_t is given by:

    R(a | h_t) = ∑_{s∈S} b_t(s) r(s, a)    (1)

2.2. PSRs

PSRs (Littman et al., 2002) are an alternative model for representing partially observable environments without using hidden states. The fundamental idea of PSRs is to replace the probabilities on states by probabilities of particular future trajectories, called core tests, or by weights of particular past trajectories, called core histories. A test q is an ordered sequence of (action, observation) couples, i.e. q = a^1 o^1 ... a^k o^k. The probability of a test q starting after a history h_t is defined by:

    Pr(q^o | h_t, q^a) = Pr(o_{t+1} = o^1, ..., o_{t+k} = o^k | h_t, a_{t+1} = a^1, ..., a_{t+k} = a^k)    (2)

where q^a denotes the actions of q and q^o denotes its observations. The probability of any test q after a history h_t depends linearly on the probabilities of the same test after the different core histories. We use H to indicate the set of core histories. The PSR belief state (the same notation b_t is used for the different types of belief state; the interpretation depends on the model) is a vector b_t, where b_t(h) is the weight of the core history h ∈ H in the current history h_t. The probability of any test q after a history h_t is given by:

    Pr(q^o | h_t, q^a) = ∑_{h∈H} b_t(h) Pr(q^o | h, q^a) = b_t^T m_q    (3)

where m_q(h) := Pr(q^o | h, q^a) is independent of h_t, and b_t(h) is independent of q. The belief state b_t ∈ R^{|H|} is a vector of real-valued weights, not probabilities. After executing an action a and receiving an observation o, the PSR belief is updated by Bayes' rule:

    Pr(q^o | h_t, a, o, q^a) = Pr(o q^o | h_t, a, q^a) / Pr(o | h_t, a)
      = ( ∑_{h∈H} b_t(h) Pr(o q^o | h, a, q^a) ) / ( ∑_{h∈H} b_t(h) Pr(o | h, a) )
      = ( ∑_{h∈H} b_t(h) Pr(o | h, a) Pr(q^o | hao, q^a) ) / ( ∑_{h∈H} b_t(h) Pr(o | h, a) )
      = ( ∑_{h∈H} b_t(h) m^{ao}(h) ∑_{h'∈H} b_{hao}(h') m_q(h') ) / ( ∑_{h∈H} b_t(h) m^{ao}(h) )
      = ∑_{h∈H} b_{t+1}(h) m_q(h) = b_{t+1}^T m_q    (4)

where m^{ao}(h) := Pr(o | h, a) and

    b_{t+1}(h) = ( ∑_{h'∈H} b_t(h') b_{h'ao}(h) m^{ao}(h') ) / ( ∑_{h'∈H} b_t(h') m^{ao}(h') )    (5)
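To make Equations (3)-(5) concrete, the following is a minimal sketch of the prediction and belief-update computations of a core-history PSR, assuming the quantities m^{ao} and b_{hao} are already available as numpy arrays indexed by core-history, action and observation indices. The class and attribute names (PSRModel, m_ao, b_hao) are illustrative only and do not come from the paper.

```python
import numpy as np

class PSRModel:
    """Core-history PSR with n = |H| core histories indexed 0..n-1."""
    def __init__(self, m_ao, b_hao, b0):
        # m_ao[a][o] is a length-n vector with m_ao[a][o][h] = Pr(o | h, a)
        # b_hao[h][a][o] is the length-n belief vector at the extension h a o
        # b0 is the initial belief (weights of the core histories)
        self.m_ao = m_ao
        self.b_hao = b_hao
        self.b = np.asarray(b0, dtype=float)

    def predict(self, m_q):
        # Equation (3): Pr(q^o | h_t, q^a) = b_t^T m_q
        return float(self.b @ np.asarray(m_q, dtype=float))

    def update(self, a, o):
        # Equations (4)-(5): Bayes update after executing a and observing o
        m = np.asarray(self.m_ao[a][o], dtype=float)   # m^{ao}(h) = Pr(o | h, a)
        denom = float(self.b @ m)                      # Pr(o | h_t, a)
        new_b = np.zeros_like(self.b)
        for h, weight in enumerate(self.b):
            # each core history h passes its weight to the belief of its extension hao
            new_b += weight * m[h] * np.asarray(self.b_hao[h][a][o], dtype=float)
        self.b = new_b / denom
        return denom                                   # one-step observation probability
```

The update is exactly Equation (5): each core history h contributes its extension belief b_{hao}, weighted by b_t(h) m^{ao}(h), and the result is renormalized by Pr(o | h_t, a).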
The parameters of a PSR based on core histories are: A, O, the set of core histories H, and the vectors m^{ao} and b_{hao}, ∀a ∈ A, o ∈ O, h ∈ H. The vector b_{hao} denotes the belief state at the history hao. Notice that the vectors m^{ao} define a probability distribution over the observations o for each action a and core history h; thus, they can be represented by a softmax function. This property is not satisfied by PSRs based on core tests, and it underlies our choice of using core histories in this paper. However, most of the results that we present also apply to PSRs with core tests.

2.3. FSCs

The goal of a rational agent is to find an action that maximizes its expected long-term reward after a given history. The function that selects an action a given a history h_t is called a policy. A parametric stochastic policy is a function defined by π(a, h_t, θ) = Pr(a_{t+1} = a | h_t, θ), where θ is a vector of real-valued parameters θ = (θ_1, θ_2, ..., θ_n). Finite-State Controllers are an efficient graphical model used to represent stochastic policies without memorizing all the histories (Meuleau et al., 1999). An FSC is defined by: a finite set of internal states G; a set of action-selection functions µ^{o,a}, where µ^{o,a}(g, θ) is the probability that the agent executes action a if its I-state is g, its last observation was o, and the parameters of the policy are θ; and a set of transition functions ψ^o, where ψ^o(g, g', θ) is the probability that the agent selects the I-state g' if the current I-state is g and the perceived observation is o.

Contrary to the states of a POMDP, the I-states of an FSC are completely observable; thus, a history h_t should be a sequence of actions, observations and sampled I-states. However, the cumulated reward of a given history depends only on the executed actions and the perceived observations. Moreover, the same sequence of actions and observations can be used to update the parameters (µ and ψ) of all the I-states that may have generated that sequence, with different likelihoods. Shelton (2001) and Aberdeen and Baxter (2002) proposed a Rao-Blackwellized version of FSCs where internal belief states are calculated at each step instead of sampling a single I-state. The Rao-Blackwellization technique is known to reduce the variance of Monte Carlo estimators (Casella & Robert, 1996). The internal belief state is a vector b_t(., θ), where b_t(g, θ) = Pr(g_t = g | h_t, θ), ∀g ∈ G. The initial internal belief state b_0 is given in the parameters θ. At each step, the internal belief state is updated by Bayes' rule as in POMDPs, and used to sample an action to be executed.
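The Rao-Blackwellized FSC just described can be sketched as follows, assuming the functions ψ^o and µ^{o,a} are stored as stochastic matrices. The names (FSCPolicy, psi, mu) are illustrative placeholders, not the paper's notation.

```python
import numpy as np

class FSCPolicy:
    """Rao-Blackwellized FSC: propagates an internal belief over I-states."""
    def __init__(self, psi, mu, b0):
        self.psi = psi   # psi[o] is a |G| x |G| matrix: psi[o][g, g2] = psi^o(g, g2, theta)
        self.mu = mu     # mu[o] is a |A| x |G| matrix: mu[o][a, g] = mu^{o,a}(g, theta)
        self.b = np.asarray(b0, dtype=float)  # internal belief b_t(., theta)

    def action_probabilities(self, o):
        # Pr(a | h_t, o, theta) = sum_{g, g'} b_t(g) psi^o(g, g', theta) mu^{o,a}(g', theta)
        next_g = self.b @ self.psi[o]   # distribution over the next I-state g'
        return self.mu[o] @ next_g      # one probability per action

    def step(self, o, rng):
        probs = self.action_probabilities(o)
        a = rng.choice(len(probs), p=probs)
        # Bayes update of the internal belief given the observation o and the chosen action a
        posterior = (self.b @ self.psi[o]) * self.mu[o][a]
        self.b = posterior / posterior.sum()
        return a

# usage sketch: rng = np.random.default_rng(0); a = policy.step(o, rng)
```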
2.4. PSR Policies

PSRs can be used to represent policies instead of the environment dynamics by switching the roles of actions and observations (Wiewiora, 2005). In fact, a test q can be redefined as an ordered sequence of (observation, action) couples, i.e. q = o^1 a^1 ... o^k a^k. Given the vector of policy parameters θ, the probability of q starting after h_t is redefined as:

    Pr(q^a | h_t, q^o, θ) = Pr(a_{t+1} = a^1, ..., a_{t+k} = a^k | h_t, o_{t+1} = o^1, ..., o_{t+k} = o^k, θ)    (6)

As for the environment dynamics, the probability Pr(q^a | h_t, q^o, θ) of any test q starting after a history h_t is given by a linear combination of the probabilities of the same test q starting after the different core histories h ∈ H. The internal belief state b_t corresponds to the weights of these core histories in Pr(q^a | h_t, q^o, θ). In particular, the probability of executing action a at time t after observing o is given by:

    Pr(a | h_t, o, θ) = ∑_{h∈H} b_t(h, θ) Pr(a | h, o, θ) = b_t^T m^{oa}    (7)

The core histories of a policy are independent from the core histories of the controlled environment. We will use H only to indicate the policy core histories, since no model of the environment is used in our approach. After receiving an observation o and executing an action a, the internal belief state b_t(., θ) is updated by:

    b_{t+1}(h, θ) = ( ∑_{h'∈H} b_t(h', θ) b_{h'oa}(h, θ) m^{oa}(h', θ) ) / ( ∑_{h'∈H} b_t(h', θ) m^{oa}(h', θ) )    (8)

A PSR policy is defined by: the set of core histories H, and the vectors m^{oa} and b_{hoa}, ∀o ∈ O, a ∈ A, h ∈ H. The initial history h_0 = ∅ is always a core history; thus b_0(∅, θ) = 1 and ∀h ∈ H − {∅}: b_0(h, θ) = 0.

3. Reinforcement Learning in POMDPs

There are two major families of reinforcement learning algorithms that can be used to find the optimal parameters of a policy. In the first family, a value function is used to estimate the expected long-term reward of each action in every state of the environment (which is a history h_t in our case). The parameters of the policy are adjusted so that, in each state, the actions with the highest estimated long-term reward are executed more frequently. In the second family, the optimal policy is gradually learned by searching directly in the space of parameters. The parameters are updated according to the immediate reward after executing each action, or according to the cumulated reward at the end of each trial. Most policy gradient algorithms for POMDPs, such as GAPS (Peshkin, 2001) or GPOMDP (Baxter & Bartlett, 2000), belong to this family.

Actor-critic algorithms (Sutton et al., 2000; Peters & Schaal, 2006; Aberdeen et al., 2007) combine the advantages of both value-function and policy search methods. The actor part maintains a parametric policy that is gradually improved according to the evaluation provided by the critic part. The critic learns a value function whose variables correspond to the parameters of the policy. In the following section, we describe a general method where the critic part learns unbiased estimates of immediate rewards, which are used to calculate the gradient of the value function with respect to the policy parameters. We show how history-action Q-values are eliminated from the expression of the gradient and replaced by immediate rewards. Finally, we present the gradient expression for both cases where the policy is represented by an FSC and where the policy is represented by a PSR.
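The overall actor-critic scheme can be summarized by the following skeleton. It is only a sketch under assumed interfaces: env (with reset and step), policy (with reset, action_probabilities, update_belief and a parameter vector theta) and a gradient estimator such as the one detailed in the next section; none of these names come from the paper.

```python
import numpy as np

def run_trial(env, policy, horizon, rng):
    """Execute one trial with the current policy and record (o, a, r) triples."""
    policy.reset()
    o = env.reset()
    trajectory = []
    for _ in range(horizon):
        probs = policy.action_probabilities(o)
        a = rng.choice(len(probs), p=probs)
        policy.update_belief(o, a)
        o_next, r, done = env.step(a)
        trajectory.append((o, a, r))
        o = o_next
        if done:
            break
    return trajectory

def actor_critic(env, policy, estimate_gradient, horizon,
                 n_trials=50, n_iterations=100, step_size=0.1, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_iterations):
        # critic: collect trials and estimate the gradient of V(h0, theta) (cf. Eq. 12)
        trials = [run_trial(env, policy, horizon, rng) for _ in range(n_trials)]
        gradient = estimate_gradient(trials, policy)
        # actor: ascend the estimated gradient (cf. Eq. 15)
        policy.theta = policy.theta + step_size * gradient
    return policy
```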
4. A General Actor-Critic Approach

The actor is a stochastic policy with a vector of parameters θ; the critic is a function that returns the gradient of the value function V(h_0, θ), which is the total expected discounted reward when executing a policy parameterized by θ after a starting history h_0:

    V(h_0, θ) = ∑_{a∈A} Pr(a | h_0, θ) Q(h_0, a, θ)

where the history-action Q-values are given by:

    Q(h_t, a, θ) = R(a | h_t) + γ ∑_{o∈O} ∑_{a'∈A} Pr(o a' | h_t a, θ) Q(h_t a o, a', θ)    (9)

4.1. Gradient Expression

The following derivation is an adaptation to POMDPs of the proof of the policy gradient theorem for MDPs proposed by Sutton et al. (2000). We indicate by h^a_t the actions of a history h_t and by h^o_t its observations.

    ∂V(h_0, θ)/∂θ_i = ∂/∂θ_i ∑_{a∈A} Pr(a | h_0, θ) Q(h_0, a, θ)
      = ∑_{a∈A} [ (∂Pr(a | h_0, θ)/∂θ_i) Q(h_0, a, θ) + Pr(a | h_0, θ) ∂Q(h_0, a, θ)/∂θ_i ]

Substituting Q(h_0, a, θ) with Equation (9), and repeating the substitution for H steps, yields:

    ∂V(h_0, θ)/∂θ_i = ∑_{t=0}^{H-1} γ^t ∑_{h_t∈{A×O}^t} ∑_{a∈A} Pr(h_t | θ) (∂Pr(a | h_t, θ)/∂θ_i) Q(h_t, a, θ)    (10)

Applying Bayes' rule, Pr(a | h_t, θ) = Pr(h^a_t a | h^o_t, θ) / Pr(h^a_t | h^o_t, θ), gives:

    ∂Pr(a | h_t, θ)/∂θ_i = (1 / Pr(h^a_t | h^o_t, θ)) [ ∂Pr(h^a_t a | h^o_t, θ)/∂θ_i − Pr(a | h_t, θ) ∂Pr(h^a_t | h^o_t, θ)/∂θ_i ]    (11)

After replacing this latter formula within Equation (10), and adequately reordering the terms (details are omitted due to the lack of space), we find:

    ∂V(h_0, θ)/∂θ_i = ∑_{t=0}^{H-1} γ^t ∑_{h_t∈{A×O}^t} ∑_{a∈A} (∂Pr(h^a_t a | h^o_t, θ)/∂θ_i) (Pr(h_t a | θ) / Pr(h^a_t a | h^o_t, θ)) [ Q(h_t, a, θ) − γ ∑_{o∈O} ∑_{a'∈A} Pr(o a' | h_t a, θ) Q(h_t a o, a', θ) ]

By Equation (9), the bracketed term equals R(a | h_t), and Pr(h_t a | θ) / Pr(h^a_t a | h^o_t, θ) = Pr(h^o_t | h^a_t). Thus:

    ∂V(h_0, θ)/∂θ_i = ∑_{t=0}^{H-1} γ^t ∑_{h_t∈{A×O}^t} ∑_{a∈A} (∂Pr(h^a_t a | h^o_t, θ)/∂θ_i) Pr(h^o_t | h^a_t) R(a | h_t)    (12)

Notice that the term Pr(h^o_t | h^a_t) R(a | h_t) is independent of θ. It can be learned by a simple Monte Carlo method with importance sampling, using a look-up table:

    P̂r(h^o_t | h^a_t) = (1 / Pr(h^a_t | h^o_t, θ)) (#h_t / N),    R̂(a | h_t) = (1 / #h_t) ∑_{i=1}^{N} r_{i,t} δ_{h_t, h_{i,t}} δ_{a, a_{i,t+1}}

where N is the number of trials, a_{i,t} and r_{i,t} are respectively the action and the reward observed at time t in trial i, h_{i,t} is the history at time t in trial i, and #h_t counts the trials in which h_t occurred at time t. Now, we show how to calculate the other term, ∂Pr(h^a_t a | h^o_t, θ)/∂θ_i, depending on the model of the policy.

4.2. Gradient Estimation for FSCs

If we use an FSC to represent the policy, then:

    Pr(h^a_t | h^o_t, θ) = b_0^T(., θ) M^{o_1 a_1} ... M^{o_t a_t} e    (13)

where M^{o_j a_j}(g, g') = ψ^{o_j}(g, g', θ) µ^{o_j,a_j}(g', θ) and e^T = (1, 1, ..., 1). Thus:

    ∂Pr(h^a_t | h^o_t, θ)/∂θ_i = ∑_{j=1}^{t} ∑_{g∈G} ∑_{g'∈G} α_{j-1}(g, θ) β_{j+1}(g', θ) ∂(ψ^{o_j}(g, g', θ) µ^{o_j,a_j}(g', θ))/∂θ_i + ∑_{g∈G} β_1(g, θ) ∂b_0(g, θ)/∂θ_i

where

    α_0(g, θ) = b_0(g, θ)
    α_i(g, θ) = b_0^T(., θ) M^{o_1 a_1} ... M^{o_i a_i}(., g)
    β_i(g, θ) = M^{o_i a_i}(g, .) M^{o_{i+1} a_{i+1}} ... M^{o_t a_t} e
    β_{t+1}(g, θ) = 1

This latter expression of the gradient corresponds to the expression proposed in (Shelton, 2001).

4.3. Gradient Estimation for PSR Policies

If we use a PSR to represent the policy, then:

    Pr(h^a_t | h^o_t, θ) = b_0^T(., θ) M^{o_1 a_1} ... M^{o_t a_t} e    (14)

where M^{o_j a_j}(h, h') = m^{o_j a_j}(h, θ) b_{h o_j a_j}(h', θ) and e^T = (1, 1, ..., 1). Therefore:

    ∂Pr(h^a_t | h^o_t, θ)/∂θ_i = ∑_{j=1}^{t} ∑_{h∈H} ∑_{h'∈H} α_{j-1}(h, θ) β_{j+1}(h', θ) ∂(m^{o_j a_j}(h, θ) b_{h o_j a_j}(h', θ))/∂θ_i

where

    α_0(h, θ) = b_0(h, θ)
    α_i(h, θ) = b_0^T(., θ) M^{o_1 a_1} ... M^{o_i a_i}(., h)
    β_i(h, θ) = M^{o_i a_i}(h, .) M^{o_{i+1} a_{i+1}} ... M^{o_t a_t} e
    β_{t+1}(h, θ) = 1
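Equation (14) and its derivative can be computed with one forward and one backward pass over the step matrices M^{o_j a_j}, exactly as in the α/β recursions above. The sketch below does this for a PSR policy; m[o][a] and b_ext[h][o][a] are assumed numpy arrays holding m^{oa}(., θ) and b_{hoa}(., θ) (names are illustrative). The chain rule then maps the returned matrix gradients onto the individual parameters, since M^{oa}(h, h') = m^{oa}(h, θ) b_{hoa}(h', θ).

```python
import numpy as np

def step_matrix(m, b_ext, o, a, n):
    # M^{oa}(h, h') = m^{oa}(h, theta) * b_{hoa}(h', theta)
    M = np.zeros((n, n))
    for h in range(n):
        M[h, :] = m[o][a][h] * np.asarray(b_ext[h][o][a], dtype=float)
    return M

def prob_and_matrix_grads(history, m, b_ext, b0):
    """Return Pr(h^a_t | h^o_t, theta) and d Pr / d M^{o_j a_j}(h, h') for each step j."""
    n = len(b0)
    Ms = [step_matrix(m, b_ext, o, a, n) for (o, a) in history]   # history = [(o_1, a_1), ...]
    # forward vectors: alphas[j] = alpha_j = b_0^T M^{o_1 a_1} ... M^{o_j a_j}
    alphas = [np.asarray(b0, dtype=float)]
    for M in Ms:
        alphas.append(alphas[-1] @ M)
    # backward vectors: after the reversal, betas[j] = beta_{j+1} = M^{o_{j+1} a_{j+1}} ... M^{o_t a_t} e
    betas = [np.ones(n)]
    for M in reversed(Ms):
        betas.append(M @ betas[-1])
    betas.reverse()
    prob = float(alphas[-1] @ np.ones(n))
    # d Pr / d M_j(h, h') = alpha_{j-1}(h) * beta_{j+1}(h')
    grads = [np.outer(alphas[j], betas[j + 1]) for j in range(len(Ms))]
    return prob, grads
```

The same routine works for an FSC by swapping in the FSC step matrices M^{o_j a_j}(g, g') = ψ^{o_j}(g, g', θ) µ^{o_j,a_j}(g', θ).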
4.4. Updating the Parameters of the Policy

The gradient calculated by the critic (Equation 12) is used to update the parameters θ_i:

    θ_i ← θ_i + λ ∂V(h_0, θ)/∂θ_i    (15)

where the gradient step size λ ∈ [0, 1]. Policy gradient methods are only guaranteed to converge to a local optimum. The probability of reaching a global optimum depends on the complexity (or the shape) of the value function. In the following section, we show that the value function is a polynomial, with a smaller degree when the policy is represented by a PSR.

5. Complexity of the Value Function

The functions ψ^o and µ^{o,a} return probability distributions; they are usually represented as softmax functions. The analysis of the value function is more difficult with this representation, since it involves exponentials and fractions. For the purpose of comparing with PSRs, and without loss of generality, we consider each ψ^o(g, g') and each µ^{o,a}(g) as a different parameter, so ψ^o(g, g', θ) = θ_{g,o,g'} and µ^{o,a}(g, θ) = θ_{o,g,a}; we also consider b_0(g, θ) = θ_g. For PSRs, the vectors b_{hoa} are not stochastic, therefore each b_{hoa}(h') is a different parameter, so b_{hoa}(h', θ) = θ_{hoa,h'}. Similarly, we put m^{oa}(h, θ) = θ_{oa,h}.

Notice now that the function returning Pr(h^a_t | h^o_t, θ) (Equations 13 and 14) is a multivariate polynomial of degree less than or equal to 2t + 1. The variables of this polynomial correspond to the parameters θ_{h o_i a_i, h'} and θ_{o_j a_j, h} for a PSR policy, and to θ_{g,o_j,g'}, θ_{o_j,g,a_j} and θ_g for an FSC. Unless the structure of the FSC is provided a priori, the graph of the FSC is generally completely connected, i.e. ∀a, o, g, g': ψ^o(g, g', θ) µ^{o,a}(g', θ) > 0. Thus, we have α_j(g, θ) > 0 and β_j(g', θ) > 0 for every I-state g and every step j ≤ t; the degree of the polynomial Pr(h^a_t | h^o_t, θ) for the FSC is then equal to 2t + 1.

To reduce the degree of the value function, Aberdeen and Baxter (2002) proposed reducing the out-degree of the FSC graph and augmenting the number of I-states. However, convergence can be delayed in that case, given that more parameters must be learned. The main advantage of PSRs comes from the fact that, contrary to I-states, the core histories are contained within the sequence of actions and observations, and no transition probabilities are used to calculate the probability of a core history sequence. Namely, when a prefix sequence o_1 a_1 ... o_i a_i of the history h_t = o_1 a_1 ... o_i a_i ... o_t a_t corresponds to a core history h_i, then the term b_0^T(., θ) M^{o_1 a_1} ... M^{o_i a_i} in Equation (14) can be replaced by a vector α_i(., θ) where α_i(h_i, θ) = Pr(h^a_i | h^o_i, θ) and α_i(h, θ) = 0 for h ≠ h_i (α_i is the "unnormalized" belief at time i). We have:

    Pr(h^a_i | h^o_i, θ) = Pr(a_1 | h_0, o_1, θ) ... Pr(a_i | h_{i-1}, o_i, θ)

From the construction of PSRs (Littman et al., 2002), we know that all the prefixes of a core history are also core histories, therefore:

    Pr(h^a_i | h^o_i, θ) = θ_{o_1 a_1, h_0} θ_{o_2 a_2, h_1} ... θ_{o_i a_i, h_{i-1}}

This latter term is a polynomial of degree i. Hence, the degree of the polynomial Pr(h^a_t | h^o_t, θ) (Equation 14) is at most equal to 2t − i. Consequently, the degree of the value function polynomial is generally smaller when the policy is represented by a PSR than when it is represented by an FSC, and the number of local optima is also smaller, as we will see in the following example.

5.1. Example

Figure 1 shows a simple toy problem with 3 deterministic actions, U (Up), L (Left) and R (Right), and two deterministic observations, o1 (lower states) and o2 (upper states) (Figure 1.a). There are also two final states marked with positive rewards. Figure 1.b presents a parametric FSC with three I-states: g1 is the initial state, g2 and g3 are final states. We focus on only two parameters, since functions with more than 2 variables cannot be represented graphically. The first parameter θ1 corresponds to the probability of going from state g1 to state g2 after observing o2; 1 − θ1 is the probability of going to g3. The second parameter θ2 corresponds to the probability of executing action L in state g2 after observing o2, and 1 − θ2 is the probability of executing action R in g3.

The value function of this FSC for horizon 2 is given by V^FSC(h_0, θ) = (1/2) θ1 θ2 + (1 − θ1)(1 − θ2); it has one global maximum of value 1 and one local maximum of value 1/2 (Figure 1.d). The equivalent PSR policy (Figure 1.c) can be represented by one core history, ∅, and three parameters: θ1 = m^{o2 L}(∅), θ2 = m^{o2 R}(∅) and θ3 = m^{o2 U}(∅). The value function of this policy is given by V^PSR(h_0, θ) = (1/2) θ1 + θ2; it has only a global maximum (Figure 1.d) within the simplex of valid parameters, i.e. θ1 + θ2 ≤ 1. This gain is due to the fact that the history h1 = o1 U is equivalent to ∅ (all the histories are equivalent to ∅ when H = {∅}), thus b_{h1}(∅) = 1 (degree 0) and the probability of action L after h1 and o2, for example, is Pr(L | h1, o2) = θ1 (degree 1). For the FSC, in contrast, we have b_{h1}(g2) = θ1, b_{h1}(g3) = 1 − θ1 and Pr(L | h1, o2) = θ1 θ2 (degree 2).

[Figure 1. The value functions of an FSC (polynomial of degree 2) and its equivalent PSR (polynomial of degree 1). Panel (a): the toy maze with the start state and observations o1/o2; panel (b): the three-state FSC with Pr(g2 | g1, o2) = θ1, Pr(g3 | g1, o2) = 1 − θ1, Pr(U | o1) = 1 in g1, Pr(L | o2) = θ2 and Pr(U | o2) = 1 − θ2 in g2, Pr(R | o2) = 1 − θ2 and Pr(U | o2) = θ2 in g3; panel (c): the equivalent PSR policy with h_0 = ∅, H = {∅}, b_{∅ o1 U}(∅) = 1, m^{o2 L}(∅) := Pr(L | o2, θ) = θ1, m^{o2 R}(∅) := Pr(R | o2, θ) = θ2, m^{o2 U}(∅) := Pr(U | o2, θ) = θ3; panel (d): surface plots of V^FSC = (1/2) θ1 θ2 + (1 − θ1)(1 − θ2) and V^PSR = (1/2) θ1 + θ2.]
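The difference between the two value functions of the example can be checked numerically. The sketch below runs plain gradient ascent on V^FSC and V^PSR from random starting points; for brevity the simplex constraint θ1 + θ2 ≤ 1 of the PSR is replaced by a simple clip to the unit square, which does not affect the point being illustrated (the linear V^PSR has no spurious local maximum, while V^FSC can trap the ascent in the local maximum of value 1/2).

```python
import numpy as np

def ascend(grad, theta, steps=5000, lr=0.01):
    for _ in range(steps):
        theta = np.clip(theta + lr * grad(theta), 0.0, 1.0)   # stay inside [0, 1]^2
    return theta

# V_FSC(t) = 0.5*t1*t2 + (1 - t1)*(1 - t2);  V_PSR(t) = 0.5*t1 + t2
grad_fsc = lambda t: np.array([0.5 * t[1] - (1.0 - t[1]), 0.5 * t[0] - (1.0 - t[0])])
grad_psr = lambda t: np.array([0.5, 1.0])
v_fsc = lambda t: 0.5 * t[0] * t[1] + (1.0 - t[0]) * (1.0 - t[1])

rng = np.random.default_rng(0)
for _ in range(5):
    t0 = rng.random(2)
    end_fsc = ascend(grad_fsc, t0)
    end_psr = ascend(grad_psr, t0)
    print("start", np.round(t0, 2),
          "-> V_FSC reached", round(v_fsc(end_fsc), 2),   # typically 1.0 or 0.5, depending on the basin
          ", V_PSR endpoint", np.round(end_psr, 2))        # always the same corner
```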
6. Constraining PSR Belief States

The belief state of a PSR is not a probability distribution; it is instead subject to a set of constraints that should be satisfied at any time t:

    C^1_t: ∀o ∈ O, a ∈ A: ∑_{h∈H} b_t(h, θ) m^{oa}(h, θ) ≥ 0;
    C^2_t: ∀o ∈ O: ∑_{a∈A} ∑_{h∈H} b_t(h, θ) m^{oa}(h, θ) = 1;

The number of reachable belief states b_t is exponential w.r.t. the horizon H, while only the vectors m^{oa} and b_{hoa} appear in the parameters of the policy. Thus, we should define another set of constraints on m^{oa} and b_{hoa} to ensure that {C^1_t, C^2_t} will be satisfied after an arbitrary number of Bayesian updates. The existence of such constraints depends on the convexity of the Bayesian update function inside the convex hull of valid belief states, and to our knowledge, this problem has not been studied before in the context of PSRs. Nevertheless, we define a set of weaker constraints on m^{oa} and b_{hoa} that can keep the beliefs b_t within the hull of valid weights for short horizons:

    C^1_0: ∀o ∈ O, a ∈ A, h ∈ H: m^{oa}(h, θ) ≥ 0;
    C^2_0: ∀o ∈ O, h ∈ H: ∑_{a∈A} m^{oa}(h, θ) = 1;
    C^3_0: ∀o ∈ O, a ∈ A, h ∈ H: ∑_{h'∈H} b_{hoa}(h', θ) = 1;

Constraints C^1_0 and C^2_0 can be satisfied by using a Boltzmann distribution for each core history h and each observation o. Constraint C^3_0 is justified by the fact that ∀t ∈ {0, ..., H}: ∑_{h∈H} b_t(h, θ) = 1. Indeed, for any observation o, we have:

    ∑_{h∈H} b_t(h, θ) = ∑_{h∈H} b_t(h, θ) ∑_{a∈A} m^{oa}(h, θ) = ∑_{a∈A} ∑_{h∈H} b_t(h, θ) m^{oa}(h, θ) = ∑_{a∈A} Pr(a | h_t, o, θ) = 1

To satisfy C^3_0, we project the gradient ∂V(h_0, θ)/∂θ_{hoa,h'} onto the hyperplane defined by C^3_0 by solving a system of linear equations; the projected values are given by:

    ∂V(h_0, θ)/∂θ_{hoa,h'} ← (1/2) ∂V(h_0, θ)/∂θ_{hoa,h'} − (1/(2|H|)) ∑_{h''∈H} ∂V(h_0, θ)/∂θ_{hoa,h''}    (16)
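A gradient step that preserves the constraint C^3_0 can be implemented by removing, for each block θ_{hoa,.}, the component of the gradient along the all-ones direction, so that the entries keep summing to 1. The sketch below uses the plain Euclidean projection (subtracting the componentwise mean); the scaling in Equation (16) may differ slightly, but the idea is the same.

```python
import numpy as np

def project_gradient(grad_block):
    """Project the gradient of one block theta_{hoa, .} onto the hyperplane sum(theta) = 1."""
    g = np.asarray(grad_block, dtype=float)
    return g - g.mean()          # the update no longer changes the sum of the block

# usage sketch: theta_hoa += step_size * project_gradient(dV_dtheta_hoa)
```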
7. Discovery of Core Histories

Initially, the set H contains only the empty history ∅, the vectors m^{oa} are initialized to uniform distributions over the actions a for each observation o, and all the histories are considered as equivalent to ∅, i.e. θ_{∅oa,∅} := b_{∅oa}(∅, θ) = 1 (because of the constraint C^3_0). As the parameters are updated, new core histories are iteratively discovered and added to H.

A history hoa (which is an extension of a core history h ∈ H) is considered a core history if there is at least one test q such that Pr(q^a | hoa, q^o) cannot be written as a linear combination of the probabilities Pr(q^a | h', q^o), h' ∈ H. Since this criterion cannot be verified for every possible test q, we instead use a set of heuristic measures as an indicator of the predictability of the history hoa. The first one is based on the distance d_{hoa} between the vector θ_{hoa}, which is updated by using the gradient calculated in Equation (12), and the corrected vector θ̄_{hoa}, which is updated by using the projected gradient of Equation (16) (i.e. d_{hoa} is the correction made on θ_{hoa} to make it satisfy the constraint C^3_0):

    d_{hoa}^2 = ∑_{h'∈H} (θ_{hoa,h'} − θ̄_{hoa,h'})^2

The second measure is the entropy e_{ho} of the distribution over the actions a conditioned on h and o:

    e_{ho} = − ∑_{a∈A} Pr(a | h, o) ln Pr(a | h, o)

Finally, a history hoa cannot be added to H if it has not been tested enough. Thus, our third measure is P̂r(hoa | θ) = #hoa / N, where N is the number of trials. A history hoa is considered a new core history iff:

    (d_{hoa} ≥ ε_d) ∧ (e_{ho} ≥ ε_e) ∧ (P̂r(hoa | θ) ≥ ε_p)

where ε_d, ε_e and ε_p are predefined thresholds. The probabilities m^{oa}(h', θ) of a new core history h' are initialized to a uniform distribution, whereas all the extensions of h' are considered as equivalent to h', i.e. b_{h'o'a'}(h', θ) = 1 and b_{h'o'a'}(h'', θ) = 0 for h'' ≠ h'.
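The three heuristic measures and the acceptance test can be written directly from the definitions above. The sketch below assumes the candidate's raw and corrected parameter vectors, the action distribution after (h, o), and the visit count of hoa are already available; the function name and the default threshold values (taken from Section 8) are only illustrative.

```python
import numpy as np

def is_new_core_history(theta_hoa, theta_hoa_corrected, action_probs_ho,
                        count_hoa, n_trials, eps_d=0.1, eps_e=0.0, eps_p=0.1):
    # d_{hoa}: correction made on theta_{hoa} by the projected gradient
    diff = np.asarray(theta_hoa) - np.asarray(theta_hoa_corrected)
    d = float(np.sqrt(np.sum(diff ** 2)))
    # e_{ho}: entropy of the action distribution after (h, o)
    p = np.asarray(action_probs_ho, dtype=float)
    e = float(-np.sum(p[p > 0] * np.log(p[p > 0])))
    # empirical probability of visiting the extension hoa
    p_hoa = count_hoa / float(n_trials)
    return (d >= eps_d) and (e >= eps_e) and (p_hoa >= eps_p)
```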
8. Experiments

We test the policy gradient approach described in Section 4 on small standard problems taken from A. R. Cassandra's POMDP data repository (on the web), using PSRs and FSCs as models of the policies. We use softmax functions to represent the distributions over actions in both PSRs (with the vectors m^{oa}) and FSCs (with the functions µ^{o_j,a_j}), and to represent the distributions over the next I-states in FSCs (with the functions ψ^{o_j}). We use the same temperature τ for these functions; τ is initialized to 0.1 for the 4 × 4 maze problem, to 1 for the Cheese maze and Shuttle problems, and to 10 for the Network problem. For all these problems, τ is decreased by a constant factor of 0.999. The number of I-states |G| is the only hyper-parameter for FSCs; it is chosen by cross-validation for each problem. We found that the best value of |G| is 6 for 4 × 4 maze, 3 for Cheese maze and Shuttle, and 8 for Network. The parameters of the transition functions ψ^{o_j}, used by the softmax function, are randomly initialized to values between 0 and 1. The parameters of the action-selection functions µ^{o_j,a_j} are initialized to 0 (a uniform distribution). The functions µ^{o_j,a_j} and ψ^{o_j} cannot all be initialized to uniform distributions: this results in uniform beliefs and a zero gradient (Aberdeen & Baxter, 2002). The hyper-parameters of PSR policies are the thresholds ε_p, ε_e and ε_d (see Section 7). ε_p is set to 0.1 for all the problems, while ε_d is set to 0.1 for 4 × 4 maze, to 6 for Network, and to 0 for Cheese maze and Shuttle. ε_e is set to 0 for 4 × 4 maze and Network, and to 1.5 for Cheese maze and Shuttle. Notice that these thresholds are used to control the size of H: using smaller thresholds leads to more core histories, and vice versa. They are found by cross-validation, as for |G| in FSCs. Finally, we use the same constant gradient step λ for both PSRs and FSCs; λ is set to 1 for the 4 × 4 and Cheese mazes (problems with small rewards), and to 0.1 for Network and Shuttle (problems with high rewards).

Figures 2-5 show the average reward per step of FSC and PSR, and the average number of discovered core histories per step for PSR. The results are averaged over 10 independent runs. For all these problems, the policy gradient algorithm converged to locally optimal values for both models of policies. However, we notice that the final value is slightly higher when we use a PSR to represent the policy, which is expected from the theoretical analysis of Section 5. The outperformance of PSR policies is more pronounced in problems with a smaller number of histories (|A||O| = 8 in 4 × 4 maze and Network, 15 in Shuttle, and 28 in Cheese maze). Indeed, discovering the accurate set of core histories becomes more difficult when the branching factor |A||O| is higher. Finally, we notice that the learning algorithm for PSRs converged rapidly in the Network problem; this is explained by the fact that all the core histories were discovered within the first 30 trials.

[Figure 2. 4x4 maze results: average reward per step vs. learning step for the PSR policy, the FSC policy and the optimal policy, together with the number of discovered core histories.]
[Figure 3. Network results: same quantities as Figure 2.]
[Figure 4. Cheese maze results: same quantities as Figure 2.]
[Figure 5. Shuttle results: same quantities as Figure 2.]

9. Discussion

We have shown that PSRs are an alternative to Finite-State Controllers in policy gradient methods for POMDPs. The internal states of PSR policies are based on observable sequences of interaction with the environment, called core histories. Two main advantages result from this property. The first one is related to finite-horizon problems, where the degree of the value function polynomial is reduced by the length of the longest core history observed in a given trial. The second advantage is the possibility of discovering new core histories, based on the predictability of the extensions of previous core histories. We used different heuristics, based on the entropy of the action distribution and the correction of the gradient, as indicators of predictability, but more sophisticated techniques should be investigated (Makino & Takagi, 2008). However, it is unclear how PSR policies will perform in infinite-horizon problems, where the value function is defined only on the stationary belief state and the core histories cannot be observed. This problem can be treated by using a mechanism for detecting reset points. In particular, one can combine core histories and core tests, and use the former to detect the reset points for the latter. The other issue related to infinite horizons is the stability of the belief states. We defined some constraints that can keep the belief within the hull of valid parameters for short horizons, but the general solution remains an open problem.

References

Aberdeen, D., & Baxter, J. (2002). Scaling Internal-State Policy-Gradient Methods for POMDPs. Proc. 19th Int. Conf. Machine Learning (pp. 3–10).

Aberdeen, D., Buffet, O., & Thomas, O. (2007). Policy-Gradients for PSRs and POMDPs. Proc. 11th Int. Conf. Artificial Intelligence and Statistics.

Baxter, J., & Bartlett, P. (2000). Reinforcement Learning in POMDP's via Direct Gradient Ascent. Proc. 17th Int. Conf. Machine Learning (pp. 41–48).

Casella, G., & Robert, C. P. (1996). Rao-Blackwellisation of Sampling Schemes. Biometrika, 15, 229–235.

Littman, M., Sutton, R., & Singh, S. (2002). Predictive Representations of State. Advances in Neural Information Processing Systems 14 (pp. 1555–1561).

Makino, T., & Takagi, T. (2008). On-line Discovery of Temporal-Difference Networks. Proc. 25th Int. Conf. Machine Learning (pp. 632–639).

Meuleau, N., Peshkin, L., Kim, K., & Kaelbling, L. (1999). Learning Finite-State Controllers for Partially Observable Environments. Uncertainty in Artificial Intelligence: Proc. 15th Conf. (pp. 427–436).

Peshkin, L. (2001). Reinforcement Learning by Policy Search. Doctoral dissertation, Massachusetts Institute of Technology.
Peters, J., & Schaal, S. (2006). Policy Gradient Methods for Robotics. Proc. IEEE Int. Conf. Intelligent Robots and Systems (pp. 2219–2225).

Shelton, C. R. (2001). Importance Sampling for Reinforcement Learning with Multiple Objectives. Doctoral dissertation, Massachusetts Institute of Technology.

Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. Advances in Neural Information Processing Systems 12 (pp. 1057–1063).

Wiewiora, E. (2005). Learning Predictive Representations from a History. Proc. 22nd Int. Conf. Machine Learning (pp. 964–971).