Constructing States for Reinforcement Learning

M. M. Hassan Mahmud    hamahmud42@gmail.com
School of Computer Science, The Australian National University, Canberra, 0200 ACT, Australia

Abstract

POMDPs are the models of choice for reinforcement learning (RL) tasks where the environment cannot be observed directly. In many applications we need to learn the POMDP structure and parameters from experience, and this is considered to be a difficult problem. In this paper we address this issue by modeling the hidden environment with a novel class of models that are less expressive, but easier to learn and plan with, than POMDPs. We call these models deterministic Markov models (DMMs); they are deterministic probabilistic finite automata from learning theory, extended with actions to the sequential (rather than i.i.d.) setting. Conceptually, we extend the Utile Suffix Memory method of McCallum to handle long-term memory. We describe DMMs, give Bayesian algorithms for learning and planning with them, and present experimental results on standard POMDP benchmark tasks as well as on tasks designed to illustrate their efficacy.

1. Introduction

In this paper we derive a method to estimate the hidden structure of environments in general reinforcement learning problems. In such problems, at each discrete time step the agent takes an action and in turn receives just an observation and a reward. The goal of the agent is to take actions (i.e. plan) to maximize its future time-averaged discounted reward (Bertsekas & Shreve, 1996). Clearly, we need to impose some structure on the environment to solve this problem. The most popular approach in machine learning is to assume that the environment is a partially observable Markov decision process (POMDP), which is (essentially) a hidden Markov model with actions. Given the model (structure and parameters) of a POMDP, there exist effective heuristic algorithms for planning (see the survey by Ross et al. (2008)), although exact planning is undecidable in general (Madani et al., 2003). However, in many important problems the POMDP model is not available a priori and has to be learned from experience. While there are some promising new approaches (e.g. HDP-POMDPs (Doshi-Velez, 2009)), this problem is as yet unsolved (and is in fact NP-hard even under severe constraints (Sabbadin et al., 2007)).

One way to bypass this difficult learning problem is to consider simpler environment models. In particular, in this paper we assume that each history deterministically maps to one of finitely many states and that this state is a sufficient statistic of the history (McCallum, 1995; Shalizi & Klinkner, 2004; Hutter, 2009). Given this history-state map the environment becomes an MDP, which can then be used to plan. The learning problem is therefore to learn this map. Indeed, the well-known USM algorithm (McCallum, 1995) used Prediction Suffix Trees (PSTs) (Ron et al., 1994) for these maps (each history is mapped to exactly one leaf/state) and was quite successful in benchmark POMDP domains. However, PSTs lack long-term memory and had difficulty with noisy environments, and so USM was for the most part not followed up on. In our work we consider a Bayesian setup and replace PSTs with finite state machines, endowing the agent with long-term memory. The resulting model is a proper subclass of POMDPs, but hopefully retains the computational simplicity and efficiency that come with deterministic history-state maps.

We note that the belief states of POMDPs are also deterministic functions of the history. But this state space is infinite, and so POMDP model learning algorithms instead try to estimate the hidden states (see for instance (Doshi-Velez, 2009)). As a result, these methods are quite different from algorithms using deterministic history-state maps. Other notable methods for learning the environment model include PSRs (Littman et al., 2002); unfortunately, we lack the space to discuss these further. Superficially, finite state controllers for POMDPs (Kaelbling et al., 1998) seem closely related to our work, but these are not quite model learning algorithms: they are (powerful) planning algorithms that assume a hidden but known environment model (or, at the very least, implicitly, an estimate of the number of hidden states).

We now proceed as follows. We define a general RL environment and then our model, the deterministic Markov model (DMM), and show how to use it to model the environment, infer the state given a history and compute the optimal policy. We then describe our Bayesian inference framework and derive a Bayesian, heuristic Monte-Carlo style algorithm for model learning and planning. Finally, we describe experiments on standard benchmark POMDP domains and on some novel domains that illustrate the advantage of our method over USM. Due to lack of space, formal proofs of our results are given in (Mahmud, 2010); our focus here is on motivating our approach via discussion and experiments.

2. Modeling Environments

To recap, we model a general RL environment with our model, the DMM, and then use the MDP derived from the DMM to plan. In the following we introduce notation (Sect. 2.1), define a general RL environment (Sect. 2.2), define our model, the DMM (Sect. 2.3), and show how DMMs can model the environment and construct the requisite MDP to plan with (Sect. 2.4). We then describe the DMM inference criterion and the learning algorithm in Sect. 3 and 4.

2.1. Preliminaries

E_{P(x)}[f(x)] denotes the expectation of the function f with respect to the distribution P. We let A be a finite set of actions, O a finite set of observations and R ⊂ ℝ a finite set of rewards. We set H := (A × O)^* to be the set of histories and γ ∈ [0, 1) to be a fixed discount rate. We also need some notation for sequences over finite alphabets: x_{0:n} denotes a string of length n + 1, and ε denotes the empty history.

2.2. General RL Environments

A general RL environment, denoted grle, is defined by a tuple (A, R, O, RO, γ), where all quantities except RO are as defined above. RO defines the dynamics of the environment: at step t, when action a is applied, the next reward×observation pair is selected according to the probability RO(ro|h, a), where h = ao_{0:t-1} is the history before step t. We write the marginals of RO over R and O as R and O respectively.

Figure 1. A DMM with A = {a}, R := {0, 1} and O := {o, o', o''}. The edges are labeled with the z ∈ A × O that cause the transition, along with the probability of that transition (the per-state transition parameters). The reward distribution for each state (the per-state reward parameters) appears above that state.
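To make the history-to-state mapping behind Fig. 1 concrete, the following is a minimal sketch of a DMM-style model, assuming (as the figure suggests) that transitions are deterministic given an action-observation pair while rewards are drawn from per-state distributions. The class and field names are our own illustration, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class DMM:
    """Minimal sketch of a deterministic Markov model in the spirit of Fig. 1.

    `delta` maps (state, (action, observation)) deterministically to the next state,
    and `reward_dist` holds a per-(state, action) reward distribution.
    """
    start: int
    delta: dict        # (state, (action, observation)) -> next state
    reward_dist: dict  # (state, action) -> {reward: probability}

    def state_of(self, history):
        """Map a history [(a0, o0), (a1, o1), ...] to its unique state."""
        s = self.start
        for a, o in history:
            s = self.delta[(s, (a, o))]
        return s

    def expected_reward(self, state, action):
        return sum(r * p for r, p in self.reward_dist[(state, action)].items())

# A two-state example with one action 'a' and observations {'o', "o'"} (made-up numbers).
dmm = DMM(
    start=0,
    delta={(0, ('a', 'o')): 1, (0, ('a', "o'")): 0,
           (1, ('a', 'o')): 0, (1, ('a', "o'")): 1},
    reward_dist={(0, 'a'): {0: 0.4, 1: 0.6}, (1, 'a'): {0: 0.2, 1: 0.8}},
)
assert dmm.state_of([('a', 'o'), ('a', "o'")]) == 1  # every history maps to exactly one state
```

Given such a deterministic map, the environment can be treated as an MDP over these finitely many states, which is what the planner operates on.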
Since this is in fact often the case in RL problems, we need to put significant bias into our search algorithm. We introduce this bias by restricting the algorithm to a particular class of DMMs. Each DMM in this class is a tree in which the outgoing edges at each node may go only to a child or to an ancestor; only leaf nodes are allowed to transition to nodes that are not their descendants (see Fig. 2). This is similar to a looping suffix tree (Holmes & Isbell-Jr., 2006), but our model is not a suffix tree structure: each node in the graph is in fact a state. It is straightforward to show that every DMM has such a loopy representation, although possibly with exponentially many states. For typical POMDP domains this representation seems particularly suitable, as the dynamics in such domains have a neighborhood structure in which the agent can only transition back to states that were visited recently.

Our search algorithm (search, Alg. 2) is a stochastic search over this space with the posterior (10) as the selection criterion. At each step, search performs one of two possible operations: it either chooses a leaf node and extends it by adding another leaf node (extend, Alg. 3), or it chooses an internal or leaf node and loops one of its unextended transitions back to an ancestor (loop, Alg. 4). The modification prop is then accepted as the current estimate cur for the next iteration if it satisfies a Metropolis-Hastings (MH) style condition (line 5 of Alg. 2: cur is set to prop when the condition exceeds a uniform random draw rnd; after the final iteration, cur is returned). The condition contains a parameter, set to 25 in all our experiments, which simulates the proposal distribution ratio in the MH acceptance probability.

Algorithm 3 Function for Extending Edges
function: extend(cur, aro_{0:n})
1: Sample a leaf node/state s according to its frequency in s_{0:n}, where s_k is the state reached from the root after history ao_{0:k}.
2: Sample an outgoing edge ao at s according to the empirical (w.r.t. aro_{0:n}) next-step reward at ao.
3: Create a new state s_new with context x_s ao, where x_s is the context of s (the context of the root is empty).
4: Set the transition (s, ao) to s_new; for each a'o', set the transition (s_new, a'o') to arg max{length(x_{s'}) | x_{s'} is a suffix of x_{s_new}}.
5: Add s_new to the states of cur.
6: Return cur.

Algorithm 4 Function for Looping Edges
function: loop(cur, aro_{0:n})
1: Sample a leaf node/state s according to its frequency in s_{0:n}, where s_k is the state reached from the root after history ao_{0:k}.
2: Sample an outgoing, unextended edge ao at s according to the empirical (w.r.t. aro_{0:n}) next-step reward at ao.
3: Choose an ancestor s' of s with probability 2^{-m}/Z, where m is the number of ancestors of s between s and s'.
4: Set the transition (s, ao) to s'.
5: Return cur.

We now discuss the heuristics used in extend and loop to choose the nodes to modify. In line 1 of both procedures we sample states according to their frequency in the data, because we expect the impact on the likelihood to be greatest if we modify high-frequency states. In line 2 of both we sample an outgoing edge ao of the state s chosen in line 1 according to the average reward seen at s after taking action a and observing o. The intuition is that if the reward at that edge is high, it may be worthwhile to create a new state there (in extend) or redirect the edge to a different state (in loop) to obtain a better model. A more sophisticated version of this heuristic would use the future discounted reward at the edge. In lines 3 and 4, extend constructs the new state for the chosen edge. In line 3, loop chooses an ancestor to loop back to, choosing closer ancestors with higher probability, because in typical POMDP domains we expect the state to transition back to recently visited states. loop then loops the edge in line 4.
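For concreteness, here is a minimal sketch of the stochastic search loop described above, with extend and loop as proposal callbacks and an MH-style acceptance test. The function names, the log-posterior interface and the exact form of the acceptance condition are our assumptions, a sketch rather than the paper's Alg. 2; the constant 25 stands in for the proposal-ratio parameter mentioned in the text.

```python
import math
import random

BETA = 25.0  # simulates the proposal distribution ratio (set to 25 in the experiments)

def search(init_model, data, num_iters, log_posterior, extend, loop, rng=random):
    """Stochastic search over tree-structured DMMs (illustrative sketch).

    `log_posterior(m, data)` is assumed to return an (unnormalized) log posterior of
    model m; `extend(m, data)` and `loop(m, data)` are assumed to return a modified
    copy of m, in the spirit of Algorithms 3 and 4.
    """
    cur = init_model
    for _ in range(num_iters):
        # Propose a modification: extend a leaf, or loop an unextended edge to an ancestor.
        proposal = rng.choice([extend, loop])
        prop = proposal(cur, data)
        # MH-style acceptance, done in log space; BETA plays the role of the proposal ratio.
        if math.log(1.0 - rng.random()) < math.log(BETA) + \
                log_posterior(prop, data) - log_posterior(cur, data):
            cur = prop
    return cur
```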
5. Experiments

5.1. Setup

We ran experiments to illustrate that we do in fact extend McCallum's Utile Suffix Memory (USM) in desirable ways. We first ran experiments on three POMDP maze problems to evaluate our model on standard problems. We then ran experiments on three novel, non-episodic domains to test the long-term memory ability of our method. The maze domains are shown in Fig. 3 (numbered 1, 2 and 3 respectively) and have the same dynamics as in (McCallum, 1995), with state transition and observation noise.

Figure 3. The three POMDP mazes: the Half Cheese Maze (3 x 7), the Full Cheese Maze (5 x 7) and the Bigger Maze (8 x 9). S is the start state and G is the goal state.

The domains for the second type of experiments are a set of `harvesting' tasks with m crops. Crop i becomes ready to be harvested after being developed through k_i phases. Once development is complete it is ready to be harvested for just one time step, and then it spoils immediately, requiring it to be grown again. The agent has m + 1 actions. Action a_i develops crop i to its next phase with probability 0.7 and develops some other crop with probability 0.3. The remaining action harvests any crop that is ready to be harvested at that step. The observation at each step is the crop that was developed at that step. So this is essentially a counting task in which the agent has to remember which phase each crop is in, and hence a finite history suffix is not sufficient to make a decision. The agent receives a reward of -1.0 for each action and 5.0 for a successful harvest. There are no episodes; each problem instance essentially runs forever. We used several different values of [m, (k_1, k_2, ..., k_m)]: [4, (2, 2, 2, 2)], [5, (2, 2, 2, 2, 2)] and [7, (2, 2, 2, 2, 2, 2, 2)] (referred to as harvesting problems 1, 2 and 3 respectively from now on). The POMDP representations of these problems have, respectively, 4 · 2^4 = 64, 5 · 2^5 = 160 and 7 · 2^7 = 896 states. In each of these cases we let USM and DMM run for 3000 steps and report the total accumulated reward for each method, averaged over 10 trials.
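As a concrete illustration of the harvesting dynamics just described, here is a minimal simulator sketch; the class name, the encoding of observations and the exact timing of spoilage are our own interpretation, not the paper's code.

```python
import random

class HarvestEnv:
    """Minimal sketch of the m-crop harvesting task (illustrative, not the paper's code)."""

    def __init__(self, phases, seed=None):
        self.k = list(phases)        # k[i] = number of development phases crop i needs
        self.m = len(self.k)         # actions 0..m-1 develop crops; action m harvests
        self.rng = random.Random(seed)
        self.progress = [0] * self.m

    def step(self, action):
        reward = -1.0                # every action costs -1.0
        ready = [i for i in range(self.m) if self.progress[i] >= self.k[i]]
        obs = None                   # hypothetical "nothing developed" observation
        if action == self.m:         # harvest every crop that is ready this step
            for i in ready:
                reward += 5.0
                self.progress[i] = 0
        else:                        # develop the chosen crop w.p. 0.7, another w.p. 0.3
            crop = action if self.rng.random() < 0.7 else self.rng.choice(
                [i for i in range(self.m) if i != action])
            self.progress[crop] += 1
            obs = crop               # the observation is the crop that was developed
        for i in ready:              # a ready crop not harvested this step spoils
            if self.progress[i] >= self.k[i]:
                self.progress[i] = 0
        return obs, reward

# Harvesting problem 1 from the text: m = 4 crops with two phases each.
env = HarvestEnv([2, 2, 2, 2], seed=0)
obs, reward = env.step(0)            # develop a crop; reward is -1.0
```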
5.2. Results

The results for the three maze domains are given in Fig. 4; we report the median number of steps per trial to reach the goal state, to be comparable to (McCallum, 1995) (the performance of our implementation of USM matches that reported there). In these experiments, in Alg. 1, T is variable, equal to the number of steps until the agent reaches the goal, while N := 30 and K := 5000. As can be seen, our algorithm more or less matches USM, except at the beginning, where its number of steps is much higher. This is because our method is trying to learn a recursive model and hence gets confused when there is little data. Unsurprisingly, the average number of states estimated by our method was 45.3, 67.5 and 90.5, compared to 204.5, 231.5 and 623.5 for USM. However, our stochastic search required significantly longer off-line processing to learn (about 5 minutes compared to tens of seconds for USM). So for these types of episodic tasks it might be better to first learn a model using USM and then compress it to a DMM (Shalizi & Klinkner, 2004). Regardless, the experiments give evidence that DMMs are viable for modeling hidden RL environment structure and that our inference criterion (7) is correct.

Figure 4. Results for the maze domains; USM i is the number of steps to the goal state for maze i at the given trial, and similarly for DMM i.

The results for the three harvesting/counter domains are given in Fig. 5. Here the difference between the two methods becomes quite significant, with our method dominating USM completely. The difference in computation time was similar to that in the previous set of experiments. The average number of states inferred differed by about 400, with USM continually creating new states.

Figure 5. Results for the harvesting domains; USM i is the total reward accumulated by USM on harvesting problem i at the given step, and similarly for DMM i.
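For completeness, here is a minimal sketch of the non-episodic evaluation protocol used for the harvesting runs (3000 steps, total accumulated reward, averaged over 10 trials). The agent interface act(obs, reward) is a hypothetical placeholder, not the paper's implementation.

```python
def evaluate(make_agent, make_env, num_trials=10, num_steps=3000):
    """Average total accumulated reward over num_trials runs of num_steps each."""
    totals = []
    for _ in range(num_trials):
        env, agent = make_env(), make_agent()
        obs, reward, total = None, 0.0, 0.0
        for _ in range(num_steps):
            action = agent.act(obs, reward)   # the agent chooses an action from its history
            obs, reward = env.step(action)
            total += reward
        totals.append(total)
    return sum(totals) / len(totals)
```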
6. Conclusion

In this paper we introduced a novel model for learning hidden environment structure in RL problems. We showed how to use these models to approximate the environment and compute optimal policies. We further introduced a novel inference criterion and a heuristic algorithm to learn and plan with these models. We then performed experiments to show the potential of this approach. The avenues for future research are many. The most important at this time seems to be developing a principled algorithm for learning these models that will apply across many different tasks. Another important generalization would be to extend DMMs to more structured, possibly relational, state spaces. It would also be interesting to extend the notion of DMMs to arbitrary measurable state spaces and see how the inference criterion works in that setting.

Acknowledgments

We would like to thank John Lloyd and Marcus Hutter for many useful comments and fruitful discussions. The research was supported by the Australian Research Council Discovery Project DP0877635 "Foundation and Architectures for Agent Systems".

References

Bertsekas, D. P. and Shreve, S. E. Stochastic Optimal Control: The Discrete-Time Case. Athena Scientific, paperback edition, 1996.

Chipman, H. A., George, E. I., and McCulloch, R. E. Bayesian CART model search. Journal of the American Statistical Association, 93:935-948, 1998.

Chrisman, L. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 183-188, 1992.

Doshi-Velez, F. The infinite partially observable Markov decision process. In Proceedings of the 20th Neural Information Processing Systems Conference, 2009.

Holmes, M. P. and Isbell-Jr., C. L. Looping suffix tree-based inference of partially observable hidden state. In Proceedings of the 23rd International Conference on Machine Learning, 2006.

Hopcroft, J. E., Motwani, R., and Ullman, J. D. Introduction to Automata Theory, Languages, and Computation. Addison Wesley, 3rd edition, 2006.

Hutter, M. Feature Markov decision processes. In Proceedings of the 2nd AGI Conference, 2009.

Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.

Littman, M., Sutton, R., and Singh, S. Predictive representations of state. In Proceedings of the 15th Neural Information Processing Systems Conference, 2002.

Madani, O., Hanks, S., and Condon, A. On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence, 147(1-2):5-34, 2003.

Mahmud, M. M. H. A deterministic probabilistic approach to modeling general RL environments. Technical report, Australian National University, 2010.

McCallum, A. Instance-based utile distinctions for reinforcement learning. In Proceedings of the 12th International Machine Learning Conference, 1995.

Ron, D., Singer, Y., and Tishby, N. Learning probabilistic automata with variable memory length. In Proceedings of the 7th Conference on Computational Learning Theory, 1994.

Ross, S., Pineau, J., Paquet, S., and Chaib-draa, B. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32:663-704, 2008.

Sabbadin, R., Lang, J., and Ravaonjanahary, N. Purely epistemic Markov decision processes. In Proceedings of the 22nd National Conference on Artificial Intelligence, 2007.

Shalizi, C. and Klinkner, K. Blind construction of optimal nonlinear recursive predictors for discrete sequences. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004.

Vidal, E., Thollard, F., Higuera, C., Casacuberta, F., and Carrasco, R. C. Probabilistic finite-state machines - Part I. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7):1013-1025, 2005.