Model-Free Reinforcement Learning as Mixture Learning

Nikos Vlassis (vlassis@dpem.tuc.gr), Technical University of Crete, Dept. of Production Engineering and Management, 73100 Chania, Greece
Marc Toussaint (mtoussai@cs.tu-berlin.de), TU Berlin, Franklinstr. 28/29 FR6-9, 10587 Berlin, Germany

Abstract

We cast model-free reinforcement learning as the problem of maximizing the likelihood of a probabilistic mixture model via sampling, addressing both the infinite and finite horizon cases. We describe a Stochastic Approximation EM algorithm for likelihood maximization that, in the tabular case, is equivalent to a non-bootstrapping optimistic policy iteration algorithm like Sarsa(1) and that can be applied both in MDPs and POMDPs. On the theoretical side, by relating the proposed stochastic EM algorithm to the family of optimistic policy iteration algorithms, we provide new tools that permit the design and analysis of algorithms in that family. On the practical side, preliminary experiments on a POMDP problem demonstrate encouraging results.

(Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).)

1. Introduction

Reinforcement Learning (RL) is the problem of learning to control a stochastic dynamical system (or an agent interacting with its environment) by simulation and trial-and-error (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998). RL methods hold great promise for learning controllers for systems with unknown dynamics, and they have recently demonstrated impressive results (Abbeel et al., 2007; Kober & Peters, 2009).

In this paper we describe a reduction of model-free RL to a problem of likelihood maximization in a mixture model. Our approach is based on the work of Toussaint and Storkey (2006), who showed that the value function of a known MDP is proportional to the likelihood function of an appropriately constructed mixture model. This likelihood can then be maximized by probabilistic inference techniques like EM or Markov Chain Monte Carlo (Hoffman et al., 2008). We show in this paper that when the model of the MDP is unavailable we can use a stochastic EM algorithm to optimize the policy based on trajectory samples only. In particular, we use Stochastic Approximation EM (SAEM, Delyon et al., 1999), where the E-step is performed by sampling from the MDP with importance weights that reflect rewards.

However, a direct translation of the model of Toussaint and Storkey (2006) to model-free RL leads to an inefficient sampling algorithm that uses only the reward at the last time step of a sampled trajectory. We propose an alternative formulation that ensures that all rewards observed along a trajectory are taken into account in the mixture learning.

In the tabular case (discrete states and actions), the proposed SAEM algorithm turns out to be identical to a non-bootstrapping optimistic policy iteration algorithm similar to Sarsa(1). Optimistic policy iteration (OPI) is an instance of approximate policy iteration in which the policy is updated before it is completely evaluated (Bertsekas & Tsitsiklis, 1996; Tsitsiklis, 2002). In OPI, one or more complete or partial trajectories are generated according to the current policy, the observed rewards are then used to update that policy, and so forth. Of particular interest are non-bootstrapping OPI methods based on Monte Carlo partial policy evaluation, because these methods can be directly applied to partially observable domains (POMDPs) as well.

The convergence of OPI is not always guaranteed, even for tabular MDPs, and it depends on various factors including the choice of trajectories for simulation, the amount of bootstrapping used, the frequency of policy updates, and the particular form of the policy update operator (Tsitsiklis, 2002). Several counter-examples have been given in the literature that demonstrate possible non-convergent behavior of OPI (Bertsekas & Tsitsiklis, 1996; Gordon, 1996). For the application of OPI methods in POMDPs the theoretical results are scarce. If the policy update operator is a sufficiently smooth function of the action values, then value and policy fixed points are known to exist (Perkins & Pendrith, 2002), and convergence to a fixed point is guaranteed if the policy is fully evaluated before it is updated (Perkins & Precup, 2003). A similar result has been established by Melo et al. (2008). However, the general convergence of non-bootstrapping OPI algorithms is still open.

By relating the proposed SAEM algorithm to the family of OPI algorithms like Sarsa, we provide new tools that permit the design and analysis of OPI algorithms. We also report preliminary experiments in which SAEM demonstrates encouraging results on a standard POMDP problem.
2. Value function as mixture likelihood

Consider an infinite-horizon MDP on the random variables of state $x_t$ and action $u_t$, defined by the start distribution $p(x_0)$, the stationary transition probability $p(x_{t+1} \mid u_t, x_t)$, and a deterministic reward signal $r_t \equiv r(u_t, x_t)$. Given a stochastic policy $\pi(u_t \mid x_t)$ we define the value $V(\pi)$ as the expected discounted future reward

\[ V(\pi) = E\Big[\sum_{t=0}^{\infty} \gamma^t r_t ;\ \pi\Big], \tag{1} \]

where $\gamma \in (0,1)$ is a known discount factor. Our task is to find a stationary policy $\pi$ that maximizes $V(\pi)$.

Toussaint and Storkey (2006) showed that the value function can be expressed as a likelihood in an infinite-component mixture model. To express this formally we use the notation $\xi = (x_0, u_0, \ldots, x_T, u_T)$ for a state-action trajectory of length $|\xi| = T$. Let $\mathcal{X}^T = \{\xi : |\xi| = T\}$ be the space of length-$T$ trajectories and $\mathcal{X} = \bigcup_{T=0}^{\infty} \mathcal{X}^T$ the space of arbitrary-length trajectories. For a given MDP and policy $\pi$ we define the joint probability distribution over $\mathcal{X}$ and $T$ as

\[ p(\xi, T;\ \pi) = \pi(u_T \mid x_T) \Big[\prod_{t=0}^{T-1} p(x_{t+1} \mid u_t, x_t)\, \pi(u_t \mid x_t)\Big]\, p(x_0)\, \delta_{|\xi| T}\, p(T). \tag{2} \]

Here $\delta_{|\xi| T}$ is one for $|\xi| = T$ and zero otherwise, and $p(T)$ is a prior over the trajectory length. From this joint probability distribution we can define a conditional distribution over $\mathcal{X}$,

\[ p(\xi \mid T;\ \pi) = \frac{p(\xi, T;\ \pi)}{p(T)}, \tag{3} \]

and a marginal distribution over $\mathcal{X}$,

\[ p(\xi;\ \pi) = \sum_{T=0}^{\infty} p(\xi \mid T;\ \pi)\, p(T). \tag{4} \]

Note that both are properly normalized over the whole space $\mathcal{X}$. When rewards are scaled to $[0,1]$ we can introduce an auxiliary binary random variable $R$ with

\[ p(R = 1 \mid u_T, x_T) = r(u_T, x_T), \tag{5} \]

an idea suggested earlier in other contexts (Cooper, 1988; Dayan & Hinton, 1997; Peters & Schaal, 2008). If we further define

\[ p(R = 1 \mid \xi) = p(R = 1 \mid u_{|\xi|}, x_{|\xi|}) \tag{6} \]

we get the following joint distribution over $R$ and $\mathcal{X}$:

\[ p(R, \xi;\ \pi) = p(R \mid \xi)\, p(\xi;\ \pi). \tag{7} \]

If we assume a geometric prior $p(T) = (1-\gamma)\gamma^T$, then we can rewrite the value function (1) as

\[ \begin{aligned} V(\pi) &= \sum_{T=0}^{\infty} \gamma^T\, E[r_T;\ \pi] = \sum_{T=0}^{\infty} \gamma^T \sum_{u_T, x_T} p(u_T, x_T;\ \pi)\, r(u_T, x_T) \\ &= \frac{1}{1-\gamma} \sum_{T=0}^{\infty} p(T) \sum_{\xi \in \mathcal{X}^T} p(\xi \mid T;\ \pi)\, p(R=1 \mid \xi) \\ &= \frac{1}{1-\gamma} \sum_{\xi \in \mathcal{X}} p(\xi;\ \pi)\, p(R=1 \mid \xi). \end{aligned} \tag{8} \]

That is, we have related the value function $V(\pi)$ to the mixture likelihood $p(R=1;\ \pi)$ in the joint distribution $p(R, \xi;\ \pi)$, where the variable-length trajectory $\xi \in \mathcal{X}$ is the latent variable.

The model (8) can also be viewed as a generative process for the auxiliary random variable $R$: first sample a time step $T$ from $p(T)$, then sample a length-$T$ trajectory $\xi \sim p(\xi \mid T;\ \pi)$ from the MDP following policy $\pi$, and finally sample the binary random variable $R$ from $p(R \mid u_T, x_T)$. The probability that $R = 1$ is then proportional to the value $V(\pi)$.
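To make this generative view concrete, the following sketch (ours, not from the paper) simulates the process above on a small illustrative tabular MDP and checks numerically that the Monte Carlo estimate of $p(R=1;\ \pi)$ matches $(1-\gamma)V(\pi)$, with $V(\pi)$ computed exactly from the model. The MDP, the uniform policy, and the sample size are arbitrary assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small illustrative tabular MDP (our assumption, not from the paper):
# 3 states, 2 actions, rewards r(u, x) scaled to [0, 1], discount gamma.
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[x, u] = p(x' | u, x)
r = rng.uniform(size=(nS, nA))                  # r[x, u] in [0, 1]
p0 = np.ones(nS) / nS                           # start distribution p(x0)
pi = np.ones((nS, nA)) / nA                     # uniform policy pi(u | x)

def sample_R():
    """One draw of the auxiliary variable R: T ~ p(T) = (1-gamma) gamma^T,
    roll out a length-T trajectory under pi, then R ~ Bernoulli(r(u_T, x_T))."""
    T = rng.geometric(1.0 - gamma) - 1          # geometric prior on {0, 1, ...}
    x = rng.choice(nS, p=p0)
    for _ in range(T):
        u = rng.choice(nA, p=pi[x])
        x = rng.choice(nS, p=P[x, u])
    u = rng.choice(nA, p=pi[x])
    return rng.random() < r[x, u]

# Monte Carlo estimate of p(R = 1; pi), which equals (1-gamma) V(pi) by (8).
est = np.mean([sample_R() for _ in range(20000)])

# Exact V(pi) for comparison, by solving the Bellman linear system.
P_pi = np.einsum('xu,xuy->xy', pi, P)           # state transitions under pi
r_pi = np.einsum('xu,xu->x', pi, r)             # expected reward under pi
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
print(est, (1.0 - gamma) * p0 @ V)              # the two numbers should roughly agree
```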
Modeling the value function as a mixture likelihood offers the possibility to use probabilistic inference techniques for optimization, like the EM algorithm (Toussaint & Storkey, 2006) or Markov Chain Monte Carlo (Hoffman et al., 2008). In particular, in the EM algorithm we compute a parameter $\pi$ that locally maximizes the likelihood $p(R;\ \pi)$ by iteratively maximizing a lower bound of $\log p(R;\ \pi)$ (Neal & Hinton, 1998). Let $q(\xi)$ be a distribution over the latent variable $\xi \in \mathcal{X}$, and consider the function

\[ F(q, \pi) = \log p(R;\ \pi) - D\big(q(\xi)\ \big\|\ p(\xi \mid R;\ \pi)\big) \tag{9} \]
\[ \phantom{F(q, \pi)} = \sum_{\xi \in \mathcal{X}} q(\xi)\big[\log p(R \mid \xi) + \log p(\xi;\ \pi)\big] + H(q), \tag{10} \]

where $D(\cdot\|\cdot)$ is the Kullback-Leibler divergence between two distributions and $H(\cdot)$ is the entropy of a distribution. We start with an initial guess of the parameter $\pi$, and then alternate between an E-step and an M-step. In the E-step we find a distribution $q$ that maximizes $F(q, \pi)$ for fixed $\pi$, using (9). The optimal $q^*$ that minimizes the Kullback-Leibler divergence in (9) is the posterior over the latent variables given the observed data (in our case the hypothetical $R = 1$) and the current parameters $\pi_{\mathrm{old}}$:

\[ q^*(\xi) = p(\xi \mid R;\ \pi_{\mathrm{old}}). \tag{11} \]

In the M-step we find a new parameter $\pi$ that maximizes $F(q^*, \pi)$ for the optimal $q^*$, using (10). Often this maximization can be performed analytically.

If the model of the MDP is available, the EM algorithm is similar to ordinary policy iteration: the E-step involves a full sweep of the state-action space in order to compute $q^*$, which is analogous to policy evaluation, and the M-step maximizes $F(q^*, \pi)$ over the policy parameters $\pi$, which is analogous to policy improvement. In this paper we are interested in the model-free case, in which we do not have access to the transition model of the MDP. In this case we cannot perform exact inference in the E-step, and hence we have to resort to sampling, giving rise to a stochastic EM algorithm.

3. Stochastic Approximation EM

In the model-free case we propose to use a stochastic EM algorithm in which the E-step posterior distribution is estimated by Monte Carlo sampling from the MDP, so that no explicit knowledge of the MDP model is needed. There are several variants of stochastic EM in the literature that differ in the details of the Monte Carlo approximation, particularly in the number of Monte Carlo draws per iteration, and in their conditions for convergence (Celeux & Diebolt, 1985; Wei & Tanner, 1990; Delyon et al., 1999). In our work we use the Stochastic Approximation EM algorithm (SAEM, Delyon et al., 1999), which is easy to implement and guarantees convergence to a local maximum of the likelihood function under reasonable conditions. An attractive feature of SAEM is that convergence is guaranteed even when the sample size is kept constant over successive iterations of the algorithm, as opposed to other stochastic EM versions that may need to tune the sample size in order to ensure convergence. Additionally, in the context of our RL application the SAEM algorithm can be given a familiar interpretation in terms of state-action values, as we will see in later sections.

The SAEM algorithm iterates between three steps (Delyon et al., 1999). In iteration $k$ we

1. generate $m$ samples $\xi_i$, $i = 1, \ldots, m$, from $p(\xi \mid R;\ \pi_{k-1})$, where $\pi_{k-1}$ is the parameter from the previous iteration,

2. update a function $\hat F_k(\pi)$ according to

\[ \hat F_k(\pi) = (1 - \alpha_k)\, \hat F_{k-1}(\pi) + \frac{\alpha_k}{m} \sum_{i=1}^{m} \log p(\xi_i;\ \pi), \tag{12} \]

3. and assign new parameters

\[ \pi_k = \arg\max_{\pi} \hat F_k(\pi). \tag{13} \]

To understand this scheme, first assume that $\alpha_k = 1$ in each iteration. Then $\hat F_k = \frac{1}{m}\sum_{i=1}^{m} \log p(\xi_i;\ \pi)$ is simply the sample-based approximation of the function $F(q, \pi)$ in (10) (neglecting terms independent of $\pi$), and the third step is the standard M-step based on this approximation. Special about the SAEM algorithm is that it uses a learning rate $\alpha_k$, with $0 < \alpha_k < 1$, and thereby smoothes the approximation of $F(q, \pi)$ over several iterations, which is key for its convergence.

In the case of our model $p(R, \xi;\ \pi)$ (equation (7)) we can use importance weights for the sampling in step 1: we can generate samples from

\[ p(\xi \mid R = 1;\ \pi_{k-1}) \propto p(\xi;\ \pi_{k-1})\, p(R = 1 \mid \xi) \tag{14} \]

by first sampling $m$ variable-length trajectories $\xi_i \sim p(\xi;\ \pi_{k-1})$ from the MDP using the previous policy $\pi_{k-1}$, and then assigning them importance weights

\[ w_i = p(R = 1 \mid \xi_i). \tag{15} \]

Note that in this case the importance weights correspond to the terminal rewards $r(u_{|\xi_i|}, x_{|\xi_i|})$ of the trajectories $\xi_i$. Accordingly, the function $\hat F_k$ in step 2 reads

\[ \hat F_k(\pi) = (1 - \alpha_k)\, \hat F_{k-1}(\pi) + \frac{\alpha_k}{m} \sum_{i=1}^{m} w_i \log p(\xi_i;\ \pi). \tag{16} \]

For particular choices of policy parametrization $\pi$, the function $\hat F_k$ can be maximized analytically, as we will see later on.
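For concreteness, here is a sketch (ours) of one SAEM iteration for the original model (7) with a tabular policy, anticipating the count-based M-step derived in Section 5.2: trajectories are weighted only by their terminal reward, as in (15)-(16). The simulator interface env.reset() / env.step(x, u) and the handling of states with no data are our own assumptions.

```python
import numpy as np

def saem_step_terminal_reward(env, pi, S, alpha, m, gamma, rng):
    """One SAEM iteration for model (7) with a tabular policy pi[x, u].
    Each trajectory contributes its state-action counts weighted by the
    terminal reward only (eqs. 15-16); S accumulates these statistics with
    learning rate alpha, and the M-step renormalizes S row-wise.
    `env` is an assumed simulator: reset() -> x0, step(x, u) -> (x_next, r),
    with rewards scaled to [0, 1]."""
    nS, nA = pi.shape
    batch = np.zeros((nS, nA))
    for _ in range(m):
        T = rng.geometric(1.0 - gamma) - 1        # trajectory length ~ p(T)
        counts = np.zeros((nS, nA))
        x = env.reset()
        for _ in range(T + 1):
            u = rng.choice(nA, p=pi[x])
            counts[x, u] += 1
            x, r = env.step(x, u)
        batch += r * counts / m                   # weight = terminal reward r(u_T, x_T)
    S = (1.0 - alpha) * S + alpha * batch         # stochastic approximation, eq. (16)
    row = S.sum(axis=1, keepdims=True)
    pi_new = np.where(row > 0, S / np.maximum(row, 1e-12), pi)  # M-step; keep pi where no data
    return pi_new, S
```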
An advantage of SAEM over other stochastic EM algorithms is that step 2 effectively makes use of all simulated data when optimizing for the current $\pi_k$: translated to our RL problem, this means that the policy $\pi_k$ computed at step $k$ depends on trajectories and rewards observed while following policies other than $\pi_{k-1}$ at earlier iterations. This feature makes SAEM similar to on-policy control algorithms like Sarsa (Sutton & Barto, 1998) that iteratively estimate Q-functions using rewards obtained by following previous policies. The SAEM algorithm is guaranteed to converge to a local maximum of the likelihood function for models in the exponential family, under general conditions that include the standard stochastic-approximation conditions on $\alpha_k$ (Delyon et al., 1999).

However, a drawback of using the model $p(R, \xi;\ \pi)$ in (7) is that the function $\hat F_k(\pi)$ in (16) uses only a single reward per trajectory (the terminal one), and therefore many observed rewards are wasted. The result is that a huge set of trajectories would be needed in order for $\hat F_k(\pi)$ to contain useful information for computing the new $\pi_k$. Although the SAEM framework described above justifies the use of a single reward per trajectory for deriving a convergent on-policy control algorithm, such an algorithm would be of little practical significance as it would require an unrealistically large data sample to produce meaningful results. (We have verified this empirically.) In the next section we show that we can express the value function using a different mixture model, in which the M-step of the corresponding SAEM algorithm uses all intermediate rewards in a trajectory.

4. An alternative mixture model for V

We show here that the value function $V(\pi)$ can be expressed in terms of an alternative mixture model:

Theorem 1. There exists a parametrized family of geometric distributions $a(t)$ and $b(t)$ for which the value function is proportional to

\[ V(\pi) \propto \sum_{T=0}^{\infty} a(T) \sum_{t=0}^{\infty} b(t) \sum_{\xi \in \mathcal{X}^t} p(\xi \mid t;\ \pi)\, p(R = 1 \mid \xi, T), \tag{17} \]

with

\[ p(R = 1 \mid \xi, T) = \begin{cases} p(R = 1 \mid u_{|\xi|}, x_{|\xi|}) & \text{if } |\xi| \le T \\ 0 & \text{otherwise} \end{cases} \tag{18} \]

and $p(\xi \mid t;\ \pi)$ defined in accordance with (3).

Proof. We prove the theorem by explicitly constructing the parametrized family of distributions $a(\cdot)$ and $b(\cdot)$. The family is parametrized by a scalar $\beta$ with $\gamma < \beta < 1$. We define

\[ a(t) = (1 - \beta)\, \beta^t, \tag{19} \]
\[ b(t) = (1 - \gamma/\beta)\, (\gamma/\beta)^t, \tag{20} \]

for $t = 0, \ldots, \infty$. Since $p(R = 1 \mid \xi, T) = 0$ for $|\xi| > T$, the right-hand side of (17) reads

\[ \sum_{T=0}^{\infty} a(T) \sum_{t=0}^{T} b(t) \sum_{\xi \in \mathcal{X}^t} p(\xi \mid t;\ \pi)\, p(R = 1 \mid u_t, x_t), \tag{21} \]

where we have truncated the $t$-summation at time $T$. The term $\sum_{\xi \in \mathcal{X}^t} p(\xi \mid t;\ \pi)\, p(R = 1 \mid u_t, x_t)$ is by definition the expectation $E[r_t;\ \pi]$ (as in (8)), and the right-hand side of (17) becomes

\[ \sum_{T=0}^{\infty} a(T) \sum_{t=0}^{T} b(t)\, E[r_t;\ \pi] \tag{22} \]
\[ = (1-\beta)(1-\gamma/\beta) \sum_{T=0}^{\infty} \beta^T \sum_{t=0}^{T} (\gamma/\beta)^t\, E[r_t;\ \pi] \tag{23} \]
\[ = (1-\beta)(1-\gamma/\beta) \sum_{t=0}^{\infty} (\gamma/\beta)^t\, E[r_t;\ \pi] \sum_{T=t}^{\infty} \beta^T \tag{24} \]
\[ = (1-\beta)(1-\gamma/\beta) \sum_{t=0}^{\infty} (\gamma/\beta)^t\, E[r_t;\ \pi]\, \beta^t \sum_{T=0}^{\infty} \beta^T \tag{25} \]
\[ = (1-\gamma/\beta) \sum_{t=0}^{\infty} \gamma^t\, E[r_t;\ \pi] = (1-\gamma/\beta)\, V(\pi), \tag{26} \]

and hence (17) holds with proportionality constant $1/(1-\gamma/\beta)$.
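As a quick sanity check of the chain of equalities (22)-(26), the following sketch (ours) verifies numerically that $\sum_T a(T)\sum_{t\le T} b(t)\, e_t = (1-\gamma/\beta)\sum_t \gamma^t e_t$ for an arbitrary bounded sequence $e_t$ standing in for $E[r_t;\ \pi]$; the particular values of $\gamma$, $\beta$, and the truncation horizon are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, beta = 0.9, 0.95                       # any gamma < beta < 1
Tmax = 2000                                   # truncation horizon for the check
e = rng.uniform(size=Tmax)                    # stands in for E[r_t; pi]

t = np.arange(Tmax)
a = (1 - beta) * beta ** t                    # a(T), eq. (19)
b = (1 - gamma / beta) * (gamma / beta) ** t  # b(t), eq. (20)

lhs = np.sum(a * np.cumsum(b * e))            # sum_T a(T) sum_{t<=T} b(t) e_t, eq. (22)
rhs = (1 - gamma / beta) * np.sum(gamma ** t * e)   # eq. (26)
print(lhs, rhs)                               # agree up to truncation error
```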
Let us contrast the new model with the previous mixture model (equations (7) and (8)). In that model we had the joint $p(R, \xi;\ \pi)$ over $R$ and $\xi \in \mathcal{X}$, where the length of $\xi$ was distributed according to $p(T) \propto \gamma^T$ (equation (4)). What we have derived in the above theorem corresponds to a joint distribution over $R$, $t$, $\xi \in \mathcal{X}^t$, and a new random variable $T$:

\[ p(R, t, \xi, T;\ \pi) = a(T)\, b(t)\, p(\xi \mid t;\ \pi)\, p(R \mid \xi, T). \tag{27} \]

The new variable $T$ can be thought of as providing a 'maximal' trajectory length, in the sense that, for given $T$, only trajectories with length shorter than $T$ have nonzero probability when conditioned on $R = 1$. Effectively, the $T$ variable implies an additional discounting, so that the geometric priors $a(T)$ and $b(t)$ together induce the appropriate discounting. As the theorem shows, the mixture likelihood $p(R = 1)$ is also in the new model proportional to the value function $V(\pi)$.

When $\beta \to \gamma$ (but strictly $\beta > \gamma$) the distribution $b(t)$ is nearly constant (it decays very slowly) and the distribution $a(T)$ approximates $p(T)$. In this case, using expression (21), we have

\[ \begin{aligned} V(\pi) &\propto \sum_{T=0}^{\infty} a(T) \sum_{t=0}^{T} b(t) \sum_{\xi \in \mathcal{X}^t} p(\xi \mid t;\ \pi)\, p(R = 1 \mid u_t, x_t) \\ &= \sum_{T=0}^{\infty} a(T) \sum_{t=0}^{T} b(t) \sum_{\xi \in \mathcal{X}^T} p(\xi \mid T;\ \pi)\, p(R = 1 \mid u_t, x_t) \\ &= \sum_{T=0}^{\infty} a(T) \sum_{\xi \in \mathcal{X}^T} p(\xi \mid T;\ \pi) \sum_{t=0}^{T} b(t)\, p(R = 1 \mid u_t, x_t) \\ &\approx \sum_{T=0}^{\infty} p(T) \sum_{\xi \in \mathcal{X}^T} p(\xi \mid T;\ \pi) \sum_{t=0}^{T} p(R = 1 \mid u_t, x_t). \end{aligned} \tag{28} \]

In the second line we could extend the summation over $\mathcal{X}^t$ to $\mathcal{X}^T$ since the additional summation over $(x_{t+1:T}, u_{t+1:T})$ sums to 1. Equation (28) corresponds to the stochastic shortest path formulation of an infinite-horizon value function (Bertsekas & Tsitsiklis, 1996, p. 39), and our theorem can be regarded as a generalization of that formulation.

5. Stochastic approximation EM in the new model

We derive here a SAEM algorithm for the new mixture model (17). Contrary to Section 3, we now have three latent random variables: $T \sim a(T)$, $t \sim b(t)$, and $\xi \in \mathcal{X}^t$. In analogy to (14), in step 1 of SAEM we need to sample from the posterior

\[ p(t, \xi, T \mid R;\ \pi_{k-1}) \propto a(T)\, b(t)\, p(\xi \mid t;\ \pi_{k-1})\, p(R \mid \xi, T). \tag{29} \]

To sample from this posterior we first sample $T$, then $t$, then $\xi \in \mathcal{X}^t$. The advantage of sampling in this order is that, since we only need trajectories with nonzero posterior, we can constrain the range of $t$ to $0, \ldots, T$. Further, instead of sampling independent trajectories of different lengths $t$, we can reuse samples more efficiently: for a given $T_i$, we generate a single trajectory $\xi_i$ of length $T_i$ and treat all of its sub-trajectories (length-$t$ prefixes of $\xi_i$) as samples from $p(\xi \mid t;\ \pi_{k-1})$, for $t = 0, \ldots, T_i$. We denote the length-$t$ prefix of $\xi_i$ by $\xi_{it}$. To each of these sub-trajectories $\xi_{it}$ we then assign the importance weight

\[ w_{it} = b(t)\, p(R = 1 \mid \xi_{it}, T_i). \tag{30} \]

The update of $\hat F_k$ in step 2 of SAEM then reads

\[ \hat F_k(\pi) = (1 - \alpha_k)\, \hat F_{k-1}(\pi) + \frac{\alpha_k}{m} \sum_{i=1}^{m} \frac{1}{|\xi_i| + 1} \sum_{t=0}^{|\xi_i|} w_{it} \log p(\xi_{it} \mid t;\ \pi). \tag{31} \]

Note that this update takes into account all rewards observed along a trajectory $\xi_i$.
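Here is a sketch (ours) of step 1 for the new model: sample $T_i \sim a(T)$, roll out a single trajectory of that length, and attach to each length-$t$ prefix the weight $w_{it} = b(t)\, r(u_t, x_t)$ from (29)-(30). The simulator interface is the same assumption as before, and we return the state-action counts of each prefix because that is what the tabular update of Section 5.2 will need.

```python
import numpy as np

def sample_weighted_prefixes(env, pi, gamma, beta, rng):
    """Sample T_i ~ a(T) = (1-beta) beta^T, generate one trajectory of length
    T_i under pi, and weight every length-t prefix by w_it = b(t) * r(u_t, x_t)
    (eqs. 29-30).  Returns a list of (t, prefix state-action counts, w_it)."""
    nS, nA = pi.shape
    T = rng.geometric(1.0 - beta) - 1
    b = lambda t: (1 - gamma / beta) * (gamma / beta) ** t
    x = env.reset()
    counts = np.zeros((nS, nA))
    prefixes = []
    for t in range(T + 1):
        u = rng.choice(nA, p=pi[x])
        counts[x, u] += 1
        x_next, r = env.step(x, u)               # r = r(u_t, x_t), scaled to [0, 1]
        prefixes.append((t, counts.copy(), b(t) * r))
        x = x_next
    return prefixes
```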
5.1. The finite horizon case

We can also derive a SAEM algorithm for the finite horizon case, in which the value function is the expected sum of rewards up to a terminal time step $H$:

\[ V(\pi) = E\Big[\sum_{t=0}^{H} r_t ;\ \pi\Big]. \tag{32} \]

In the original model of Section 2 this value translates to a likelihood, analogous to equation (8), when we choose the time prior $p(T) = 1/(H+1)$ for $T \le H$ and zero otherwise. Again, deriving a SAEM algorithm directly for this original mixture model leads to an inefficient use of samples. However, the finite horizon case can alternatively be expressed by choosing in equation (17) a delta distribution $a(T) = \delta_{TH}$ and a uniform distribution $b(t) = 1/(H+1)$ for $t \le H$, and deriving the SAEM algorithm as above. This leads to exactly the same algorithm as above, except that we always choose length $T = H$ to generate the sample trajectories $\xi_i$. In this case the update of $\hat F_k$ reads

\[ \hat F_k(\pi) = (1 - \alpha_k)\, \hat F_{k-1}(\pi) + \frac{\alpha_k}{m(H+1)} \sum_{i=1}^{m} \sum_{t=0}^{H} w_{it} \log p(\xi_{it} \mid t;\ \pi), \tag{33} \]

where the importance weights now correspond to immediate rewards:

\[ w_{it} = p(R = 1 \mid \xi_{it}, T) = p(R = 1 \mid u_{|\xi_{it}|}, x_{|\xi_{it}|}). \tag{34} \]

5.2. The tabular case

Here we consider the classical tabular case, where the state and action spaces are discrete and the policy is parametrized by the conditional probability table $\pi_{ux}$ of taking action $u$ at state $x$, with $\sum_u \pi_{ux} = 1$ for all $x$. For convenience we can write the policy using an indicator function:

\[ \pi(u_t \mid x_t) = \exp\Big(\sum_{x,u} I(x_t = x,\, u_t = u) \log \pi_{ux}\Big). \tag{35} \]

Then the log probability of a sub-trajectory $\xi_{it}$ under the MDP (using (2) and (3), and ignoring terms that do not depend on $\pi$) is

\[ \log p(\xi_{it} \mid t;\ \pi) = \mathrm{const.} + \sum_{x,u} c^{ux}_{it} \log \pi_{ux}, \tag{36} \]

where $c^{ux}_{it}$ is the number of occurrences of the state-action pair $(u, x)$ in the trajectory $\xi_{it}$, and thereby the number of occurrences of $(u, x)$ in the sample $\xi_i$ up to time $t$. From equation (31) the update of $\hat F_k$ reads (ignoring constants)

\[ \hat F_k(\pi) = (1 - \alpha_k)\, \hat F_{k-1}(\pi) + \frac{\alpha_k}{m} \sum_{x,u} (\log \pi_{ux}) \sum_{i=1}^{m} \frac{1}{|\xi_i| + 1} \sum_{t=0}^{|\xi_i|} w_{it}\, c^{ux}_{it}. \tag{37} \]

In step 3 of the SAEM algorithm we need to find $\arg\max_{\pi} \hat F_k(\pi)$. Let us define a function $Q_k(x, u)$ over all $(x, u)$ pairs as

\[ Q_k(x, u) = (1 - \alpha_k)\, Q_{k-1}(x, u) + \frac{\alpha_k}{m} \sum_{i=1}^{m} \frac{1}{|\xi_i| + 1} \sum_{t=0}^{|\xi_i|} w_{it}\, c^{ux}_{it}. \tag{38} \]

This function can be regarded as an action-value function: since the importance weights $w_{it}$ are proportional to immediate rewards (and they are approximately equal to the immediate rewards when $b(t)$ is nearly constant, e.g., when $\beta \to \gamma$ or in the finite horizon case), the last term in (38) effectively performs a partial evaluation of the most recently followed policy $\pi_{k-1}$ in an approximate batch every-visit Monte Carlo manner. This term is combined with the action-value function $Q_{k-1}$ of the previous EM iteration to produce the new estimate $Q_k$, and so forth for each $k$. Using the definition of the Q-function, the function $\hat F_k(\pi)$ reads

\[ \hat F_k(\pi) = \sum_{x,u} Q_k(x, u) \log \pi_{ux}. \tag{39} \]

Maximization of $\hat F_k(\pi)$ using a Lagrange term $\lambda_x \big(\sum_u \pi_{ux} - 1\big)$ for each state $x$ gives

\[ \pi_k(u \mid x) \propto Q_k(x, u). \tag{40} \]

The smoothness of this policy update operator as a function of the action values has been key in establishing the convergence of related approximate policy iteration algorithms (Jaakkola et al., 1995; Perkins & Precup, 2003).

In summary, SAEM is an optimistic policy iteration algorithm for RL that is similar to Sarsa(1), with a smooth (non-greedy) policy improvement step (40). The algorithm is based on batch every-visit Monte Carlo policy evaluation and involves no bootstrapping, hence it is also appropriate for domains that exhibit partial observability (POMDPs). Its convergence can be guaranteed under quite general conditions (Delyon et al., 1999), but we have not attempted to provide formal proofs in this paper.
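Putting the pieces together, here is a sketch (ours) of one full iteration of the tabular algorithm, i.e. the update (38) followed by the smooth policy improvement (40). It reuses the sample_weighted_prefixes sketch above, and the handling of states with no data is again our own assumption.

```python
import numpy as np

def tabular_saem_iteration(env, pi, Q, alpha, m, gamma, beta, rng):
    """One iteration of tabular SAEM, viewed as optimistic policy iteration:
    accumulate reward-weighted state-action counts from m trajectories,
    blend them into the running Q estimate with learning rate alpha (eq. 38),
    and set the new policy proportional to Q (eq. 40)."""
    batch = np.zeros_like(Q)
    for _ in range(m):
        prefixes = sample_weighted_prefixes(env, pi, gamma, beta, rng)
        for t, counts, w in prefixes:
            batch += w * counts / len(prefixes)   # len(prefixes) = |xi_i| + 1
    Q = (1.0 - alpha) * Q + alpha * batch / m     # eq. (38)
    row = Q.sum(axis=1, keepdims=True)
    pi_new = np.where(row > 0, Q / np.maximum(row, 1e-12), pi)   # eq. (40)
    return pi_new, Q
```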
6. Experiments

For illustration purposes we first demonstrate the proposed SAEM algorithm on the 'chain' toy MDP (Dearden et al., 1998). The process always starts from state 1, every chosen action can be flipped with probability 0.2 (hence the state transitions are effectively stochastic), and the discount factor is $\gamma = 0.99$. The optimal policy is to always take action 'a' from any state.

[Figure 1. Learning curve of SAEM on the 'chain' MDP. The dashed line corresponds to the optimal value $V(1)$.]

In Fig. 1 we show the learning curve of SAEM (mean and standard deviation of value over five runs) when initializing with a uniformly random policy. In each iteration of SAEM we sampled $m = 50$ trajectories and used $\beta = \gamma$, which means that the length of each trajectory was sampled from a geometric distribution with parameter $\gamma$ and that $b(t)$ is constant. We evaluated each policy by standard MDP policy evaluation using the model. With these settings the algorithm was consistently able to locate the optimal policy in all runs. For smaller sample sizes (e.g., $m = 10$) the algorithm would often converge to suboptimal policies corresponding to local maxima of the likelihood function (not shown in the graph).

Since SAEM involves no bootstrapping, it can also be applied to POMDPs. We applied the algorithm to the Hallway POMDP, a standard problem from the POMDP literature (Littman et al., 1995). This is a robot maze problem with partial observability, where the robot always starts from an unknown state drawn from a fixed starting distribution and must reach a goal state. The POMDP involves 60 states, 21 observations, and 5 actions; here $\gamma = 0.95$. In this experiment we used a stochastic memoryless policy for the agent, which can be regarded as a finite state controller with as many nodes as the number of observations.¹

¹ We chose the class of stochastic memoryless policies just for illustration purposes; it is trivial to adapt SAEM to work with any stochastic finite state controller.

In Fig. 2 we show the learning curves of SAEM for sample sizes $m = 1000$ and $m = 100$. In both cases we executed five runs, starting with a uniformly random policy and using $\beta = 0.99$. For $m = 1000$ we plot the mean and standard deviation of the value of each discovered policy (thick line with error bars), and for $m = 100$ we just show the five runs (dashed lines). The value of a policy was computed by standard model-based policy evaluation of finite state controllers (Hansen, 1998).

[Figure 2. Learning curves of SAEM on the Hallway POMDP, for two different sample sizes.]

For the small sample size ($m = 100$) the algorithm often got trapped in local maxima of the likelihood function (e.g., the lower two dashed lines), but for the larger sample size ($m = 1000$) the algorithm seemed able to avoid bad local maxima, consistently heading for high-value policies (as suggested by the small error bars). This is an encouraging result, confirming the potential of SAEM to handle domains with partial observability.²

² State-of-the-art model-based POMDP solvers produce policies with value around 1 on this problem (Poupart, 2005; Shani et al., 2007).
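For completeness, a hedged sketch of the 'chain' experiment that drives the tabular_saem_iteration sketch from the previous section. The exact rewards of the chain MDP are our recollection of the common specification in Dearden et al. (1998) (action 'a' pays 10 in the last state, action 'b' resets to state 1 and pays 2) and are rescaled to [0, 1] as the auxiliary-variable construction requires; the SAEM settings only loosely follow the text (we take $\beta$ slightly above $\gamma$ rather than $\beta = \gamma$ so that $b(t)$ in the sketch above stays well defined), so this is illustrative rather than a reproduction of Fig. 1.

```python
import numpy as np

class ChainMDP:
    """Sketch of the 5-state 'chain' MDP (Dearden et al., 1998), as commonly
    specified: action 0 ('a') advances along the chain and pays 10 in the last
    state, action 1 ('b') resets to state 1 and pays 2; the chosen action is
    flipped with probability 0.2.  Rewards are rescaled to [0, 1]."""
    nS, nA = 5, 2

    def __init__(self, rng):
        self.rng = rng

    def reset(self):
        return 0                                  # always start in state 1

    def step(self, x, u):
        if self.rng.random() < 0.2:               # action slip
            u = 1 - u
        if u == 0:                                # 'a': move forward
            r, x_next = (10.0 if x == self.nS - 1 else 0.0), min(x + 1, self.nS - 1)
        else:                                     # 'b': return to state 1
            r, x_next = 2.0, 0
        return x_next, r / 10.0                   # scale rewards to [0, 1]

rng = np.random.default_rng(2)
env, gamma, beta = ChainMDP(rng), 0.99, 0.995     # beta slightly above gamma
pi = np.ones((env.nS, env.nA)) / env.nA           # uniformly random initial policy
Q = np.zeros((env.nS, env.nA))
for k in range(1, 101):                           # illustrative schedule, m = 50 as in the text
    pi, Q = tabular_saem_iteration(env, pi, Q, alpha=1.0 / k, m=50,
                                   gamma=gamma, beta=beta, rng=rng)
print(np.round(pi, 2))                            # with enough iterations, pi should favor action 'a'
```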
7. Conclusions

In this paper we reformulated model-free Reinforcement Learning as a probabilistic inference problem (mixture learning). The approach draws on the probabilistic model of Toussaint and Storkey (2006), which allows casting policy optimization as a problem of likelihood maximization in a mixture model. The key to turning this into an efficient model-free RL algorithm was to propose a new mixture model (17) that leads to a stochastic EM algorithm (SAEM) using all reward signals along a sampled trajectory. The proposed model can be viewed as a generalization of the stochastic shortest path reformulation of an infinite-horizon MDP (Bertsekas & Tsitsiklis, 1996). Since the SAEM algorithm is non-bootstrapping, it can also be useful in domains exhibiting partial observability. We tried SAEM on a standard problem from the POMDP literature, with encouraging results, providing further evidence that Sarsa-style algorithms hold promise in POMDPs (Loch & Singh, 1998). The main contribution of this work is the established link between the family of stochastic EM algorithms like SAEM and the family of optimistic policy iteration algorithms like Sarsa(1), which we believe can provide new tools for the design and analysis of RL algorithms.

Acknowledgements

We would like to thank the reviewers for their helpful reviews. Marc Toussaint is supported by the German Research Foundation (DFG), Emmy Noether fellowship TO 409/1-3.

References

Abbeel, P., Coates, A., Quigley, M., & Ng, A. Y. (2007). An application of reinforcement learning to aerobatic helicopter flight. In B. Schölkopf, J. Platt and T. Hoffman (Eds.), Advances in neural information processing systems 19, 1-8. Cambridge, MA: MIT Press.

Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific.

Celeux, G., & Diebolt, J. (1985). The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2, 73-82.

Cooper, G. F. (1988). A method for using belief networks as influence diagrams. Proc. 4th Workshop on Uncertainty in Artificial Intelligence (pp. 55-63). Minneapolis, Minnesota, USA.

Dayan, P., & Hinton, G. E. (1997). Using Expectation-Maximization for reinforcement learning. Neural Computation, 9, 271-278.

Dearden, R., Friedman, N., & Russell, S. (1998). Bayesian Q-learning. Proc. 15th National Conf. on Artificial Intelligence (pp. 761-768). Madison, Wisconsin, USA.

Delyon, B., Lavielle, M., & Moulines, E. (1999). Convergence of a stochastic approximation version of the EM algorithm. The Annals of Statistics, 27, 94-128.

Gordon, G. (1996). Chattering in Sarsa(λ) (Technical Report). CMU Learning Lab internal report.

Hansen, E. (1998). Solving POMDPs by searching in policy space. Proc. 14th Int. Conf. on Uncertainty in Artificial Intelligence (pp. 211-219). Madison, Wisconsin, USA.

Hoffman, M., Doucet, A., De Freitas, N., & Jasra, A. (2008). Bayesian policy learning with trans-dimensional MCMC. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in neural information processing systems 20, 665-672. Cambridge, MA: MIT Press.
Jaakkola, T., Singh, S. P., & Jordan, M. I. (1995). Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in neural information processing systems 7, 345-352. MIT Press.

Kober, J., & Peters, J. (2009). Policy search for motor primitives in robotics. In D. Koller, D. Schuurmans, Y. Bengio and L. Bottou (Eds.), Advances in neural information processing systems 21, 849-856.

Littman, M. L., Cassandra, A. R., & Kaelbling, L. P. (1995). Learning policies for partially observable environments: Scaling up. Proc. 12th Int. Conf. on Machine Learning (pp. 362-370).

Loch, J., & Singh, S. P. (1998). Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes. Proc. 15th Int. Conf. on Machine Learning (pp. 323-331). Madison, Wisconsin, USA.

Melo, F. S., Meyn, S. P., & Ribeiro, M. I. (2008). An analysis of reinforcement learning with function approximation. Proc. 25th Int. Conf. on Machine Learning (pp. 664-671). Helsinki, Finland.

Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models, 355-368. Kluwer Academic Publishers.

Perkins, T. J., & Pendrith, M. D. (2002). On the existence of fixed points for Q-learning and Sarsa in partially observable domains. Proc. 19th Int. Conf. on Machine Learning (pp. 490-497).

Perkins, T. J., & Precup, D. (2003). A convergent form of approximate policy iteration. In S. Becker, S. Thrun and K. Obermayer (Eds.), Advances in neural information processing systems 15, 1595-1602. Cambridge, MA: MIT Press.

Peters, J., & Schaal, S. (2008). Learning to control in operational space. International Journal of Robotics Research, 27, 197-212.

Poupart, P. (2005). Exploiting structure to efficiently solve large scale partially observable Markov decision processes. Doctoral dissertation, Dept. of Computer Science, University of Toronto.

Shani, G., Brafman, R. I., & Shimony, S. E. (2007). Forward search value iteration for POMDPs. Int. Joint Conf. on Artificial Intelligence (pp. 2619-2624).

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

Toussaint, M., & Storkey, A. (2006). Probabilistic inference for solving discrete and continuous state Markov decision processes. Proc. 23rd Int. Conf. on Machine Learning (pp. 945-952). Pittsburgh, Pennsylvania, USA.

Tsitsiklis, J. N. (2002). On the convergence of optimistic policy iteration. Journal of Machine Learning Research, 3, 59-72.

Wei, G., & Tanner, M. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithm. Journal of the American Statistical Association, 85, 699-704.