Trans-dimensional MCMC for Bayesian Policy Learning

Matt Hoffman, Dept. of Computer Science, University of British Columbia, hoffmanm@cs.ubc.ca
Nando de Freitas, Dept. of Computer Science, University of British Columbia, nando@cs.ubc.ca
Arnaud Doucet, Depts. of Statistics and Computer Science, University of British Columbia, arnaud@cs.ubc.ca
Ajay Jasra, Dept. of Mathematics, Imperial College London, ajay.jasra@imperial.ac.uk

Abstract

A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new understanding, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with a single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulation methods in situations where the reward is a rare event.

1 Introduction

Continuous state-space Markov Decision Processes (MDPs) are notoriously difficult to solve. Except for a few rare cases, including linear Gaussian models with quadratic cost, there is no closed-form solution and approximations are required [4]. A large number of methods have been proposed in the literature relying on value function approximation and policy search, including [3, 10, 14, 16, 18]. In this paper, we follow the policy learning approach because of its promise and remarkable success in complex domains; see for example [13, 15].

Our work is strongly motivated by a recent formulation of stochastic planning and control problems as inference problems. This line of work appears to have been initiated in [5], where the authors used EM as an alternative to standard stochastic gradient algorithms to maximize an expected cost. In [2], a planning problem under uncertainty was solved using a Viterbi algorithm. This was later extended in [21]. In these works, the number of time steps to reach the goal was fixed and the plans were not optimal in terms of expected reward. An important step toward surmounting these limitations was taken in [20, 19]. In these works, the standard discounted reward control problem was expressed in terms of an infinite mixture of MDPs. To make the problem tractable, the authors proposed using the estimated posterior horizon time to truncate the mixture.

Here, we make the observation that, in the probabilistic approach to stochastic control, the objective function can be written as the expectation of a positive function with respect to a trans-dimensional probability distribution, i.e. a probability distribution defined on a union of subspaces of different dimensions. By reinterpreting this function as an (artificial) marginal likelihood, it is easy to see that it can also be maximized using an EM-type algorithm in the spirit of [5]. However, the observation that we are dealing with a trans-dimensional distribution enables us to go beyond EM.
We believe it creates many opportunities for exploiting a large body of sophisticated inference algorithms in the decision-making context.

In this paper, we propose a full Bayesian policy search alternative to the EM algorithm. In this approach, we set a prior distribution on the set of policy parameters and derive an artificial posterior distribution which is proportional to the prior times the expected return. In the simpler context of myopic Bayesian experimental design, a similar method was developed in [11] and applied successfully to high-dimensional problems [12]. Our method can be interpreted as a trans-dimensional extension of [11]. We sample from the resulting artificial posterior distribution using a single trans-dimensional MCMC algorithm, which only involves a simple modification of the MCMC algorithm developed to implement EM. Although the Bayesian policy search approach can benefit from gradient information, it does not require gradients. Moreover, since the target is proportional to the expected reward, the simulation is guided automatically to areas of high reward. This property results in an immediate reduction in variance in policy search.

2 Model formulation

We consider the following class of discrete-time Markov decision processes (MDPs):
\[
\begin{aligned}
X_1 &\sim \mu(\cdot), \\
X_n \mid (X_{n-1} = x,\, A_{n-1} = a) &\sim f_a(\cdot \mid x), \\
R_n \mid (X_n = x,\, A_n = a) &\sim g_a(\cdot \mid x), \\
A_n \mid (X_n = x,\, \theta) &\sim \pi_\theta(\cdot \mid x),
\end{aligned}
\tag{1}
\]
where $n = 1, 2, \ldots$ is a discrete-time index, $\mu(\cdot)$ is the initial state distribution, $\{X_n\}$ is the $\mathcal{X}$-valued state process, $\{A_n\}$ is the $\mathcal{A}$-valued action process, $\{R_n\}$ is a positive real-valued reward process, $f_a$ denotes the transition density, $g_a$ the reward density and $\pi_\theta$ is a randomized policy with parameters $\theta$. If we have a deterministic policy $a = \pi_\theta(x)$, then $\pi_\theta(a \mid x) = \delta_{\pi_\theta(x)}(a)$. In this case, the transition model $f_a(\cdot \mid x)$ assumes the parametrization $f_\theta(\cdot \mid x)$. The reward model could also be parameterized as $g_\theta(\cdot \mid x)$.

We are interested here in maximizing, with respect to the parameters $\theta$ of the policy, the expected future return
\[
V_\mu(\theta) = \mathbb{E}\!\left[\, \sum_{n=1}^{\infty} \gamma^{n-1} R_n \right],
\]
where $0 < \gamma < 1$ is a discount factor and the expectation is with respect to the probabilistic model defined in (1). As shown in [20], it is possible to rewrite this objective of optimizing an infinite-horizon discounted-reward MDP (where the reward is received at each step) as one of optimizing an infinite mixture of finite-horizon MDPs (where the reward is received only at the last time step). In particular, we note that by introducing the trans-dimensional probability distribution on $\bigcup_k \{k\} \times \mathcal{X}^k \times \mathcal{A}^k \times \mathbb{R}^+$ given by
\[
p_\theta(k, x_{1:k}, a_{1:k}, r_k) = (1-\gamma)\, \gamma^{k-1}\, \mu(x_1)\, g_{a_k}(r_k \mid x_k) \prod_{n=2}^{k} f_{a_{n-1}}(x_n \mid x_{n-1}) \prod_{n=1}^{k} \pi_\theta(a_n \mid x_n),
\tag{2}
\]
we can easily rewrite $V_\mu(\theta)$ as an infinite mixture of finite-horizon MDPs, with the reward received at the last horizon step, namely at $k$. Specifically, we have
\[
V_\mu(\theta) = (1-\gamma)^{-1}\, \mathbb{E}_{p_\theta}[R_K] = (1-\gamma)^{-1} \sum_{k=1}^{\infty} \int r_k\, p_\theta(k, x_{1:k}, a_{1:k}, r_k)\, dx_{1:k}\, da_{1:k}\, dr_k
\tag{3}
\]
for a randomized policy. Similarly, for a deterministic policy, the representation (3) also holds for the trans-dimensional probability distribution defined on $\bigcup_k \{k\} \times \mathcal{X}^k \times \mathbb{R}^+$ given by
\[
p_\theta(k, x_{1:k}, r_k) = (1-\gamma)\, \gamma^{k-1}\, \mu(x_1)\, g_\theta(r_k \mid x_k) \prod_{n=2}^{k} f_\theta(x_n \mid x_{n-1}).
\tag{4}
\]
The representation (3) was also used in [6] to compute the value function through MCMC for a fixed $\theta$. In [20], this representation is exploited to maximize $V_\mu(\theta)$ using the EM algorithm which, applied to this problem, proceeds as follows at iteration $i$:
\[
\theta_i = \arg\max_\theta Q(\theta_{i-1}, \theta), \quad \text{where} \quad
Q(\theta_{i-1}, \theta) = \mathbb{E}_{\tilde{p}_{\theta_{i-1}}}\!\left[\log\big(R_K\, p_\theta(K, X_{1:K}, A_{1:K}, R_K)\big)\right]
\]
and
\[
\tilde{p}_\theta(k, x_{1:k}, a_{1:k}, r_k) = \frac{r_k\, p_\theta(k, x_{1:k}, a_{1:k}, r_k)}{\mathbb{E}_{p_\theta}[R_K]}.
\]
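To make the mixture representation (2)-(3) concrete, the following is a minimal sketch of how $V_\mu(\theta)$ can be estimated by drawing the horizon $K$ from the geometric distribution $(1-\gamma)\gamma^{k-1}$ and simulating the finite-horizon MDP up to $K$. The helper functions `sample_initial`, `sample_transition`, `sample_reward` and `policy` are hypothetical placeholders standing in for $\mu$, $f_a$, $g_a$ and $\pi_\theta$; they are not part of the paper.

```python
import numpy as np

def estimate_value(theta, gamma, n_samples,
                   sample_initial, sample_transition, sample_reward, policy):
    """Monte Carlo estimate of V_mu(theta) via the finite-horizon mixture (3).

    Placeholder helpers for the problem-specific densities:
      sample_initial()         ~ mu(.)
      policy(theta, x)         ~ pi_theta(. | x)
      sample_transition(x, a)  ~ f_a(. | x)
      sample_reward(x, a)      ~ g_a(. | x)
    """
    total = 0.0
    for _ in range(n_samples):
        # K ~ Geometric(1 - gamma), i.e. P(K = k) = (1 - gamma) * gamma**(k - 1).
        k = np.random.geometric(1.0 - gamma)
        x = sample_initial()
        a = policy(theta, x)
        for _ in range(k - 1):
            x = sample_transition(x, a)
            a = policy(theta, x)
        # The reward is only collected at the final step of the sampled horizon.
        total += sample_reward(x, a)
    # V_mu(theta) = (1 - gamma)^{-1} E_{p_theta}[R_K].
    return total / ((1.0 - gamma) * n_samples)
```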
Unlike [20], we are interested in problems with potentially nonlinear and non-Gaussian properties. In these situations, the Q function cannot be calculated exactly and we need to simulate from $\tilde{p}_\theta(k, x_{1:k}, a_{1:k}, r_k)$ in order to obtain Monte Carlo estimates of the Q function. The good news is that $\tilde{p}_\theta(k, x_{1:k}, a_{1:k}, r_k)$ is proportional to the reward. Consequently, the samples will be drawn where there is high utility. This is a wonderful feature in situations where the reward is a rare event, which is often the case in high-dimensional control settings.

At this stage, we could proceed as in [20] and derive forward-backward algorithms for the E step. We have in fact done this using the smoothing algorithms proposed in [9]. However, we will focus the discussion on a different approach based on trans-dimensional simulation. As shown in the experiments, the latter does considerably better. Finally, we remark that for a deterministic policy, we can introduce the trans-dimensional distribution
\[
\tilde{p}_\theta(k, x_{1:k}, r_k) = \frac{r_k\, p_\theta(k, x_{1:k}, r_k)}{\mathbb{E}_{p_\theta}[R_K]}.
\]
For ease of presentation only, we focus the discussion on deterministic policies and reward functions $g_\theta(r_n \mid x_n) = \delta_{r(x_n)}(r_n)$; the extension of our algorithms to the randomized case is straightforward.

3 Bayesian policy exploration

The EM algorithm is particularly sensitive to initialization and might get trapped in a severe local maximum of $V_\mu(\theta)$. Moreover, in the general state-space setting that we are considering, the particle smoothers in the E step can be very expensive computationally. To address these concerns, we propose an alternative full Bayesian approach. In the simpler context of experimental design, this approach was successfully developed in [11, 12]. The idea consists of introducing a vague prior distribution $p(\theta)$ on the parameters $\theta$ of the policy. We then define the new artificial probability distribution on $\Theta \times \bigcup_k \{k\} \times \mathcal{X}^k$ given by
\[
\bar{p}(\theta, k, x_{1:k}) \propto r(x_k)\, p_\theta(k, x_{1:k})\, p(\theta).
\]
By construction, this target distribution admits the marginal $\bar{p}(\theta) \propto V_\mu(\theta)\, p(\theta)$, and we can select an improper prior distribution $p(\theta) \propto 1$ if $\int V_\mu(\theta)\, d\theta < \infty$. If we could sample from $\bar{p}(\theta)$, then the generated samples $\theta^{(i)}$ would concentrate themselves in regions where $V_\mu(\theta)$ is large. We cannot sample from $\bar{p}(\theta)$ directly, but we can develop a trans-dimensional MCMC algorithm which asymptotically generates samples from $\bar{p}(\theta, k, x_{1:k})$, and hence samples from $\bar{p}(\theta)$.

Our algorithm proceeds as follows. Assume the current state of the Markov chain targeting $\bar{p}(\theta, k, x_{1:k})$ is $(\theta, k, x_{1:k})$. We first propose to update the components $(k, x_{1:k})$ conditional upon $\theta$ using a combination of birth, death and update moves based on the reversible jump MCMC algorithm [7, 8, 17]. We then propose to update $\theta$ conditional upon the current value of $(k, x_{1:k})$. This can be achieved using a simple Metropolis-Hastings algorithm or a more sophisticated dynamic Monte Carlo scheme. For example, if gradient information is available, one could adopt Langevin diffusions or the hybrid Monte Carlo algorithm [1]. The overall algorithm is depicted in Figure 1. The details of the reversible jump algorithm are presented in the following section.

1. Initialisation: set $(k^{(0)}, x_{1:k^{(0)}}^{(0)}, \theta^{(0)})$.
2. For $i = 0$ to $N - 1$:
   - Sample $u \sim U[0,1]$.
   - If $u \le b_k$, carry out a "birth" move: increase the horizon length of the MDP, say $k^{(i)} = k^{(i-1)} + 1$, and insert a new state.
   - Else if $u \le b_k + d_k$, carry out a "death" move: decrease the horizon length of the MDP, say $k^{(i)} = k^{(i-1)} - 1$, and remove an existing state.
   - Else let $k^{(i)} = k^{(i-1)}$ and generate samples $x_{1:k^{(i)}}^{(i)}$ of the MDP states.
   - Sample the policy parameters $\theta^{(i)}$ conditional on the samples $(x_{1:k^{(i)}}^{(i)}, k^{(i)})$.

Figure 1: Generic reversible jump MCMC for Bayesian policy learning.
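As a rough illustration of Figure 1, the sketch below alternates one reversible jump sweep on $(k, x_{1:k})$ with a random-walk Metropolis-Hastings update of $\theta$. The functions `rj_move` (detailed in the next section) and `log_target` (the unnormalized log of $\bar{p}$) are hypothetical placeholders, and the Gaussian proposal scale `step` is an assumption; the paper only requires some valid update of $\theta$ given $(k, x_{1:k})$.

```python
import numpy as np

def bayesian_policy_search(theta0, k0, x0, n_iters, rj_move, log_target, step=0.1):
    """Sketch of the sampler in Figure 1.

    rj_move(theta, k, x)    -- one birth/death/update sweep on (k, x_{1:k}) given theta
    log_target(theta, k, x) -- log of the unnormalized target r(x_k) p_theta(k, x_{1:k}) p(theta)
    """
    theta, k, x = theta0, k0, x0
    samples = []
    for _ in range(n_iters):
        # 1. Trans-dimensional update of the horizon and states, with theta held fixed.
        k, x = rj_move(theta, k, x)
        # 2. Random-walk Metropolis-Hastings update of the policy parameters,
        #    with (k, x_{1:k}) held fixed.
        theta_prop = theta + step * np.random.randn(*np.shape(theta))
        log_ratio = log_target(theta_prop, k, x) - log_target(theta, k, x)
        if np.log(np.random.rand()) < log_ratio:
            theta = theta_prop
        samples.append((theta, k))
    return samples
```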
We note that the samples of the states and horizon generated by this Markov process will also be distributed according to the trans-dimensional distribution $\tilde{p}_\theta(k, x_{1:k})$; this is indeed the output of the reversible jump algorithm for a given $\theta$. Hence, they can easily be adapted to generate a Monte Carlo estimate of $Q(\theta_{i-1}, \theta)$. This allows us to side-step the need for expensive smoothing algorithms in the E step. The trans-dimensional simulation approach has the advantage that the samples will concentrate themselves automatically in regions where $\tilde{p}_\theta(k)$ has high probability mass. Moreover, unlike in the EM framework, it is no longer necessary to truncate the time domain.

4 Trans-dimensional Markov chain Monte Carlo

We present a simple reversible jump method composed of two reversible moves (birth and death) and several update moves. Assume the current state of the Markov chain targeting $\tilde{p}_\theta(k, x_{1:k})$ is $(k, x_{1:k})$.

With probability $b_k$, we propose a birth move; that is, we sample a location uniformly in the interval $\{1, \ldots, k+1\}$, i.e. $J \sim U\{1, \ldots, k+1\}$, and propose the candidate $(k+1, x_{1:j-1}, x^*, x_{j:k})$ where $X^* \sim q(\cdot \mid x_{j-1:j})$. This candidate is accepted with probability $A_{\text{birth}} = \min\{1, \alpha_{\text{birth}}\}$, where for $j \in \{2, \ldots, k\}$
\[
\alpha_{\text{birth}} = \frac{\tilde{p}_\theta(k+1, x_{1:j-1}, x^*, x_{j:k})}{\tilde{p}_\theta(k, x_{1:k})}\, \frac{d_{k+1}}{b_k\, q(x^* \mid x_{j-1:j})}
= \frac{\gamma\, f_\theta(x^* \mid x_{j-1})\, f_\theta(x_j \mid x^*)}{f_\theta(x_j \mid x_{j-1})}\, \frac{d_{k+1}}{b_k\, q(x^* \mid x_{j-1:j})},
\]
for $j = 1$
\[
\alpha_{\text{birth}} = \frac{\gamma\, \mu(x^*)\, f_\theta(x_1 \mid x^*)}{\mu(x_1)}\, \frac{d_{k+1}}{b_k\, q(x^* \mid x_1)},
\]
and for $j = k+1$
\[
\alpha_{\text{birth}} = \frac{\gamma\, r(x^*)\, f_\theta(x^* \mid x_k)}{r(x_k)}\, \frac{d_{k+1}}{b_k\, q(x^* \mid x_k)}.
\]

With probability $d_k$, we propose a death move; that is, $J \sim U\{1, \ldots, k\}$ and we propose the candidate $(k-1, x_{1:j-1}, x_{j+1:k})$, which is accepted with probability $A_{\text{death}} = \min\{1, \alpha_{\text{death}}\}$, where for $j \in \{2, \ldots, k-1\}$
\[
\alpha_{\text{death}} = \frac{\tilde{p}_\theta(k-1, x_{1:j-1}, x_{j+1:k})}{\tilde{p}_\theta(k, x_{1:k})}\, \frac{b_{k-1}\, q(x_j \mid x_{j-1:j+1})}{d_k}
= \frac{f_\theta(x_{j+1} \mid x_{j-1})}{\gamma\, f_\theta(x_{j+1} \mid x_j)\, f_\theta(x_j \mid x_{j-1})}\, \frac{b_{k-1}\, q(x_j \mid x_{j-1:j+1})}{d_k},
\]
for $j = 1$
\[
\alpha_{\text{death}} = \frac{\mu(x_2)}{\gamma\, \mu(x_1)\, f_\theta(x_2 \mid x_1)}\, \frac{b_{k-1}\, q(x_1 \mid x_2)}{d_k},
\]
and for $j = k$
\[
\alpha_{\text{death}} = \frac{r(x_{k-1})}{\gamma\, r(x_k)\, f_\theta(x_k \mid x_{k-1})}\, \frac{b_{k-1}\, q(x_k \mid x_{k-1})}{d_k}.
\]

Finally, with probability $u_k = 1 - b_k - d_k$, we propose a standard (fixed-dimensional) move where we update all or a subset of the components $x_{1:k}$ using, say, Metropolis-Hastings or Gibbs moves. There are many design possibilities for these moves. In general, one should block some of the variables so as to improve the mixing time of the Markov chain. If one adopts a simple one-at-a-time Metropolis-Hastings scheme with proposal $q(x^* \mid x_{j-1:j+1})$ to update the $j$-th term, then the candidate is accepted with probability $A_{\text{update}} = \min\{1, \alpha_{\text{update}}\}$, where for $j \in \{2, \ldots, k-1\}$
\[
\alpha_{\text{update}} = \frac{\tilde{p}_\theta(k, x_{1:j-1}, x^*, x_{j+1:k})\, q(x_j \mid x_{j-1}, x^*, x_{j+1})}{\tilde{p}_\theta(k, x_{1:k})\, q(x^* \mid x_{j-1:j+1})}
= \frac{f_\theta(x^* \mid x_{j-1})\, f_\theta(x_{j+1} \mid x^*)\, q(x_j \mid x_{j-1}, x^*, x_{j+1})}{f_\theta(x_j \mid x_{j-1})\, f_\theta(x_{j+1} \mid x_j)\, q(x^* \mid x_{j-1:j+1})},
\]
for $j = 1$
\[
\alpha_{\text{update}} = \frac{\mu(x^*)\, f_\theta(x_2 \mid x^*)\, q(x_1 \mid x^*, x_2)}{\mu(x_1)\, f_\theta(x_2 \mid x_1)\, q(x^* \mid x_{1:2})},
\]
and for $j = k$
\[
\alpha_{\text{update}} = \frac{r(x^*)\, f_\theta(x^* \mid x_{k-1})\, q(x_k \mid x^*, x_{k-1})}{r(x_k)\, f_\theta(x_k \mid x_{k-1})\, q(x^* \mid x_{k-1:k})}.
\]
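To illustrate how these ratios are used in practice, here is a minimal sketch of the birth move under the assumptions above (deterministic policy, reward $r(x)$ collected at the final state, discount factor $\gamma$). The functions `trans_pdf`, `init_pdf`, `reward`, `q_sample` and `q_pdf` are hypothetical placeholders for $f_\theta$, $\mu$, $r$ and the proposal $q$; the move probabilities `b_k` and `d_k1` are those of proposing a birth at dimension $k$ and a death at dimension $k+1$.

```python
import numpy as np

def birth_move(x, gamma, b_k, d_k1, trans_pdf, init_pdf, reward, q_sample, q_pdf):
    """One birth move on the state sequence x = [x_1, ..., x_k] (a Python list).

    trans_pdf(xn, xp) -- f_theta(xn | xp);  init_pdf(x1) -- mu(x1);  reward(xk) -- r(xk)
    q_sample(context), q_pdf(xstar, context) -- proposal density q given neighbouring states
    Returns the (possibly unchanged) state sequence.
    """
    k = len(x)
    j = np.random.randint(1, k + 2)       # insertion location J ~ U{1, ..., k+1}
    context = x[max(j - 2, 0):j]          # neighbouring states used by the proposal
    xstar = q_sample(context)

    if j == 1:                            # insert before the first state
        ratio = gamma * init_pdf(xstar) * trans_pdf(x[0], xstar) / init_pdf(x[0])
    elif j == k + 1:                      # insert after the last state
        ratio = gamma * reward(xstar) * trans_pdf(xstar, x[-1]) / reward(x[-1])
    else:                                 # insert between x_{j-1} and x_j
        ratio = (gamma * trans_pdf(xstar, x[j - 2]) * trans_pdf(x[j - 1], xstar)
                 / trans_pdf(x[j - 1], x[j - 2]))
    ratio *= d_k1 / (b_k * q_pdf(xstar, context))

    if np.random.rand() < min(1.0, ratio):   # accept with probability A_birth
        return x[:j - 1] + [xstar] + x[j - 1:]
    return x
```

The death and update moves follow the same pattern, with their acceptance ratios taken from the expressions above.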
Under weak assumptions on the model, the Markov chain $(\theta^{(i)}, K^{(i)}, X_{1:K^{(i)}}^{(i)})$ generated by this transition kernel will be irreducible and aperiodic, and hence will asymptotically generate samples from the target distribution $\bar{p}(\theta, k, x_{1:k})$.

We emphasize that in many applications the structure of the distributions $\tilde{p}_\theta(x_{1:k} \mid k)$ will not vary significantly with $k$, and we often have $\tilde{p}_\theta(x_{1:k} \mid k) \approx \tilde{p}_\theta(x_{1:k} \mid k+1)$. Hence the probability of having the reversible moves accepted will be reasonable. Standard Bayesian applications of reversible jump MCMC usually do not enjoy this property, which makes it more difficult to design fast mixing algorithms. In this respect, our problem is easier.

5 Experiments

It should be noted from the outset that the results presented in this paper are preliminary, and serve mainly as an illustration of the Monte Carlo algorithms presented earlier. With that note aside, even these simple examples will give us some intuition about the algorithms' performance and behavior. We are also very optimistic as to the possible applications of the analytic expressions for linear Gaussian models, but space has not allowed us to present simulations for this class of models here.

We consider state- and action-spaces $\mathcal{X} = \mathcal{A} = \mathbb{R}^2$ such that each state $x \in \mathcal{X}$ is a 2d position and each action $a \in \mathcal{A}$ is a vector corresponding to a change in position. A new state at time $n$ is given by $X_n = X_{n-1} + A_{n-1} + \epsilon_{n-1}$, where $\epsilon_{n-1}$ denotes zero-mean Gaussian noise. Finally, we let $\mu$ be a normal distribution about the origin, and consider a reward (as in [20]) given by an unnormalized Gaussian about some point $m$, i.e. $r(x) = \exp(-\tfrac{1}{2}(x - m)^T \Sigma^{-1} (x - m))$. An illustration of this space can be seen in Figure 2, where $m = (1, 1)$.

Figure 2: This figure shows an illustration of the 2d state-space described in Section 5. Ten sample points are shown distributed according to $\mu$, the initial distribution, and the contour plot corresponds to the reward function $r$. The red line denotes the policy parameterized by some angle $\theta$, while a path sampled from this policy is drawn in blue.

Figure 3: The left plot shows estimates of the policy parameter $\theta$ (in radians) as a function of the CPU time (in seconds) used to calculate that value. This data is shown for the three discussed Monte Carlo algorithms (two-filter EM, Monte Carlo EM and Bayesian policy search) as applied to a synthetic example and has been averaged over five runs; error bars are shown for the SMC-based EM algorithm. The right plot shows a "zoomed" version of this plot in order to see the reversible-jump EM algorithm and the fully Bayesian algorithm in more detail. In both plots the red line denotes the known optimal policy parameter of $\pi/4$.

For these experiments we chose a simple, stochastic policy parameterized by an angle $\theta \in [0, 2\pi]$. Under this policy, an action $A_n = (w + \epsilon) \cdot (\cos(\theta + \nu), \sin(\theta + \nu))$ is taken, where $\epsilon$ and $\nu$ are normally distributed random variables and $w$ is some (small) constant step-length. Intuitively, this policy corresponds to choosing a direction in which the agent will walk. While unrealistic from a real-world perspective, this allows us to easily evaluate and plot the convergence of our algorithm. For a state-space with initial distribution and reward function defined as in Figure 2, the optimal policy corresponds to $\theta = \pi/4$.
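For concreteness, a minimal sketch of this synthetic environment follows. The noise scales `sigma_x`, `sigma_eps`, `sigma_nu`, the step length `w`, the initial-distribution scale and the reward covariance are assumptions, as the paper does not report the exact values used.

```python
import numpy as np

# Hypothetical parameter choices; the paper does not report the exact values used.
m = np.array([1.0, 1.0])         # reward centre
Sigma_inv = np.eye(2)            # inverse covariance of the reward Gaussian
sigma_x = 0.1                    # transition noise scale
sigma_eps, sigma_nu = 0.05, 0.1  # noise on step length and heading
w = 0.1                          # nominal step length

def sample_initial():
    """X_1 ~ mu(.), a normal distribution about the origin."""
    return 0.1 * np.random.randn(2)

def policy(theta, x):
    """A_n = (w + eps) * (cos(theta + nu), sin(theta + nu))."""
    eps = sigma_eps * np.random.randn()
    nu = sigma_nu * np.random.randn()
    return (w + eps) * np.array([np.cos(theta + nu), np.sin(theta + nu)])

def sample_transition(x, a):
    """X_n = X_{n-1} + A_{n-1} + zero-mean Gaussian noise."""
    return x + a + sigma_x * np.random.randn(2)

def reward(x):
    """Unnormalized Gaussian reward about the point m."""
    d = x - m
    return np.exp(-0.5 * d @ Sigma_inv @ d)
```

These functions can be plugged into the earlier `estimate_value` sketch, e.g. `estimate_value(np.pi / 4, 0.95, 10000, sample_initial, sample_transition, lambda x, a: reward(x), policy)`.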
We first implemented a simple SMC-based extension of the EM algorithm described in [20], wherein a particle filter was used for the forward/backward filters. The plots in Figure 3 compare the SMC-based and trans-dimensional approaches on this synthetic example. Here the inferred value of $\theta$ is shown against CPU time, averaged over 5 runs. The first thing of note is the terrible performance of the SMC-based algorithm; in fact, we had to make the reward broader and closer to the initial position in order to ensure that the algorithm converges in a reasonable amount of time. This comes as no surprise considering the $O(N^2 k_{\max}^2)$ time complexity necessary for computing the importance weights. While there do exist methods [9] for reducing this complexity to $O(N \log N\, k_{\max}^2)$, the discrepancy between this and the reversible jump MCMC method suggests that the MCMC approach may be better adapted to this class of problems. In the finite/discrete case it is also possible, as shown by Toussaint et al (2006), to reduce the $k_{\max}^2$ term to $k_{\max}$ by calculating updates only using messages from the backwards recursion. The SMC method might further be improved by better choices for the artificial distribution used in the backwards filter. In this problem we used a vague Gaussian centered on the relevant state-space. It is however possible that any added benefit from a more informative distribution is counterbalanced by the time required to calculate it, for example by simulating particles forward in order to find the invariant distribution.

Also shown in Figure 3 is the performance of a Monte Carlo EM algorithm using reversible jump MCMC in the E-step. Both this and the fully Bayesian approach perform comparably, although the fully Bayesian approach shows less in-run variance, as well as less variance between runs. The EM algorithm was also more sensitive, and we were forced to increase the number of samples $N$ used by the E-step as the algorithm progressed, as well as to control the learning rate with a smoothing parameter. For higher dimensional and/or larger models it is not inconceivable that this could have an adverse effect on the algorithm's performance.

Finally, we also compared the proposed Bayesian policy exploration method to the PEGASUS [14] approach using a local search method. We initially tried a policy-gradient approach, but because of the very highly-peaked rewards the gradients become very poorly scaled and would have required more tuning. As shown in Figure 4, the Bayesian strategy is more efficient in this rare event setting.

Figure 4: Convergence of PEGASUS and our Bayesian policy search algorithm when started from $\theta = 0$ and converging to the optimum of $\theta = \pi/4$. The plots are averaged over 10 runs. For our algorithm we plot samples taken directly from the MCMC algorithm itself, and for both algorithms lines denoting one standard deviation are shown. Both algorithms' performance is plotted against the number of samples taken from the transition model.
As the state space increases, we expect this difference to become even more pronounced.

6 Discussion

We believe that formulating stochastic control as a trans-dimensional inference problem is fruitful. This formulation relies on minimal assumptions and allows us to apply modern inference algorithms to solve control problems. We have focused here on Monte Carlo methods and have presented, to the best of our knowledge, the first application of reversible jump MCMC to policy search. Our results, on an illustrative example, showed that this trans-dimensional MCMC algorithm is more effective than standard policy search methods and alternative Monte Carlo methods relying on particle filters. However, this methodology remains to be tested on high dimensional problems. For such scenarios, we expect that it will be necessary to develop more efficient MCMC strategies to explore the policy space efficiently.

References

[1] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5-43, 2003.
[2] H. Attias. Planning by probabilistic inference. In Uncertainty in Artificial Intelligence, 2003.
[3] J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319-350, 2001.
[4] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995.
[5] P. Dayan and G. E. Hinton. Using EM for reinforcement learning. Neural Computation, 9:271-278, 1997.
[6] A. Doucet and V. B. Tadic. On solving integral equations using Markov chain Monte Carlo methods. Technical Report CUED-F-INFENG 444, Cambridge University Engineering Department, 2004.
[7] P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711-732, 1995.
[8] P. J. Green. Trans-dimensional Markov chain Monte Carlo. In Highly Structured Stochastic Systems, 2003.
[9] M. Klaas, M. Briers, N. de Freitas, A. Doucet, and S. Maskell. Fast particle smoothing: If I had a million particles. In International Conference on Machine Learning, 2006.
[10] G. Lawrence, N. Cowan, and S. Russell. Efficient gradient estimation for motor control learning. In Uncertainty in Artificial Intelligence, pages 354-36, 2003.
[11] P. Müller. Simulation based optimal design. Bayesian Statistics, 6, 1999.
[12] P. Müller, B. Sansó, and M. De Iorio. Optimal Bayesian design by inhomogeneous Markov chain simulation. J. American Stat. Assoc., 99:788-798, 2004.
[13] A. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang. Inverted autonomous helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics, 2004.
[14] A. Y. Ng and M. I. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Uncertainty in Artificial Intelligence, 2000.
[15] J. Peters and S. Schaal. Policy gradient methods for robotics. In IEEE International Conference on Intelligent Robots and Systems, 2006.
[16] M. Porta, N. Vlassis, M. T. J. Spaan, and P. Poupart. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research, 7:2329-2367, 2006.
[17] S. Richardson and P. J. Green. On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society B, 59(4):731-792, 1997.
[18] S. Thrun. Monte Carlo POMDPs. In S. Solla, T. Leen, and K.-R. Müller, editors, Neural Information Processing Systems, pages 1064-1070. MIT Press, 2000.
[19] M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic inference for solving (PO)MDPs. Technical Report EDI-INF-RR-0934, University of Edinburgh, School of Informatics, 2006.
[20] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. In International Conference on Machine Learning, 2006.
[21] D. Verma and R. P. N. Rao. Planning and acting in uncertain environments using probabilistic inference. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2006.