Approximate Inference for Planning in Stochastic Relational Worlds

Tobias Lang  lang@cs.tu-berlin.de
Marc Toussaint  mtoussai@cs.tu-berlin.de
TU Berlin, Machine Learning Group, Franklinstrasse 28/29, 10587 Berlin, Germany

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

Relational world models that can be learned from experience in stochastic domains have received significant attention recently. However, efficient planning using these models remains a major issue. We propose to convert learned noisy probabilistic relational rules into a structured dynamic Bayesian network representation. Predicting the effects of action sequences using approximate inference allows for planning in complex worlds. We evaluate the effectiveness of our approach for online planning in a 3D simulated blocksworld with an articulated manipulator and realistic physics. Empirical results show that our method can solve problems where existing methods fail.

1. Introduction

Building systems that act in complex environments is a central goal of Artificial Intelligence. Such systems need to be able to reason about their world in order to derive plans of actions and achieve their goals. Acting in a complex environment requires knowledge representations which can account for indeterministic action effects and cope with noise.
Furthermore, they have to generalize to unencountered situations and objects of similar types. The field of statistical relational learning investigates the combination of relational representations with probabilistic frameworks and Machine Learning techniques (Getoor & Taskar, 2007). Due to the general nature of the proposed representations, however, it is often unclear how to learn and represent complex action dynamics in an efficient way.

Action decision making is often cast in a reinforcement learning (RL) framework. Relational RL investigates the usage of compact relational representations for state and action spaces (van Otterlo, 2009). Most work has been on model-free approaches which compute value functions and policies directly from experiences, resulting in reactive behaviors for fixed goals (also called habit-based decision making). By contrast, model-based approaches use the experiences to learn world dynamics and reward models. This enables goal-directed or purposive decision making, which allows an agent with changing goals to plan for the goal at hand by internal simulation (Botvinick & An, 2009).

Pasula et al. (2007) have recently introduced an appealing action model representation based on noisy indeterministic deictic (NID) rules which offer several advantages: (i) a relational representation; (ii) indeterministic action outcomes in order to account for stochastic domains; (iii) deictic references for actions to reduce the action space; (iv) noise outcomes to avoid explicit modeling of rare and overly complex outcomes; and (v) the existence of an effective learning algorithm. They showed that NID rules are able to capture the complex dynamics of a realistically simulated noisy version of the blocks world based on a three-dimensional rigid-body dynamics simulator (ODE). A severe limitation, however, is that so far there exists no efficient method to plan with NID rules.

Promising solution techniques for known Markov decision processes (MDPs) over relational domains have been proposed: for example, Kersting et al. (2004) present an exact value iteration algorithm, and Sanner and Boutilier (2007) propose approximate solution techniques for factored first-order MDPs based on linear value approximation. Both require complete models, however, which NID rules do not provide due to their noise outcomes, and it is not clear how to make them deal with deictic references in the spirit of NID rules. Croonenborghs et al. (2007) learn partial world models online in the form of sets of relational probability trees for individual state properties, which they exploit immediately by means of Q-learning with look-ahead trees. They do not model deictic references explicitly, so their action space is much larger, which heavily affects planning time. We will see later that planning with look-ahead trees is inefficient.

In this paper, we introduce the PRADA algorithm (probabilistic relational action-sampling in DBNs planning algorithm), a model-based RL approach for planning based on NID rules and probabilistic inference, making three contributions: (i) Following the idea of framing planning as a probabilistic inference problem (Toussaint & Storkey, 2006), we convert NID rules into a dynamic Bayesian network (DBN) representation. (ii) We derive an approximate inference method to cope with the state complexity of a time-slice of the resulting network. This enables us to efficiently predict the effects of action sequences. While Sanghai et al. (2005) propose particle filtering algorithms for approximate inference in DBNs over relational domains, we can exploit different structural assumptions and develop an approximate inference technique based on the factored frontier algorithm (Murphy & Weiss, 2001). (iii) To enable efficient sampling-based planning, we propose a sampling distribution for actions which takes predicted state distributions into account. We evaluate our approach in a simulated blocksworld with realistic physics, similar to the one of Pasula et al., but with an articulated humanoid manipulating the blocks, showing the effectiveness of PRADA for fast online planning.

The remainder of this paper is organized as follows. The next section presents the theoretical background of our work, namely NID rules and the planning algorithm which has been used with NID rules so far. We introduce our DBN representation for NID rules in Sec. 3 and our approximate inference method in Sec. 4. In Sec. 5, we present our planning procedure. Then, we show our empirical results before we conclude.

2. Background

2.1. State and action representation

We use a relational representation to describe world states and rules. In a concrete instantiation (situation) we assume there is a finite set of objects O present, and we can form logical formulae of predicates (relations) and functions over these objects. Generally, a formula may use logical variables which represent any object, independent of the concrete instantiation of objects O. We speak of grounding a formula if we apply a substitution σ that maps all of the variables appearing in it to objects in O. We have a finite set P of predicates and a finite set F of functions. The state of the world is fully described by all ground predicates and functions. Actions are represented by positive predicates A. NID rules incorporate knowledge about state transition dynamics and are described in the following.
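To make this representation concrete, the following minimal Python sketch shows one way a ground relational state and a ground action could be stored. The data structures and object names are our own illustration (using the blocksworld vocabulary of Sec. 6), not the authors' implementation.

```python
# Ground relational state: truth values for all ground predicates and
# integer values for all ground functions (illustration only).
objects = ["a", "b", "t"]          # two blocks and the table

state = {
    # ground predicates -> bool
    ("on", ("a", "b")): True,
    ("on", ("b", "t")): True,
    ("block", ("a",)): True,
    ("block", ("b",)): True,
    ("table", ("t",)): True,
    ("inhand", ("a",)): False,
    # ground functions -> integer value (functions range over finite integer sets)
    ("size", ("a",)): 1,
    ("size", ("b",)): 2,
}

# Ground actions are positive predicates as well, e.g. grab(a) or puton(b).
action = ("grab", ("a",))
```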
2.2. Noisy indeterministic deictic (NID) rules

We want to learn a relational model of a stochastic world and use it for planning. Pasula et al. (2007) proposed NID rules for representing such a model and an algorithm to learn them from data. Table 1 shows an exemplary NID rule for the realistic blocksworld. A NID rule r is given as

  a_r(X) : φ_r(X) → { p_{r,1} : Ω_{r,1}(X),
                      ...,
                      p_{r,m_r} : Ω_{r,m_r}(X),
                      p_{r,0} : Ω_{r,0} }    (1)

where X is a set of logic variables in the rule (which represent a (sub-)set of abstract objects). In the rules which define our world models, all formula arguments are logic variables. The rule r consists of preconditions, namely that action a_r is applied on X and that the state context φ_r is fulfilled, and m_r + 1 different outcomes with associated probabilities p_{r,i} > 0, Σ_{i=0}^{m_r} p_{r,i} = 1. Each outcome Ω_{r,i}(X) describes which predicates and functions change when the rule is applied. The context φ_r(X) and the outcomes Ω_{r,i}(X) are conjunctions of literals constructed from the predicates in P as well as equality statements comparing functions from F to constant values. The so-called noise outcome Ω_{r,0} subsumes all possible action outcomes which are not explicitly specified by one of the other Ω_{r,i}. The arguments of the action a(X_a) may be a proper subset X_a ⊂ X of the variables X of the rule. The remaining variables are called deictic references DR = X \ X_a and denote objects relative to the agent or action being performed.

Table 1. Example NID rule for a realistic blocksworld, which models trying to grab a block X. The block Y is implicitly defined as the one below X (deictic referencing). X ends up in the robot's hand with high probability, but might also fall on the table. With a small probability something unpredictable happens.

  grab(X) :  on(X, Y), block(Y), table(Z)
    →  0.7 :  inhand(X), ¬on(X, Y)
       0.2 :  on(X, Z), ¬on(X, Y)
       0.1 :  noise

As above, let σ denote a substitution that maps variables to constant objects, σ : X → O. Applying σ to an abstract rule r(X) yields a ground rule r(σ(X)). We say a ground rule r covers a state s and a ground action a if s ⊨ φ_r and a = a_r. Let Γ be a set of ground NID rules. We define Γ(a) := {r ∈ Γ | a_r = a} to be the set of rules that provide predictions for action a. If r is the only rule in Γ(a) to cover a and state s, we call it the unique covering rule for a in s. For more technical details, we refer the reader to Pasula et al.

If a pair (a, s) has a unique covering rule r, we calculate P(s' | a, s) using

  P(s' | r, s) = Σ_{i=1}^{m_r} p_{r,i} P(s' | Ω_{r,i}, s) + p_{r,0} P(s' | Ω_{r,0}, s),    (2)

where, for i > 0, P(s' | Ω_{r,i}, s) is a deterministic distribution that is one for the unique state constructed from s by taking the changes of Ω_{r,i} into account. The distribution given the noise outcome, P(s' | Ω_{r,0}, s), is unknown and needs to be estimated, e.g. by assigning low probability to many successor states.
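As an illustration of this machinery, here is a small Python sketch (ours, not the authors' code) of a ground NID rule, using a grounding of the Table 1 rule, together with the successor-state distribution of Eq. (2). The noise outcome is kept as a symbolic placeholder, since its distribution has to be approximated separately (cf. Sec. 4).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Literal = Tuple[str, Tuple[str, ...]]        # e.g. ("on", ("a", "b"))
State = Dict[Literal, bool]                  # ground predicates -> truth value

@dataclass
class GroundNIDRule:
    action: Literal                                     # a_r
    context: Dict[Literal, bool]                        # phi_r: required truth values
    outcomes: List[Tuple[float, Dict[Literal, bool]]]   # (p_{r,i}, changes of Omega_{r,i}), i >= 1
    p_noise: float                                      # p_{r,0}

    def covers(self, state: State, action: Literal) -> bool:
        return action == self.action and all(
            state.get(lit, False) == val for lit, val in self.context.items())

def successor_distribution(rule: GroundNIDRule, state: State):
    """Enumerate P(s' | r, s) as in Eq. (2); the noise outcome stays symbolic."""
    dist = []
    for p, changes in rule.outcomes:
        succ = dict(state)
        succ.update(changes)                 # deterministic effect of Omega_{r,i}
        dist.append((p, succ))
    dist.append((rule.p_noise, "NOISE"))     # placeholder, needs a separate estimate
    return dist

# Ground instance of the Table 1 rule for block a lying on block b, with table t:
grab_a = GroundNIDRule(
    action=("grab", ("a",)),
    context={("on", ("a", "b")): True, ("block", ("b",)): True, ("table", ("t",)): True},
    outcomes=[(0.7, {("inhand", ("a",)): True, ("on", ("a", "b")): False}),
              (0.2, {("on", ("a", "t")): True, ("on", ("a", "b")): False})],
    p_noise=0.1)
```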
If a pair (a, s) does not have a unique covering rule (e.g., two rules cover (a, s) and provide conflicting predictions), one can predict the effects of a by means of a noisy default rule r_ν which explains all effects as noise: P(s' | r_ν, s) = P(s' | Ω_{r_ν,0}, s). This is not very meaningful and thus disadvantageous for planning. For this reason, the concept of unique covering rules is crucial in planning with NID rules. NID rules can be learned from experience triples (s, a, s') by means of a batch algorithm that trades off the likelihood of these triples with the complexity of the learned rule-set.

2.3. Sparse sampling trees for MDP planning

To plan with NID rules, one can treat the domain described by the relational logic vocabulary as a Markov decision process (MDP). NID rules encapsulate the transition probabilities in a compact way exploiting the relational structure, as described in Eq. (2). As the MDP model created from NID rules is incomplete due to the noise outcome and exact solution methods are hard to come by in relational domains, Pasula et al. use the sparse sampling tree (SST) algorithm (Kearns et al., 2002) for MDP planning. Given a planning horizon d and a branching factor b, SST builds a look-ahead tree of states starting with the current state. In each tree node (representing a state), (i) SST takes all possible actions into account, and (ii) for each action it takes b samples from the successor state distribution using a transition model (in our case the NID rules). Values are computed for each node of the tree using the Bellman equation, and finally the action with the highest value is chosen. SST is independent of the number of states, but exponential in the time horizon taken into account.

When sampling the noise outcome while planning with SST, Pasula et al. approximate the obtained value by assuming to stay in the same state and discounting the estimated value. (We refer to this adaptation when we speak of SST planning in the remainder of the paper.) If an action does not have a unique covering rule and is modelled by the noisy default rule r_ν, it is thus always better to perform a doNothing action instead. Hence, in SST planning one can discard all actions for a given state which do not have unique covering rules. While SST is near-optimal, in practice it is only feasible for very small b and d. Let the number of actions be a. Then the number of nodes at time horizon d is (ba)^d. (This number can be reduced if the same outcome of a rule is sampled multiple times.) While smaller choices of b lead to faster planning, they result in a significant accuracy loss in realistic domains. As Kearns et al. note, SST is only useful if no special structure that permits a compact representation is available. Therefore, we now introduce an alternative planning approach that exploits the structure of NID rules.
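For concreteness, the following sketch outlines the sparse sampling recursion described above. It is our own, generic rendering of the Kearns et al. scheme: transition_sample and reward stand in for the NID-rule based simulator and the goal reward, and the adaptations by Pasula et al. (noise-outcome handling, discarding actions without a unique covering rule) are assumed to live inside those callbacks.

```python
def sst_value(state, depth, b, actions, transition_sample, reward, gamma=0.95):
    """Sparse-sampling estimate of the value of `state`.

    actions(state)                    -> iterable of applicable actions
    transition_sample(state, action)  -> one sampled successor state
    reward(state)                     -> immediate reward
    The tree has (b * |actions|)^depth leaves, hence the exponential cost in depth.
    """
    if depth == 0:
        return reward(state)
    best = float("-inf")
    for a in actions(state):
        total = 0.0
        for _ in range(b):                 # b successor samples per action
            succ = transition_sample(state, a)
            total += sst_value(succ, depth - 1, b, actions, transition_sample, reward, gamma)
        best = max(best, reward(state) + gamma * total / b)
    return best

def sst_plan(state, depth, b, actions, transition_sample, reward, gamma=0.95):
    """Return the action with the highest sparse-sampling Q-value estimate."""
    def q_estimate(a):
        samples = [transition_sample(state, a) for _ in range(b)]
        return reward(state) + gamma * sum(
            sst_value(s, depth - 1, b, actions, transition_sample, reward, gamma)
            for s in samples) / b
    return max(actions(state), key=q_estimate)
```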
3. Graphical models for NID rules

Decision theoretic problems where agents need to choose appropriate actions can be represented by means of Markov chains and DBNs. In the following, we discuss how to convert NID rules to DBNs, which the PRADA algorithm will use to plan by probabilistic inference as described in Sec. 5. We denote random variables by upper case letters (e.g. S), their values by the corresponding lower case letters (e.g., s ∈ dom(S)), variable vectors by bold upper case letters (e.g. S = (S_1, S_2, S_3)) and value vectors by bold lower case letters (e.g. s = (s_1, s_2, s_3)). We also use column notation, e.g. s_{2:4} = (s_2, s_3, s_4).

A naive way to convert NID rules to DBNs is shown in Fig. 1(a). States are represented by a vector S = (S_1, ..., S_N) where for each ground predicate in P there is a binary S_i and for each ground function in F there is an S_i with a range according to the represented function. (In this paper, we restrict ourselves to functions which range over a finite subset of the integers.) Actions are represented by an integer variable A which indicates the action out of a vector of ground action predicates in A. The reward gained in a state is represented by U and may depend only on a subset of the state variables. It is possible to express arbitrary reward expectations P(U | S) with binary U (Dayan & Hinton, 1997). Using NID rules to define the transition dynamics leads to very complex dependencies, as one needs to account for the uniqueness of covering rules. This may result in conditional probability functions comprising huge subsets of S, which makes this representation unfeasible for planning.

Figure 1. (a) Naive DBN; (b) DBN exploiting the NID factorization

Therefore, we exploit the structure of NID rules to model a state transition with the graphical model shown in Fig. 1(b), representing the joint distribution

  P(u', s', o, r, φ | a, s) = P(u' | s') P(s' | o, r, s) P(o | r) P(r | a, φ) P(φ | s),    (3)

which we will explain in detail in the following. Assume we are given a set of fully abstract NID rules. We compute all groundings of these rules w.r.t. the objects of the domain and get the set Γ of K different ground NID rules. In addition to S, S', A, U and U' as above, we use a binary random variable φ_i for each rule to model the event that its context holds, which is the case if all required literals hold:

  P(φ | s) = Π_{i=1}^{K} P(φ_i | s_{ι(φ_i)}) = Π_{i=1}^{K} Π_{j ∈ ι(φ_i)} I(S_j = s^{r_i}_j).    (4)

We use Π_i ψ_i to express a logical conjunction ψ_1 ∧ ... ∧ ψ_n. The function ι(φ) yields the set of indices of the state variables in s on which φ depends. s^{r_i} denotes the configuration of the state variables corresponding to the literals in the context of r_i. We use an integer-valued variable R ranging over K + 1 possible values to identify the rule which predicts the effects of the action. This is the unique covering rule for the current state-action pair, i.e., the only rule modeling action a whose context holds:

  P(R = r | a, φ) = I(r ∈ Γ(a)) I(φ_r = 1) Π_{r' ∈ Γ(a)\{r}} I(φ_{r'} = 0).    (5)

If no unique covering rule exists, we predict no changes, as indicated by the special value R = 0 (assuming not to execute the action, similarly as SST does):

  P(R = 0 | a, φ) = Π_{r ∈ Γ(a)} ¬[ I(φ_r = 1) Π_{r' ∈ Γ(a)\{r}} I(φ_{r'} = 0) ].    (6)

The integer-valued variable O represents the outcome of the action as predicted by the rule. It ranges over M possible values, where M is the maximum number of outcomes over all rules in Γ. (To ensure sound semantics, we introduce empty dummy outcomes with zero probability for those rules whose number of outcomes is less than M.) The probability of an outcome is defined as in the corresponding rule:

  P(O = o | r) = p_{r,o}.    (7)

We define the probability of the successor state as

  P(s' | o, r, s) = Π_i P(s'_i | o, r, s_i),    (8)

which is one for the unique state that is constructed from s by taking the changes according to Ω_{r,o} into account. The probability of the reward is given by

  P(U' = 1 | s') = Π_{j ∈ ι(U')} I(S'_j = u*_j).    (9)

The function ι(U') yields the set of indices of the state variables in s' on which U' depends. The configuration of these variables that corresponds to our planning goal is denoted by u*. We refrain from specifying a prior P(s^0), since the initial state s^0 will always be given. Our choice for the distribution P(a) used for sampling actions will be described in Sec. 5.

Exact inference is intractable in our graphical model. When constructing a junction tree, we will get cliques that comprise whole Markov slices (all variables representing the state at a certain time-step). Also, approximate inference by means of loopy belief propagation is unfeasible due to the deterministic dependencies in small cycles, which inhibit convergence. Therefore, we propose a different approximate inference scheme, which we present next.
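As a sanity check of the factorization in Eq. (3), the sketch below (our illustration, reusing the GroundNIDRule class from the example in Sec. 2.2) pushes a single definite state through one time-slice of the DBN in Fig. 1(b): it evaluates the context variables φ, selects R as the unique covering rule (or R = 0), samples an outcome O, and applies its changes. Treating the noise outcome as 'no change' is a simplification of ours.

```python
import random

def simulate_time_slice(state, action, ground_rules, rng=random.Random(0)):
    """One pass through the DBN of Fig. 1(b) for a definite state s and action a.

    state is a dict of ground literals -> bool, ground_rules a list of
    GroundNIDRule objects (see the sketch in Sec. 2.2).
    """
    # phi_i: context indicator of every ground rule (Eq. 4)
    phi = [all(state.get(lit, False) == val for lit, val in r.context.items())
           for r in ground_rules]
    # R: the unique covering rule among the rules for this action (Eqs. 5-6)
    covering = [i for i, r in enumerate(ground_rules)
                if r.action == action and phi[i]]
    if len(covering) != 1:
        return dict(state)                    # R = 0: predict no change
    rule = ground_rules[covering[0]]
    # O: sample an outcome index according to p_{r,o} (Eq. 7)
    probs = [p for p, _ in rule.outcomes] + [rule.p_noise]
    o = rng.choices(range(len(probs)), weights=probs)[0]
    succ = dict(state)
    if o < len(rule.outcomes):
        succ.update(rule.outcomes[o][1])      # Eq. (8): apply the deterministic changes
    return succ                               # noise outcome: left unchanged here
```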
4. Approximate inference

We follow the idea of the factored frontier (FF) algorithm (Murphy & Weiss, 2001) and approximate the belief with a product of marginals:

  P(s^t | a^{0:t-1}) ≈ Π_i P(s^t_i | a^{0:t-1}).    (10)

We define

  α(s^t_i) := P(s^t_i | a^{0:t-1})    (11)

and

  α(s^t) := Π_{i=1}^{N} α(s^t_i)    (12)

and derive the following FF filter for the model in Fig. 1(b). We are interested in inferring the state distribution at time t+1 given an action sequence a^{0:t}:

  α(s^{t+1}_i) = P(s^{t+1}_i | a^{0:t})    (13)
               = Σ_{r^t} P(s^{t+1}_i | r^t, a^{0:t-1}) P(r^t | a^{0:t}).    (14)

We compute the first term in (14) as

  P(s^{t+1}_i | r^t, a^{0:t-1}) = Σ_{s^t_i} P(s^{t+1}_i | r^t, s^t_i) P(s^t_i | r^t, a^{0:t-1})
                                ≈ Σ_{s^t_i} P(s^{t+1}_i | r^t, s^t_i) α(s^t_i).    (15)

This approximation assumes that s^t_i is conditionally independent of r^t. To improve on this approximation one could examine whether s^t_i is part of the context of r^t or whether s^{t+1}_i is specified in its outcomes: if this is the case, we can infer the state of s^t_i from knowing r^t. However, we found our approximation to be sufficient. We calculate the successor state distribution as

  P(s^{t+1}_i | r^t, s^t_i) = Σ_o P(s^{t+1}_i | o, r^t, s^t_i) P(o | r^t),    (16)

which shows us how to update the belief over S_i if we predict with rule r^t. For the computation of the second term in (14), we start with

  P(R^t = r | a^{0:t}) = Σ_{φ^t} P(R^t = r | φ^t, a^{0:t}) P(φ^t | a^{0:t})
                       = I(r ∈ Γ(a^t)) P(φ^t_r = 1, ∧_{r' ∈ Γ(a^t)\{r}} φ^t_{r'} = 0 | a^{0:t-1})
                       = I(r ∈ Γ(a^t)) P(φ^t_r = 1 | a^{0:t-1}) P(∧_{r' ∈ Γ(a^t)\{r}} φ^t_{r'} = 0 | φ^t_r = 1, a^{0:t-1}).    (17)

To simplify the summation over φ^t, we used that r models the state transition if and only if it uniquely covers (a^t, s^t). We calculate the second term in (17):

  P(φ^t_r = 1 | a^{0:t-1}) = Σ_{s^t} P(φ^t_r = 1 | s^t) P(s^t | a^{0:t-1})
                           ≈ Σ_{s^t} P(φ^t_r = 1 | s^t) α(s^t)    (18)
                           = Π_{j ∈ ι(φ^t_r)} α(S^t_j = s^r_j).    (19)

The approximation in (18) is the FF assumption. In (19), s^r denotes the configuration of the state variables according to the context of r as in (4). Thus, the terms α(S^t_j = s^r_j) correspond to the probabilities of the literals in r's context. We calculate the third term in (17) as

  P(∧_{r' ∈ Γ(a^t)\{r}} φ^t_{r'} = 0 | φ^t_r = 1, a^{0:t-1}) ≈ Π_{r' ∈ Γ(a^t)\{r}} P(φ^t_{r'} = 0 | φ^t_r = 1, a^{0:t-1})    (20)

with

  P(φ^t_{r'} = 0 | φ^t_r = 1, a^{0:t-1}) ≈ 1.0                                                   if φ_{r'} contradicts φ_r,
                                           1.0 − Π_{i ∈ ι(φ^t_{r'}), i ∉ ι(φ^t_r)} α(S^t_i = s^{r'}_i)   otherwise,    (21)

where the if-condition expresses a logical contradiction of the contexts of r and r'. Finally, we compute the reward probability straightforwardly as

  P(U^t = 1 | a^{0:t-1}) = Σ_{s^t} P(U^t = 1 | s^t) P(s^t | a^{0:t-1}, s^0) ≈ Π_{i ∈ ι(U^t)} α(S^t_i = u*_i),    (22)

where u* denotes the configuration of state variables corresponding to the planning goal as in (9). The computational costs of propagating the effects of an action are quadratic in the number of rules for this action and linear in the maximum numbers of context literals and manipulated state properties of those rules.

Our inference framework requires an approximation of the distribution P(s' | Ω_{r,0}, s) (cf. Eq. (2)) to cope with the noise outcome of NID rules. From the training data used to learn the rules, we estimate which predicates and functions change value over time; let S_c ⊆ S contain the corresponding variables. We estimate for each rule r the average number N^r_c of changed state properties when the noise outcome applies. We approximate the probability that S_i ∈ S_c changes in r's noise outcome by N^r_c / |S_c|. In case of a change, all changed values of S_i have equal probability.
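The resulting one-step filter can be summarized compactly. The sketch below is our reading of Eqs. (14)-(21), restricted to binary predicate variables and again reusing the GroundNIDRule structure from the sketch in Sec. 2.2; the noise outcome is simplified to 'no change', whereas the paper estimates its change probabilities from the training data.

```python
from typing import Dict, List

def _prod(xs):
    result = 1.0
    for x in xs:
        result *= x
    return result

def literal_prob(alpha, lit, val):
    """Probability under alpha that ground literal `lit` has truth value `val`."""
    p = alpha.get(lit, 0.0)
    return p if val else 1.0 - p

def contradicts(r1, r2):
    """True if the contexts of r1 and r2 demand different values for some literal."""
    return any(lit in r2.context and r2.context[lit] != val
               for lit, val in r1.context.items())

def ff_update(alpha: Dict, action, ground_rules: List) -> Dict:
    """One factored-frontier step alpha(s^t) -> alpha(s^{t+1}) for a fixed action.

    alpha maps each ground literal to its marginal probability of being true
    (literals absent from alpha are treated as false and are not tracked).
    """
    rules_a = [r for r in ground_rules if r.action == action]
    # P(R = r | a, alpha): r's context holds and every competing context fails (Eqs. 17-21)
    p_rule = {}
    for r in rules_a:
        p = _prod(literal_prob(alpha, lit, val) for lit, val in r.context.items())  # Eq. (19)
        for other in rules_a:
            if other is r or contradicts(other, r):
                continue                            # Eq. (21), first case: factor 1.0
            extra = [(lit, val) for lit, val in other.context.items() if lit not in r.context]
            p *= 1.0 - _prod(literal_prob(alpha, lit, val) for lit, val in extra)    # Eq. (21)
        p_rule[id(r)] = p
    p_no_rule = max(0.0, 1.0 - sum(p_rule.values()))  # remaining mass: R = 0, predict no change

    new_alpha = {}
    for lit, a_lit in alpha.items():
        prob = p_no_rule * a_lit
        for r in rules_a:
            # Eqs. (15)-(16): mix the outcome effects of rule r for this literal
            p_true = 0.0
            for p_o, changes in r.outcomes:
                effect = changes.get(lit)
                p_true += p_o * (1.0 if effect is True else 0.0 if effect is False else a_lit)
            p_true += r.p_noise * a_lit               # simplified noise model
            prob += p_rule[id(r)] * p_true
        new_alpha[lit] = prob
    return new_alpha
```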
5. Action-sampling planning

PRADA plans in stochastic relational domains by predicting the effects of action sequences using the graphical model in Fig. 1(b) and the approximate inference method described in the last section. We sample sequences of actions a^{0:T-1} of length T. For 0 < t ≤ T, we compute the posteriors over states P(s^t | a^{0:t-1}, s^0) and rewards P(u^t | a^{0:t-1}, s^0) (in the sense of filtering or state monitoring) and calculate the value of an action sequence with a discount factor 0 < γ < 1 as

  Q(a^{0:T-1}, s^0) := Σ_{t=1}^{T} γ^t P(U^t = 1 | a^{0:t-1}, s^0).    (23)

We choose the first action of the best sequence a* = argmax_{a^{0:T-1}} Q(a^{0:T-1}, s^0), if its value exceeds a certain threshold θ (e.g., θ = 0). Otherwise, we continue sampling until either an action is found or sampling is given up.

We aim for a strategy to sample good action sequences with high probability. We propose to choose with equal probability among the actions that have a unique covering rule for the current state and thereby avoid the use of the noisy default rule. For the action at time t, PRADA samples from the distribution

  P^t_{sample}(a) ∝ Σ_{r ∈ Γ(a)} P(φ^t_r = 1, ∧_{r' ∈ Γ(a)\{r}} φ^t_{r'} = 0 | a^{0:t-1}),    (24)

taking the current state distribution into account. Thereby, the probability to sample an action sequence a predicting the state sequence s^0, ..., s^T depends on the likelihood of the state sequence given a: the more likely the required outcomes are, the more likely the next actions will be sampled. PRADA does not miss actions which SST explores:

Theorem 1: The set of action sequences PRADA samples with non-zero probability is a super-set of the one of SST.

A proof of this theorem can be found at http://cs.tu-berlin.de/lang/prada/.

Besides the difference in handling the noise outcome, PRADA and SST follow opposite approaches: PRADA samples actions and calculates the transitions approximately, while SST considers all actions (and is thus exact in its action search) and samples transitions. This implies an important difference: both algorithms are faced with the problem that the search space of action sequences is exponential in the planning horizon. To calculate the value of a given action sequence, however, SST still needs exponential time due to its outcome sampling. In contrast, PRADA propagates the state transitions forward and is thus linear in the horizon.

5.1. An extension: Adaptive PRADA

We can exploit the fact that PRADA returns a sequence of actions to increase planning accuracy, in contrast to SST, where the actions taken at t > 0 depend on the sampled outcomes. Adaptive PRADA (A-PRADA) examines the best found action sequence of PRADA and decides by simulation whether some action can be deleted such that the expected reward is increased, e.g., by deleting actions that do not have significant effects on achieving our goal. We refer the reader for more technical details to http://cs.tu-berlin.de/lang/prada/.
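Putting Secs. 4 and 5 together, a schematic of PRADA's basic sampling loop (our sketch, without the A-PRADA post-processing) might look as follows. The three callables are assumptions of ours: sample_action is assumed to implement the distribution of Eq. (24), propagate the factored-frontier step of Sec. 4 (e.g. a closure around the ff_update sketch with the ground rules bound), and goal_prob the reward marginal of Eq. (22).

```python
import random

def prada_plan(alpha0, sample_action, propagate, goal_prob, horizon, n_samples,
               gamma=0.95, threshold=0.0, rng=random.Random(0)):
    """Sample action sequences, score them with Eq. (23), return the best first action.

    alpha0 is the factored belief over the known initial state (marginals 0 or 1).
    Assumed helper callables:
      sample_action(alpha, rng) -> action drawn according to Eq. (24)
      propagate(alpha, action)  -> next factored belief (the filter of Sec. 4)
      goal_prob(alpha)          -> P(U^t = 1 | a^{0:t-1}) as in Eq. (22)
    """
    best_value, best_sequence = float("-inf"), None
    for _ in range(n_samples):
        alpha, sequence, value = dict(alpha0), [], 0.0
        for t in range(horizon):
            a = sample_action(alpha, rng)
            alpha = propagate(alpha, a)                   # forward propagation, linear in the horizon
            value += gamma ** (t + 1) * goal_prob(alpha)  # discounted reward probability, Eq. (23)
            sequence.append(a)
        if value > best_value:
            best_value, best_sequence = value, sequence
    return best_sequence[0] if best_value > threshold else None
```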
Figure 2. A simulated robot plays with cubes of different sizes scattered on a table. Blocks that have fallen off the table cannot be manipulated anymore.

6. Results

6.1. Setup

We test SST and (A-)PRADA in an extended simulated blocks world where a robot manipulates blocks scattered on a table (Fig. 2). We use a 3D rigid-body dynamics simulator (ODE) that enables realistic behavior of the blocks. For instance, piles of blocks may topple over or blocks may even fall off the table (in which case they become out of reach for the robot). The robot can grab blocks and put them on top of other blocks or on the table. Its actions are affected by noise, so that the resulting block piles are not straight-lined. We assume full observability of triples (s, a, s') that specify how the world changed when an action was executed in a certain state. We represent the data with the predicates on(X, Y), block(X), table(X), out(X), inhand(X), upright(X) and the function size(X) for state descriptions, and puton(X), grab(X) and doNothing() for actions. If there are o objects and f different object sizes, the action space contains 2o + 1 actions, while the state space is huge with f^o 2^{o^2 + 6o} different states (not excluding states one would classify as "impossible" given some intuition about real-world physics).
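To give a feeling for these counts, the following short calculation (our own arithmetic, plugging in the numbers of the training world used below: six blocks plus the table and three sizes) evaluates both formulas.

```python
o, f = 7, 3                       # 6 blocks + the table, 3 object sizes (training world)
num_actions = 2 * o + 1           # puton(X) and grab(X) per object, plus doNothing()
num_states = f**o * 2**(o**2 + 6 * o)
print(num_actions)                # 15
print(f"{num_states:.1e}")        # roughly 5.4e+30
```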
6.2. Experiments

We use the rule learning algorithm of Pasula et al. (2007) with the same parameter settings to learn three different sets of fully abstract NID rules from independent training sets of 500 experience triples each. Training data to learn rules are generated in a world of six cubic blocks of three different sizes by performing random actions with a slight bias to build high towers. The resulting rule-sets contain 12, 9 and 13 rules, respectively. We perform three experiments with planning goals of increasing difficulty. If a planning algorithm does not find a suitable action in a given situation, we restart the planning procedure: SST builds a new tree and (A-)PRADA takes new action-sequence samples. If after 10 planning runs for a given situation a suitable action still is not found, the trial fails. In each experiment, we test the planners in worlds with varying numbers of blocks of three different sizes. Thus, we transfer the knowledge gained in the training world to different, but similar worlds by using abstract NID rules. In each experiment, we create five start situations with different blocks for each number of blocks. Per rule-set and start situation, we perform three independent runs with different random seeds. Thus, all our reported results are averages over 45 trials (3 rule-sets, 5 start situations, 3 runs). All experiments are run on a 2 GHz dual-core PC.

6.2.1. Average height

First, we repeat the experiment of Pasula et al. which uses the average height of blocks as the reward function. This constitutes an easy planning problem, as many different actions may increase the reward (block identities do not matter) and a small planning horizon d is sufficient. We set SST to d = 4 and branching factors b = 3 and b = 4 (the latter was Pasula's choice) and (A-)PRADA to horizon d = 6. Initial states do not contain already stacked blocks. In all our experiments we found that, as long as d is not too small, its exact choice does not have significant effects on (A-)PRADA's planning quality. As Table 2 and Fig. 3 show, PRADA and A-PRADA achieve rewards comparable to SST, while their running time is significantly smaller, by an order of magnitude (although affording a longer planning horizon).

Table 2. Average height problem. Obj. denotes the number of objects (blocks and table) and Reward the discounted total reward. The reward for performing no actions is 8.03.

  Obj.   Planner      Reward        Trial time (s)
  6+1    SST (b=3)    10.04±1.23    69.02±14.62
  6+1    SST (b=4)     9.83±1.20    179.99±41.71
  6+1    PRADA        11.01±1.41    9.31±0.34
  6+1    A-PRADA      10.64±1.59    10.64±1.59
  8+1    SST (b=3)     9.66±0.88    307.62±74.77
  8+1    SST (b=4)     9.87±0.95    796.38±153.16
  8+1    PRADA        10.29±1.14    30.81±2.12
  8+1    A-PRADA       9.86±1.20    32.94±2.32
  10+1   SST (b=3)     9.72±0.69    1213.76±511.20
  10+1   SST (b=4)     9.71±0.80    3061.37±1112.13
  10+1   PRADA         9.69±0.68    77.24±6.74
  10+1   A-PRADA       9.96±0.77    84.47±7.07

Figure 3. Average height problem

6.2.2. Tower of specific blocks

We investigate a slightly more difficult problem: the goal is to stack three specific blocks. Start situations are chosen such that the goal can be achieved by means of four actions. We set the horizon d = 4, which is optimal for SST, although in principle d, which heavily affects planning time, cannot be known a priori. By contrast, we set d = 16 for (A-)PRADA (using the heuristic of setting d to twice the average number of blocks, which we found to perform well in similar experiments not reported here). A trial is limited to a maximum number of 20 actions. If the goal is not achieved by then or if one of the three blocks to be stacked falls off the table, the trial fails. Table 3 and Fig. 4 show that SST is either extremely slow (for large b) or its performance is bad (for small b). PRADA planning times scale well, as shown in worlds of 8 and 10 blocks, while maintaining a high planning quality. PRADA's planning quality is significantly better than that of SST with small b and comparable to SST with large b, while the latter is more than 10-20 times slower. A-PRADA strongly improves the planning quality of PRADA and is significantly better than SST with large b = 3, for it avoids senseless actions which may cause blocks to fall off the table.

Table 3. Tower of three specific blocks problem. Suc. denotes the success rate, Actions the number of executed actions in case of success, and Act. time the planning time for single actions, in contrast to the trial time in Fig. 4.

  Obj.   Planner      Suc.   Actions      Act. time (s)
  6+1    SST (b=1)    0.36   6.38±2.31    0.26±0.06
  6+1    SST (b=2)    0.60   5.96±1.37    1.47±0.31
  6+1    SST (b=3)    0.78   5.54±2.13    5.80±1.18
  6+1    SST (b=4)    0.80   5.11±1.35    14.91±2.78
  6+1    PRADA        0.73   5.58±2.15    0.55±0.08
  6+1    A-PRADA      0.84   4.82±1.09    0.76±0.14
  8+1    SST (b=1)    0.29   5.54±1.98    1.45±0.61
  8+1    SST (b=2)    0.47   5.86±1.49    6.47±1.48
  8+1    SST (b=3)    0.64   5.14±1.41    24.36±4.78
  8+1    SST (b=4)    0.69   4.81±1.08    65.73±15.14
  8+1    PRADA        0.73   6.73±3.47    1.41±0.32
  8+1    A-PRADA      0.84   5.63±1.63    2.12±0.58
  10+1   SST (b=1)    0.36   7.62±3.38    5.74±3.87
  10+1   SST (b=2)    0.62   6.14±1.92    29.21±15.74
  10+1   SST (b=3)    0.67   5.60±1.69    115.83±67.45
  10+1   SST (b=4)    0.84   5.42±1.27    293.93±18.13
  10+1   PRADA        0.76   7.53±3.55    4.21±1.94
  10+1   A-PRADA      0.78   5.77±1.83    5.28±2.09

Figure 4. Tower of three specific blocks problem

6.2.3. Reverse tower

To explore its limits, we want PRADA to reverse a tower of b blocks, which requires at least 2b actions (each block needs to be moved at least once). Apart from the long planning horizon, this is difficult due to the noise in the simulated world: towers can become unstable and topple over, with blocks falling off the table.
To decrease this noise slightly and obtain more reliable results, we forbid the robot to grab objects that are not clear (i.e., below other objects). We set a limit of 50 actions on each trial. Table 4 presents our results. We cannot get SST to solve this problem even for five blocks (with the optimal d = 10). Although the space of possible actions is reduced due to the mentioned restriction, SST has enormous runtimes. With b = 1, SST does not find suitable actions (no leaves with the goal state) in several starting situations even when building ten trees (taking about two hours): the increased planning horizon leads to a high probability of sampling at least one bad outcome for a required action. For b ≥ 2, a single tree traversal takes more than a day. In contrast, PRADA and A-PRADA often succeed in satisfactory times (with planning horizons d = 20 for 5 blocks and d = 30 for 6 and 7 blocks). When they fail, this is mainly due to blocks falling off the table and not because actions cannot be found anymore. A-PRADA can reverse a tower of 7 blocks with a success rate > 1/2 and a trial time of about 5 1/2 minutes.

Table 4. Reverse tower problem. Suc. is the success rate and Actions the number of used actions in case of success.

  Obj.   Planner      Suc.   Trial time (s)   Actions
  5+1    SST (b=1)    0.0    -                -
  5+1    SST (b=2)    0.0    > 1 day          -
  5+1    PRADA        0.84   79.9±26.5        12.6±2.9
  5+1    A-PRADA      0.78   66.3±15.6        10.6±1.4
  6+1    PRADA        0.42   184.9±51.9       14.6±2.5
  6+1    A-PRADA      0.49   190.4±49.8       12.8±1.7
  7+1    PRADA        0.47   415.9±186.3      18.1±5.1
  7+1    A-PRADA      0.56   331.6±118.3      14.8±1.8
7. Conclusions and Outlook

We have introduced an efficient planning method for the probabilistic relational rules proposed by Pasula et al. (2007), based on approximate inference in DBNs. This enables us to learn the dynamics of a complex stochastic world and quickly derive appropriate actions for varying goals. Results in a 3D simulated blocksworld with an articulated manipulator and realistic physics show that our method outperforms existing approaches. Our approach relies on working in the full ground representation. To apply it in domains with very many objects, it needs to be combined with methods that reduce the state and action space complexity in relational domains, see e.g. Gardiol and Kaelbling (2007). Learning rule-sets online and exploiting them immediately by our planning method is another direction of future research, as are extensions that manipulate sampled action sequences.

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments. This work was supported by the German Research Foundation (DFG), Emmy Noether fellowship TO 409/1-3.

References

Botvinick, M. M., & An, J. (2009). Goal-directed decision making in prefrontal cortex: a computational framework. Advances in Neural Information Processing Systems (NIPS) (pp. 169-176).

Croonenborghs, T., Ramon, J., Blockeel, H., & Bruynooghe, M. (2007). Online learning and exploiting relational models in reinforcement learning. Proc. of the Int. Conf. on Artificial Intelligence (IJCAI) (pp. 726-731).

Dayan, P., & Hinton, G. E. (1997). Using expectation-maximization for reinforcement learning. Neural Computation, 9, 271-278.

Gardiol, N. H., & Kaelbling, L. P. (2007). Action-space partitioning for planning. Proc. of the Nat. Conf. on Artificial Intelligence (AAAI) (pp. 980-986).

Getoor, L., & Taskar, B. (Eds.). (2007). Introduction to statistical relational learning. MIT Press.

Kearns, M. J., Mansour, Y., & Ng, A. Y. (2002). A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49, 193-208.

Kersting, K., van Otterlo, M., & de Raedt, L. (2004). Bellman goes relational. Proc. of the Int. Conf. on Machine Learning (ICML) (pp. 465-472).

Murphy, K. P., & Weiss, Y. (2001). The factored frontier algorithm for approximate inference in DBNs. Proc. of the Conf. on Uncertainty in Artificial Intelligence (UAI) (pp. 378-385).

Pasula, H. M., Zettlemoyer, L. S., & Kaelbling, L. P. (2007). Learning symbolic models of stochastic domains. Artificial Intelligence Research, 29, 309-352.

Sanghai, S., Domingos, P., & Weld, D. (2005). Relational dynamic Bayesian networks. Artificial Intelligence Research, 24, 759-797.

Sanner, S., & Boutilier, C. (2007). Approximate solution techniques for factored first-order MDPs. Proc. of the Int. Conf. on Automated Planning and Scheduling (ICAPS) (pp. 288-295).

Toussaint, M., & Storkey, A. (2006). Probabilistic inference for solving discrete and continuous state Markov decision processes. Proc. of the Int. Conf. on Machine Learning (ICML) (pp. 945-952).

van Otterlo, M. (2009). The logic of adaptive behavior. IOS Press, Amsterdam.