Probabilistic Backward and Forward Reasoning in Stochastic Relational Worlds

Tobias Lang lang@cs.tu-berlin.de
Marc Toussaint mtoussai@cs.tu-berlin.de
TU Berlin, Machine Learning and Robotics Group, Franklinstraße 28/29, 10587 Berlin, Germany

Abstract

Inference in graphical models has emerged as a promising technique for planning. A recent approach to decision-theoretic planning in relational domains uses forward inference in dynamic Bayesian networks compiled from learned probabilistic relational rules. Inspired by work in non-relational domains with small state spaces, we derive a backpropagation method for such nets in relational domains starting from a goal state mixture distribution. We combine this with forward reasoning in a bidirectional two-filter approach. We perform experiments in a complex 3D simulated desktop environment with an articulated manipulator and realistic physics. Empirical results show that bidirectional probabilistic reasoning can lead to more efficient and accurate planning in comparison to pure forward reasoning.

The relational reinforcement learning framework (Dzeroski et al., 2001) investigates the use of compact relational representations for state and action spaces. We are interested in model-based approaches enabling agents with changing goals to plan for the goal at hand by internal simulation. Regarding the learning of such world models, Pasula et al. (2007) have proposed noisy indeterministic deictic (NID) rules. These rules extend probabilistic STRIPS operators (Kushmerick et al., 1995) by two special constructs essential for learning: a noise outcome, which avoids modeling rare and overly complex outcomes, and deictic references, which refer to objects other than the action arguments and allow for more compact rule-sets.
Actions are modeled by different rules whose preconditions do not have to be mutually exclusive; in a specific situation, we require a unique rule with satisfied preconditions and unique deictic references to avoid contradictory predictions. To plan with these learned rules, one can formalize the problem as a relational Markov decision process (MDP). The field of Symbolic Dynamic Programming (Boutilier et al., 2001) investigates methods to compute policies over complete state and action spaces, working in the lifted abstract representation without grounding or referring to particular object instances (Kersting et al., 2004; Sanner & Boutilier, 2009). As an alternative, several methods exist for reasoning in the grounded domain, which makes it straightforward to account for the noise outcome and the uniqueness requirement of NID rules. The forward planners SST (Kearns et al., 2002) and UCT (Kocsis & Szepesvari, 2006) sample outcomes to cope with the stochasticity of grounded actions. PRADA (Lang & Toussaint, 2009) converts NID rules into a grounded dynamic Bayesian network (DBN) representation and propagates action effects forward by means of a factored frontier filter (Murphy & Weiss, 2001). Backward planning in grounded domains, in contrast, has largely focused on non-probabilistic domains with deterministic action outcomes. Classical approaches use regression techniques to map action operators and state variables to formulas describing the conditions under which a variable becomes true (Rintanen, 2008).

1. Introduction

Intelligent agents have to accomplish two tasks to act autonomously in complex worlds: first, they need to learn how their environment works; second, they have to use the acquired knowledge efficiently to decide which actions to take to achieve their goals and maximize the expected rewards. Research on the first task has led to probabilistic relational knowledge representations: these can deal with stochastic actions, cope with noise and generalize over objects and situations.
In spite of considerable progress over recent years, designing efficient algorithms for reward maximization in such learned world models is still a major concern. Action decision-making (Boutilier et al., 1999) in stochastic relational domains is often cast in a relational reinforcement learning framework.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Our idea in this paper is to derive a backpropagation method to enable bidirectional reasoning in probabilistic relational domains. This may drastically prune the search space of action sequences: instead of search spaces which are exponential in the length d of the complete plans, we only have to consider search spaces which are exponential in d/2. Furthermore, backward reasoning is particularly useful in problems where the number of possible actions close to goal states is small in comparison to the start state. Consider for instance the goal to build a tower from objects which are initially scattered over a table. Unfortunately, previous approaches such as the forward-backward algorithm in Hidden Markov models (HMMs) (Rabiner, 1989) and the planning-by-inference paradigm (Toussaint & Storkey, 2006) in non-relational MDPs work only in small state spaces and are not applicable in DBNs for grounded relational domains, where exact inference is infeasible. The factored frontier inference of PRADA cannot be used directly for backpropagation, either. A core challenge is how to condition the state distribution at the last time-step on receiving a high reward when using approximate inference. This problem arises in particular for complex abstract reward dependencies such as partial goal descriptions.
We make three contributions to overcome these problems: (i) we show how to use NID rules to learn a probabilistic backward model; (ii) we model arbitrary (partial) goal descriptions with a mixture state distribution and derive a probabilistic backward reasoning procedure; (iii) we introduce a two-filter (Solo, 1982) inference method to use bidirectional reasoning for planning in stochastic relational domains. The remainder of this paper is organized as follows: first, we review the theoretical background, namely NID rules and the PRADA planning algorithm. In Section 3, we present our two-filter using backward reasoning with NID rules. In Section 4, we introduce the bidirectional planning approach. Then, we show our empirical results before we conclude.

A concrete instantiation of a relational domain is made up of a finite set of objects O. If the arguments of a predicate or function are all concrete, i.e. taken from O, we call it grounded. A concrete world state s is fully described by all grounded predicates and functions. Concrete actions a are described by positive grounded predicates from A. The arguments of predicates and functions can also be abstract logical variables which can represent any object. If a predicate or function has only abstract arguments, we call it abstract. We speak of grounding an abstract formula ψ if we apply a substitution σ that maps all of the variables appearing in ψ to objects in O.

2.2. Noisy Indeterministic Deictic Rules

Noisy indeterministic deictic (NID) rules (Pasula et al., 2007) are a model of the transition dynamics in relational domains. Table 1 shows an exemplary NID rule for a desktop environment. A NID rule r is given as follows:

  a_r(X) : φ_r(X)  →  p_{r,1}   : Ω_{r,1}(X)
                      ...
                      p_{r,m_r} : Ω_{r,m_r}(X)
                      p_{r,0}   : Ω_{r,0}          (1)

where X is a set of logic variables in the rule (which represent a (sub-)set of abstract objects).
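To make the rule format of Eq. (1) concrete, the following is a minimal Python sketch of a grounded NID rule with a context, probabilistic outcomes and a noise outcome, instantiated with a grounded version of the Table 1 example. The class name, the literal encoding (strings with a "-" prefix for negation) and the choice to leave the state unchanged under the noise outcome are illustrative assumptions, not the authors' implementation.

```python
import random

class NIDRule:
    def __init__(self, action, context, outcomes):
        self.action = action      # grounded action, e.g. "dropabove(a, b)"
        self.context = context    # set of literals that must hold in the state
        self.outcomes = outcomes  # list of (probability, changes); None = noise

    def covers(self, state, action):
        """The rule covers (state, action) if the action matches and all
        context literals hold in the state."""
        return action == self.action and self.context <= state

    def sample_successor(self, state, rng=random):
        """Sample an outcome and apply its changes; for simplicity the noise
        outcome (None) leaves the state unchanged here."""
        probs, changes = zip(*self.outcomes)
        chosen = rng.choices(changes, weights=probs)[0]
        if chosen is None:                      # noise outcome
            return set(state)
        new_state = set(state)
        for lit in chosen:
            if lit.startswith("-"):
                new_state.discard(lit[1:])      # literal becomes false
            else:
                new_state.add(lit)              # literal becomes true
        return new_state

# Grounded version of the Table 1 rule for objects a, c and table t:
rule = NIDRule(
    "dropabove(a, b)",
    {"inhand(a)", "on(c, b)", "table(t)"},
    [(0.6, {"on(a, c)", "-inhand(a)"}),
     (0.3, {"on(a, t)", "-inhand(a)"}),
     (0.1, None)],
)
state = {"inhand(a)", "on(c, b)", "table(t)"}
succ = rule.sample_successor(state, random.Random(0))
```

A planner would only use `sample_successor` when `covers` identifies this rule as the unique covering rule for the state-action pair.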
The rule r consists of preconditions, namely that action a_r is applied on X and that the state context φ_r is fulfilled, and m_r + 1 different outcomes with associated probabilities p_{r,i} > 0, Σ_{i=0}^{m_r} p_{r,i} = 1. Each outcome Ω_{r,i}(X) describes which predicates and functions change when the rule is applied. The context φ_r(X) and outcomes Ω_{r,i}(X) are conjunctions of literals constructed from the predicates in P as well as equality statements comparing functions from F to constant values. In contrast to the outcomes, the context may contain derived predicates and functions, enabling complex and abstract situation descriptions. The so-called noise outcome Ω_{r,0} subsumes all possible action outcomes which are not explicitly specified by one of the other outcomes Ω_{r,i}.

Table 1. Example NID rule for a desktop world, which models dropping an object X over an object Y in a situation where Y is below an object Z (deictic referencing). With high probability, X will land on Z, but it might also fall on the table. With a small probability something unpredictable happens.

  dropabove(X, Y) : inhand(X), on(Z, Y), table(T)
    →  0.6 : on(X, Z), ¬inhand(X)
       0.3 : on(X, T), ¬inhand(X)
       0.1 : noise

2. Background

2.1. State and Action Representation

A relational domain is represented by a relational logic language L: the set of logical predicates P and the set of logical functions F contain the relationships and properties that can hold for domain objects. We distinguish between primitive and derived concepts. The latter are defined in terms of formulas over primitive or other derived concepts. The set of logical predicates A comprises the possible actions in the domain.

Figure 1. (a) Forward DBN, (b) Backward DBN. PRADA converts NID rules into a forward DBN (a) to predict the effects of action sequences.
For backward reasoning, we use a second DBN (b) with the same state, action and reward variables, but different rule, precondition and outcome variables (marked with a superscript B) according to the backward rules.

PRADA represents each grounded predicate or function at time t as a random variable S_i^t with value s_i^t. Random variables A for actions, as well as variables for rules, preconditions and outcomes O, model the transitions between states S^t and S^{t+1} at subsequent time-steps. PRADA uses a factored frontier (Murphy & Weiss, 2001) approximating the posterior P(s^t | a^{0:t-1}) to efficiently propagate the effects of action sequences forward. The reward gained in a state is represented by a binary variable R which can be used to express arbitrary reward expectations. For planning, i.e., to find an action sequence a with large P(R | a), PRADA samples action sequences in an informed way, taking predicted state distributions into account: action a is sampled at time-step t with a probability proportional to the probability that a has a unique covering rule, in which case one can make meaningful predictions.

3. Two-Filter Smoothing using Backward NID Rules

We propose to adapt PRADA for backward reasoning using NID rules. Then, we can exploit the knowledge about the goal for planning.

3.1. Backward Messages for a Two-Filter

Existing approaches for combining probabilistic backward and forward reasoning calculate smoothed state posteriors conditioned on a sequence of observed variables, e.g. the forward-backward algorithm in HMMs (Rabiner, 1989) or Expectation-Maximization used for planning by inference in small state spaces (Toussaint & Storkey, 2006). Here, we want to calculate posteriors P(s^t, R^T | a^{0:T-1}), where T is the last time-step, which we can use to evaluate an action sequence by

  P(R^T | a^{0:T-1}) = Σ_{s^t} P(s^t, R^T | a^{0:T-1}) .   (2)

The arguments of the action a(X_a) may be a true subset X_a ⊂ X of the variables X of the rule.
The remaining variables are called deictic references D = X \ X_a and denote objects relative to the agent or action being performed. Let σ denote a substitution that maps variables to constant objects, σ : X → O. Applying σ to an abstract rule r(X) yields a grounded rule r(σ(X)). We say a grounded rule r covers a state s and a grounded action a if s |= φ_r and a = a_r. If r is the unique covering rule for a in s, it defines P(s' | a, s), the probability of a successor state s' if action a is performed in state s, according to r's outcomes and their probabilities. If there is no unique covering rule, the effects of a are explained as noise by a default rule. NID rules can be learned from experience triples (s, a, s') using a batch algorithm that trades off the likelihood of these triples with the complexity of the learned rule-set.

2.3. Planning using Probabilistic Inference

The PRADA algorithm (Lang & Toussaint, 2009) plans forward in stochastic relational domains using a set of NID rules that predict state transitions. It copes efficiently with stochastic action outcomes by means of probabilistic inference. It converts NID rules into a structured DBN representation (Fig. 1(a)). To clarify notation, we denote random variables by upper case letters (e.g. S), their values by the corresponding lower case letters (e.g. s ∈ dom(S)), variable tuples by bold upper case letters (e.g. S = (S1, S2, S3)) and value tuples by bold lower case letters (e.g. s = (s1, s2, s3)). We also use colon notation (e.g. s^{3:5} = (s^3, s^4, s^5)).

These posteriors can be calculated by means of forward messages α_{a^{0:t-1}}(s^t) := P(s^t | a^{0:t-1}) and backward messages β_{a^{t:T-1}}(s^t) := P(R^T | s^t, a^{t:T-1}) such that P(R^T, s^t | a^{0:T-1}) = α_{a^{0:t-1}}(s^t) · β_{a^{t:T-1}}(s^t). It is intractable to calculate these messages exactly in relational domains due to the immense state spaces. PRADA calculates the forward messages approximately using a factored frontier filter.
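The way the two messages combine can be illustrated numerically. The following toy example (a made-up three-state space with made-up values, not from the paper's experiments) computes P(R^T | a^{0:T-1}) as the sum over states of the pointwise product of forward and backward messages:

```python
# alpha is the filtered state distribution P(s^t | a^{0:t-1});
# beta is the reward likelihood P(R^T | s^t, a^{t:T-1}).
# Their pointwise product, summed over states, yields the action
# sequence's reward probability P(R^T | a^{0:T-1}).
alpha = {"s1": 0.5, "s2": 0.3, "s3": 0.2}
beta = {"s1": 0.9, "s2": 0.1, "s3": 0.0}
p_reward = sum(alpha[s] * beta[s] for s in alpha)
# p_reward = 0.5*0.9 + 0.3*0.1 + 0.2*0.0 = 0.48
```

In a real relational domain neither dictionary could be enumerated; this is exactly why both messages must be represented approximately by factored frontiers.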
Unfortunately, PRADA's specific factored frontier equations only work for forward reasoning and cannot be applied for calculating the likelihood backward messages β. We might use rejection sampling in PRADA's forward DBN, but this is highly inefficient. It is in general unclear how to calculate the β even approximately in a tractable way in all but the smallest state spaces. Therefore, as an alternative we propose a filtering approach for backward reasoning. We use PRADA in reversed order, providing us with messages β̂_{a^{t:T-1}}(s^t) := P(s^t | a^{t:T-1}, R^T). This requires a set of backward NID rules from which we can build a backward DBN (Fig. 1(b)) and apply PRADA's factored frontier inference (see next section). A two-filter (Solo, 1982; Briers et al., 2009) uses the resulting messages β̂ to approximate the likelihood messages β:

  β_{a^{t:T-1}}(s^t) = P(R^T | s^t, a^{t:T-1})
    = P(s^t | R^T, a^{t:T-1}) P(R^T | a^{t:T-1}) / P(s^t | a^{t:T-1})
    ≈ P(s^t | R^T, a^{t:T-1}) P(R^T) / P(s^t)
    ∝ P(s^t | R^T, a^{t:T-1}) = β̂_{a^{t:T-1}}(s^t) .   (3)

Figure 2. Using a unary action predicate pickup(A), it is impossible to deduce from the successor state (via a deictic reference) whether A was taken from B or C. This information is captured explicitly in the binary action predicate takefrom(A, B). Due to its action sampling strategy, extending the arity of actions does not influence PRADA's planning efficiency.

This can be solved by increasing the arity of the action predicate so that fewer deictic references need to be resolved. For this reason, in our experiments we will use binary action predicates takefrom(X, Y) and dropabove(X, Y) instead of grab(X) and puton(Y), while still allowing for deictic referencing to third or fourth objects.
At first glance, one might suspect that extending the action predicate arity increases the planning complexity due to the increased action space. This is resolved, however, when using a policy that only considers actions with unique rules. As we regularize our rule learning procedure, the learned rules model typical state transitions. Thus, a planner using these rules only takes actions in frequently observed contexts into account, effectively pruning large parts of the action space in a given situation. Furthermore, determining all unique covering rules has the same computational cost, independently of whether Y is used as a second action argument or as a deictic reference.

The approximations of this two-filter are due to the intractable state distributions P(s^t) at specific time-steps t, which are also required to account for the dependencies of R^T on a^{t:T-1}. Since in planning we are interested in ranking different action sequences a, we can drop the likewise intractable reward marginal P(R^T), just as we can drop P(s^t) assuming a uniform state prior.

3.2. Backward Rules

To use PRADA for backward filtering, we require a set of backward NID rules to define a distribution P(s | s', a) over the predecessor state s if an action a was applied before the current state s'. These rules take exactly the same form as in the forward case; only the semantics are changed. Given a forward model, one might try to invert the corresponding rules. It is unclear, however, how to account for the special characteristics of NID rules, such as uniqueness and noise outcomes, in this case. As our forward models are learned and thus in any event approximations of the true underlying dynamics, we propose to learn the backward rules directly from data as well. This has the advantage that we can use the same algorithm which we already use for learning the forward rules. We only have to provide the experience triples in reversed order (s', a, s).
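Learning the backward model thus requires no new machinery: the unchanged rule learner is simply run on reversed experience triples. A one-function sketch (the helper name and literal encoding are illustrative):

```python
def reversed_triples(experiences):
    """Turn forward experience triples (s, a, s') into backward triples
    (s', a, s), so the unchanged rule learning algorithm induces
    backward NID rules from the very same data."""
    return [(s_next, a, s) for (s, a, s_next) in experiences]

# Example: a forward experience becomes a backward one.
forward = [({"inhand(a)"}, "dropabove(a, b)", {"on(a, b)"})]
backward = reversed_triples(forward)
```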
Depending on the domain, state transitions may be easier to model in one direction than the other. This may affect the deictic references in NID rules, which may be unique only in one direction. Consider for instance the unary predicate pickup(X) used by Pasula et al. (Fig. 2). A forward rule could use a deictic reference Y to describe where X was taken from, which is required to conclude ¬on(X, Y) for the successor state. When reasoning backward, looking only at the successor state, it is impossible to determine Y.

4. Backward-Forward Planning

We use a two-filter to plan in stochastic relational domains. Given a backward model B defining P(s | s', a) in form of NID rules for a state s, an action a and successor state s', we apply PRADA first to reason backward to estimate a distribution of states backward-reachable from goal states. Then, we use a second set of NID rules specifying a forward model A defining P(s' | s, a) to reason forward from the initial state to find action sequences leading to states close to a goal state.

4.1. Goal State Distributions

For probabilistic backward reasoning, we require a state distribution at the last time-step T. We are interested in states achieving a high reward, namely

  β̂^T(s) := P(s | R^T) .   (4)

In contrast to previous work on planning by inference, we cannot calculate β̂^T(s) exactly in relational domains due to the large state spaces. Thus, we approximate it with a factored frontier β̂^T(s) ≈ Π_i β̂^T(s_i). If the goal fully specifies the final state, setting the marginals β̂^T(s_i) to their deterministic values is straightforward. If the goal is defined in terms of a partial state description in form of a conjunction of primitive predicates and functions, only some state attributes have deterministic values.
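For fully or partially specified goals, such a factored goal frontier can be sketched as follows. The helper, its `prior_strength` parameter and the fallback toward the initial state s0 (which anticipates the prior P^F over unspecified attributes used in the paper) are illustrative assumptions:

```python
def goal_frontier(attributes, goal_literals, s0, prior_strength=0.9):
    """Build per-attribute marginals P(s_i | R^T) for the final time-step.
    Attributes fixed by the (partial) goal get deterministic values;
    unspecified attributes are biased toward their value in the
    initial state s0 ("-" prefix denotes a negated goal literal)."""
    frontier = {}
    for att in attributes:
        if att in goal_literals:            # goal demands att = true
            frontier[att] = 1.0
        elif "-" + att in goal_literals:    # goal demands att = false
            frontier[att] = 0.0
        elif att in s0:                     # unspecified: stay close to s0
            frontier[att] = prior_strength
        else:
            frontier[att] = 1.0 - prior_strength
    return frontier

frontier = goal_frontier(["on(a,b)", "on(b,c)", "clear(a)"],
                         goal_literals={"on(a,b)"}, s0={"clear(a)"})
```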
The situation becomes more difficult to deal with if the goal is specified in terms of a derived predicate, corresponding to formulas over primitive predicates, such as existentially quantified goals. Consider for instance the goal to stack the cubes {a, b, c} in any order. In this case, the clearly dissimilar states s1 = {on(a, b), on(b, c)} and s2 = {on(c, b), on(b, a)} yield the same reward. If we approximate the final state belief by marginals, we lose the crucial correlations among the variables. This is a general problem in backward planning and arises likewise in non-probabilistic and propositional domains. A common strategy there is to pick arbitrary grounded forms of the goal, e.g. choosing s1 in the example above. This has the pitfall that some goal groundings may not be reachable or may be more costly to reach from the given state. To avoid these problems and achieve a closer approximation of the goal state distribution β̂^T, we approximate it by means of a mixture model with individual components β̂_c^T:

  β̂^T(s^T) ≈ (1/C) Σ_{c=1}^{C} β̂_c^T(s^T) .   (5)

4.2. Backward Messages

For each component β̂_c^T of the mixture model approximation of β̂^T given in Eq. (5), we sample N backward action sequences b_{ci} = (b_{ci}^{T-1}, ..., b_{ci}^{T-D}) of horizon D using PRADA and the backward model B, where we set β̂_c^T as the initial state distribution. One such sample b_{ci} results in the distribution β̂_{b_{ci}}^t(s^t) = P(s^t | b_{ci}, c, R^T). We do not want to evaluate a forward action sequence a with each backward action sequence b_{ci} individually. Hence, we approximate state posteriors β̂^t(s^t) = P(s^t | R^T), generalizing over concrete action sequences, as

  β̂^t(s^t) ≈ (1/C) Σ_{c=1}^{C} (1/N) Σ_{i=1}^{N} β̂_{b_{ci}}^t(s^t) .   (7)

The resulting β̂^t define the probability of states according to backward reasoning from the goal state mixture distribution β̂^T using PRADA's action sampling strategy.
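The two-level average in Eq. (7), over mixture components and over the sampled backward action sequences within each component, can be spelled out with toy numbers (C = 2 components, N = 2 samples per component, made-up values):

```python
# beta_samples[c] holds N per-state backward messages beta^t_{b_ci}(s^t)
# obtained from N sampled backward action sequences for component c.
beta_samples = {
    1: [{"s1": 0.8, "s2": 0.2}, {"s1": 0.6, "s2": 0.4}],
    2: [{"s1": 0.1, "s2": 0.9}, {"s1": 0.3, "s2": 0.7}],
}
C = len(beta_samples)
states = ["s1", "s2"]
# Eq. (7): average over samples within each component, then over components.
beta_t = {
    s: sum(sum(d[s] for d in samples) / len(samples)
           for samples in beta_samples.values()) / C
    for s in states
}
# beta_t["s1"] = ((0.8+0.6)/2 + (0.1+0.3)/2) / 2 = 0.45
```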
These messages quantify which states are actually backward-reachable from goal states.

4.3. Evaluating Forward Sequences

Having calculated the backward messages β̂^t(s^t), we sample forward action sequences a^{0:t-1} using the forward model A and PRADA, yielding messages α_{a^{0:t-1}}(s^t) = P(s^t | a^{0:t-1}). For each a we are interested in its suitability to achieve a goal state at time T with t ≤ T. We use the two-filter of Sec. 3.1 with the backward state distribution of Sec. 4.2 to calculate

  P(R^T | a^{0:t-1}) = Σ_{s^t} P(s^t | a^{0:t-1}) P(R^T | s^t)   (8)
                     ≈ Σ_{s^t} α_{a^{0:t-1}}(s^t) β̂^{T-t}(s^t) .   (9)

The components β̂_c^T are built from conjunctions over grounded primitive predicates and functions which partially describe world states achieving high reward. For instance, a component might define the tower in s2 above. We choose these formulas without taking the initial state s0 or knowledge about actions in terms of rules into account, to separate this clearly from planning. Concerning unspecified properties of the final state, we use a prior P^F(s) and define the component β̂_c^T as

  β̂_c^T(s) ∝ δ_{s,c} P^F(s),   where δ_{s,c} = 1 if s satisfies the conjunction of component c, and 0 otherwise .   (6)

We choose P^F(s) such that states close to the initial state s0 are highly probable. This is inspired by traditional A.I. backward planning: there, state variables not in the goal conjunction are left unspecified until either they need to be set as required by the preconditions of a rule during backward search, or until the initial state is reached, in which case all unspecified variables in the final state implicitly get set to their values in the initial state.

Representing the messages by means of factored frontiers α(s) = Π_i α(s_i) and β̂(s) = Π_i β̂(s_i) (dropping time indices for clarity), where i ranges over the individual state attributes, we calculate this sum over message products as

  Σ_s α(s) β̂(s) = Σ_s Π_i α(s_i) β̂(s_i)                          (10)
                = Π_i Σ_{s_i} α(s_i) β̂(s_i)                       (11)
                = Π_i Σ_{s_i} α(s_i) (1/C) Σ_{c=1}^{C} β̂_c(s_i)   (12)
                = Π_i (1/C) Σ_{s_i} α(s_i) Σ_{c=1}^{C} β̂_c(s_i) . (13)
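The collapse of the exponential sum over joint states into per-attribute sums, as in Eqs. (10)-(11), can be verified numerically on a tiny example (two binary attributes with made-up marginals):

```python
from itertools import product

# Factored frontiers: one marginal per binary state attribute.
alpha = [{0: 0.3, 1: 0.7}, {0: 0.6, 1: 0.4}]
beta = [{0: 0.2, 1: 0.8}, {0: 0.5, 1: 0.5}]

# Naive: enumerate all joint states (exponential in the number of attributes).
naive = sum(
    alpha[0][s0] * alpha[1][s1] * beta[0][s0] * beta[1][s1]
    for s0, s1 in product([0, 1], repeat=2)
)

# Factored: one small sum per attribute, then a product (linear cost).
factored = 1.0
for a_i, b_i in zip(alpha, beta):
    factored *= sum(a_i[v] * b_i[v] for v in (0, 1))

assert abs(naive - factored) < 1e-12
```

This swap of sum and product is exactly what makes the evaluation tractable: with n binary attributes the naive sum has 2^n terms, while the factored form needs only 2n multiplications.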
This bias toward the initial state is also advantageous for the factored frontier, as the repeated multiplication of small probabilities (such as an uninformed 0.5 for binary variables) may lead to very small rule context probabilities, decreasing the probabilities of unique rules.

Figure 3. In our experiments, a robot has to master different tasks in a 3D simulated complex desktop environment involving cubes, balls and boxes of different sizes.

4.4. Action Selection

For a set A = {a_1, ..., a_M} of forward action sequence samples of length D, we determine the best action sequence a* defined as

  a* = argmax_{a ∈ A} P(R | a)   (14)
     = argmax_{a ∈ A} Σ_{T=0}^{D} P(T) P(R^T | a^{0:T-1}) ,   (15)
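The selection in Eqs. (14)-(15) amounts to scoring each candidate sequence by a P(T)-weighted sum of its reward likelihoods over horizons. A minimal sketch with made-up numbers (the helper name and dictionary layout are illustrative):

```python
def best_sequence(candidates, time_prior):
    """Pick the action sequence maximizing the time-prior-weighted sum of
    reward likelihoods. candidates maps a sequence name to its list of
    P(R^T | a^{0:T-1}) values, one per horizon T."""
    def score(seq):
        return sum(p_T * p_R for p_T, p_R in zip(time_prior, candidates[seq]))
    return max(candidates, key=score)

# Two candidate sequences evaluated at three horizons (toy values):
candidates = {"a": [0.1, 0.4, 0.6], "b": [0.5, 0.2, 0.1]}
time_prior = [0.2, 0.3, 0.5]  # P(T), favoring longer horizons here
best = best_sequence(candidates, time_prior)
# score(a) = 0.02 + 0.12 + 0.30 = 0.44 > score(b) = 0.10 + 0.06 + 0.05 = 0.21
```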