Hierarchical POMDP Controller Optimization by Likelihood Maximization

Marc Toussaint, Computer Science, TU Berlin, Berlin, Germany, mtoussai@cs.tu-berlin.de
Laurent Charlin, Computer Science, University of Toronto, Toronto, Ontario, Canada, lcharlin@cs.toronto.edu
Pascal Poupart, Computer Science, University of Waterloo, Waterloo, Ontario, Canada, ppoupart@cs.uwaterloo.ca

Abstract

Planning can often be simplified by decomposing the task into smaller tasks arranged hierarchically. Charlin et al. [4] recently showed that the hierarchy discovery problem can be framed as a non-convex optimization problem. However, the inherent computational difficulty of solving such an optimization problem makes it hard to scale to real-world problems. In another line of research, Toussaint et al. [18] developed a method to solve planning problems by maximum-likelihood estimation. In this paper, we show how the hierarchy discovery problem in partially observable domains can be tackled using a similar maximum likelihood approach. Our technique first transforms the problem into a dynamic Bayesian network through which a hierarchical structure can naturally be discovered while optimizing the policy. Experimental results demonstrate that this approach scales better than previous techniques based on non-convex optimization.

1 Introduction

Planning in partially observable domains is notoriously difficult. However, many planning tasks naturally decompose into subtasks that may be arranged hierarchically. For instance, the design of a soccer playing robot is often decomposed into low-level skills such as intercepting the ball, controlling the ball, passing the ball, etc. [16]. Similarly, prompting systems that assist older adults with activities of daily living (e.g., handwashing [8]) can be naturally decomposed into subtasks for each step of an activity. When a decomposition or hierarchy is known a priori, several approaches have demonstrated that planning can be simplified and performed faster [13, 7]. However, the hierarchy is not always known or easy to specify, and the optimal policy may only decompose approximately. To that effect, Charlin et al. [4] showed how a hierarchy can be discovered automatically by formulating the planning problem as a non-convex quartically constrained optimization problem whose variables correspond to the parameters of the policy, including its hierarchical structure. Unfortunately, the inherent computational difficulty of solving this optimization problem prevents the approach from scaling to real-world problems. Furthermore, it is not clear that automated hierarchy discovery simplifies planning since the space of policies remains the same.

We propose an alternative approach that demonstrates that hierarchy discovery (i) can be done efficiently and (ii) performs a policy search with a different bias than non-hierarchical approaches, which is advantageous when good hierarchical policies exist. The approach combines Murphy and Paskin's [10] factored encoding of hierarchical structures into a dynamic Bayesian network (DBN) (see also [17]) with Toussaint et al.'s [18] maximum-likelihood estimation technique for policy optimization. More precisely, we encode POMDPs with hierarchical controllers into a DBN in such a way that the policy and hierarchy parameters are entries of some conditional probability tables. We also consider factored policies that are more general than hierarchical controllers. The policy and hierarchy parameters are optimized with the expectation-maximization (EM) algorithm [5]. Since each iteration of EM essentially consists of inference queries, the approach scales easily.

Sect. 2 briefly introduces partially observable Markov decision processes, controllers and policy optimization by maximum likelihood estimation. Sect. 3 reviews previous work on hierarchical modeling and how to use a dynamic Bayesian network to encode a hierarchical structure. Sect. 4 describes our proposed approach, which combines a dynamic Bayesian network encoding with maximum likelihood estimation to simultaneously optimize a hierarchy and the controller. Sect. 5 demonstrates the scalability of the proposed approach on benchmark problems. Finally, Sect. 6 summarizes the paper and discusses future work.
2 Background

Throughout the paper we denote random variables by upper case letters (e.g., $X$), values of random variables by their corresponding lower case letters (e.g., $x \in \mathrm{dom}(X)$) and sets of values by calligraphic upper case letters (e.g., $\mathcal{X} = \{x_1, x_2, x_3\}$). We now review POMDPs (Sect. 2.1), how to represent policies as finite state controllers (Sect. 2.2) and how to optimize bounded controllers (Sect. 2.3).

2.1 POMDPs

Partially observable Markov decision processes (POMDPs) provide a natural and principled framework for planning. A POMDP can be formally defined by a tuple $\langle S, A, O, p_s, p_{s'|as}, p_{o'|s'a}, r_{as}\rangle$ where $S$ is the set of states $s$, $A$ is the set of actions $a$, $O$ is the set of observations $o$, $p_s = \Pr(S_0 = s)$ is the initial state distribution (a.k.a. initial belief), $p_{s'|as} = \Pr(S_{t+1} = s' \mid A_t = a, S_t = s)$ is the transition distribution, $p_{o'|s'a} = \Pr(O_{t+1} = o' \mid S_{t+1} = s', A_t = a)$ is the observation distribution and $r_{as} = R(A_t = a, S_t = s)$ is the reward function. Throughout the paper, it is assumed that $S$, $A$ and $O$ are finite and discrete, and that rewards are discounted by a factor $\gamma \in [0, 1)$.

The goal is to select actions to maximize the rewards. At any point in time, the information available to select the next action consists of the history of past actions and observations. Hence a policy is defined as a mapping from histories to actions. However, since histories grow with time, it is common practice to summarize histories with a fixed-length sufficient statistic such as the belief distribution $b_s = \Pr(S = s)$, which corresponds to the state distribution (conditioned on the history of past actions and observations). The belief distribution $b$ can be updated at each time step, based on the action $a$ taken and the observation $o'$ made, according to Bayes' theorem: $b^{ao'}_{s'} = k \sum_s b_s\, p_{s'|as}\, p_{o'|s'a}$ ($k$ is a normalization constant). Policies can then be defined as mappings from beliefs to actions (e.g., $\pi(b) = a$). The value $V^\pi(b)$ of a policy $\pi$ starting in belief $b$ is measured by the discounted sum of expected rewards: $V^\pi(b) = \sum_t \gamma^t E_{b_t|\pi,b}[r_{\pi(b_t) b_t}]$ where $r_{ab} = \sum_s b_s r_{as}$. An optimal policy $\pi^*$ is a policy with the highest value for all beliefs: $V^{\pi^*}(b) \ge V^\pi(b)\ \forall \pi, b$. The optimal value function also satisfies Bellman's equation: $V^*(b) = \max_a \big[r_{ab} + \gamma \sum_{o'} p_{o'|ab}\, V^*(b^{ao'})\big]$ where $p_{o'|ab} = \sum_{s,s'} b_s\, p_{s'|as}\, p_{o'|s'a}$.
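To make the belief update concrete, the following is a minimal sketch (not taken from the paper), assuming the POMDP is stored as NumPy arrays T[a,s,s'] = p(s'|a,s) and Z[a,s',o'] = p(o'|s',a); the array layout and names are illustrative.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Bayes-filter belief update b^{ao'}_{s'} = k * sum_s b_s p(s'|a,s) p(o'|s',a).

    b : (|S|,)        current belief over states
    a : int           action taken
    o : int           observation received
    T : (|A|,|S|,|S|) transition probabilities T[a,s,s'] = p(s'|a,s)
    Z : (|A|,|S|,|O|) observation probabilities Z[a,s',o'] = p(o'|s',a)
    """
    b_pred = b @ T[a]            # sum_s b_s p(s'|a,s)
    b_new = b_pred * Z[a, :, o]  # weight by the observation likelihood p(o'|s',a)
    return b_new / b_new.sum()   # k normalizes the belief to sum to 1
```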
2.2 Finite State Controllers

A convenient representation for an important class of policies consists of finite state controllers [6]. Instead of using beliefs as sufficient statistics of histories, the idea is to use a finite internal memory to retain the relevant bits of information from histories. Each configuration of this memory can be thought of as a node in a finite state controller, where nodes select actions to be executed and edges indicate how to update nodes based on the observations received.

A controller with a finite set $N$ of nodes $n$ can encode a stochastic policy with three distributions: $\Pr(N_0 = n) = p_n$ (initial node distribution), $\Pr(A_t = a \mid N_t = n) = p_{a|n}$ (action selection distribution) and $\Pr(N_{t+1} = n' \mid N_t = n, O_{t+1} = o') = p_{n'|no'}$ (successor node distribution). Such a policy can be executed by starting in a node $n$ sampled from $p_n$, executing an action $a$ sampled from $p_{a|n}$, receiving observation $o'$, transitioning to node $n'$ sampled from $p_{n'|no'}$, and so on. The value of a controller can be computed by solving a linear system: $V_{ns} = \sum_a p_{a|n}\big[r_{as} + \gamma \sum_{s'o'n'} p_{s'|as}\, p_{o'|s'a}\, p_{n'|no'}\, V_{n's'}\big]\ \forall n, s$. The value at a given belief $b$ is then $V(b) = \sum_{ns} b_s\, p_n\, V_{ns}$.
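The evaluation equation above can be solved directly as one linear system over the joint (node, state) space. The sketch below is illustrative only (it is not the authors' implementation) and reuses the same hypothetical array layout as the belief-update example.

```python
import numpy as np

def evaluate_controller(T, Z, R, p_n, p_a_n, p_n_no, b, gamma):
    """Evaluate a finite state controller by solving
       V_{ns} = sum_a p(a|n) [ r_{as} + gamma * sum_{s',o',n'}
                 p(s'|a,s) p(o'|s',a) p(n'|n,o') V_{n's'} ]   for all n, s,
       then return V(b) = sum_{n,s} b_s p_n V_{ns}.

    T : (|A|,|S|,|S|)      transition probabilities T[a,s,s']
    Z : (|A|,|S|,|O|)      observation probabilities Z[a,s',o']
    R : (|A|,|S|)          rewards r_{as}
    p_n    : (|N|,)        initial node distribution
    p_a_n  : (|N|,|A|)     action selection p(a|n)
    p_n_no : (|N|,|O|,|N|) successor node p(n'|n,o')
    b : (|S|,)             belief at which to evaluate
    """
    nN = p_a_n.shape[0]
    nS = T.shape[1]
    # expected immediate reward for every (n, s): sum_a p(a|n) r_{as}
    r = p_a_n @ R                                               # (|N|,|S|)
    # joint (node, state) transition kernel M[n,s,n',s']
    M = np.einsum('na,ast,ato,nom->nsmt', p_a_n, T, Z, p_n_no)
    M = M.reshape(nN * nS, nN * nS)
    V = np.linalg.solve(np.eye(nN * nS) - gamma * M, r.reshape(-1))
    V = V.reshape(nN, nS)
    return p_n @ V @ b
```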
2.3 Policy Optimization

Several techniques have been proposed to optimize controllers of a given size, including gradient ascent [9], stochastic local search [2], bounded policy iteration [14], non-convex quadratically constrained optimization [1] and likelihood maximization [18]. We briefly describe the last technique since we will use it in Sect. 4.

Toussaint et al. [18] recently proposed to convert POMDPs into equivalent dynamic Bayesian networks (DBNs) by normalizing the rewards, and to optimize a policy by maximizing the likelihood of the normalized rewards. Let $\tilde R$ be a binary variable corresponding to normalized rewards. The reward function $r_{as}$ is then replaced by a reward distribution $\tilde p_{r|sat} = \Pr(\tilde R = r \mid A_t = a, S_t = s, T = t)$ that assigns probability $(r_{as} - r_{min})/(r_{max} - r_{min})$ to $\tilde R = 1$ and $1 - (r_{as} - r_{min})/(r_{max} - r_{min})$ to $\tilde R = 0$ ($r_{min} = \min_{as} r_{as}$ and $r_{max} = \max_{as} r_{as}$). An additional time variable $T$ is introduced to simulate the discount factor and the summation of rewards. Since a reward is normally discounted by a factor $\gamma^t$ when earned $t$ time steps in the future, the prior $p_t = \Pr(T = t)$ is set to $\gamma^t (1 - \gamma)$, where the factor $(1 - \gamma)$ ensures that $\sum_{t=0}^{\infty} p_t = 1$.

[Figure 1: POMDP represented as a mixture of finite DBNs. For an infinite horizon, a large enough tmax can be selected at runtime to ensure that the approximation error is small.]

The resulting dynamic Bayesian network is illustrated in Fig. 1. It can be thought of as a mixture of finite processes of length $t$ with a 0-1 reward $\tilde R$ earned at the end of the process. The nodes $N_t$ encode the internal memory of the controller. Given the controller distributions $p_n$, $p_{a|n}$ and $p_{n'|no'}$, it is possible to evaluate the controller by computing the likelihood of $\tilde R = 1$. More precisely, $V(p_s) = \big[\Pr(\tilde R = 1)\,(r_{max} - r_{min}) + r_{min}\big]/(1 - \gamma)$. Optimizing the policy can be framed as maximizing the likelihood of $\tilde R = 1$ by varying the distributions $p_n$, $p_{a|n}$ and $p_{n'|no'}$ encoding the policy. Toussaint et al. use the expectation-maximization (EM) algorithm. Since EM is guaranteed to increase the likelihood at each iteration, the controller's value increases monotonically. However, EM is not guaranteed to converge to a global optimum. An important advantage of the EM algorithm is its simplicity, both conceptually and computationally. In particular, the computation consists of inference queries that can be answered using a variety of exact and approximate algorithms.

3 Hierarchical Modeling

While optimizing a bounded controller allows an effective search in the space of bounded policies, such an approach is clearly suboptimal since the optimal controller of many problems grows doubly exponentially with the planning horizon and may be infinite for infinite horizons. Alternatively, hierarchical representations permit the representation of structured policies with exponentially fewer parameters.

Several approaches were recently explored to model and learn hierarchical structures in POMDPs. Pineau et al. [13] sped up planning by exploiting a user-specified action hierarchy. Hansen et al. [7] proposed hierarchical controllers and an alternative planning technique that also exploits a user-specified hierarchy. Charlin et al. [4] proposed recursive controllers (which subsume hierarchical controllers) and an approach that discovers the hierarchy while optimizing a controller. We briefly review recursive controllers in Sect. 3.1 since we will empirically compare our approach to the non-convex optimization techniques used to optimize recursive controllers. In another line of research, Murphy and Paskin [10] proposed to model hierarchical hidden Markov models (HMMs) with a dynamic Bayesian network (DBN). Theocharous et al. [17] also used DBNs to model hierarchical POMDPs. We briefly review this DBN encoding in Sect. 3.2 since we will use it in our approach to model factored controllers.

3.1 Recursive Controllers

A recursive controller [4] consists of a recursive automaton with concrete nodes $n$ and abstract nodes $\bar n$. Abstract nodes call a subcontroller before selecting an action. A controller is said to be recursive when it can call itself, essentially encoding an infinite hierarchy. Formally, a recursive controller is parametrized by an action selection distribution for each node (e.g., $p_{a|n}$ and $p_{a|\bar n}$), a successor node distribution for each node (e.g., $p_{n'|no'}$ and $p_{n'|\bar n o'}$) and a child node distribution for each abstract node (e.g., $p_{n|\bar n}$).¹ Execution of a recursive controller is performed by executing the action selected by each node visited and continuing to the successor node selected by the observation made. However, when an abstract node is visited, before executing the action selected, its subcontroller is called and started in the child node selected by the child node distribution. A subcontroller returns control to its parent node when a special end node is reached.

¹ The $p_{a|n}$ and $p_{n'|no'}$ distributions are combined in one distribution $p_{n'a|no'}$ in [14].

Charlin et al. [4] show that optimizing a recursive controller with a fixed number of concrete and abstract nodes can be framed as a non-convex quartically constrained optimization problem. The hierarchical structure is discovered as the controller is optimized, since the variables of the optimization problem include the child node distributions, which implicitly encode the hierarchy. Three techniques based on a general non-linear solver, a mixed-integer non-linear approximation and a form of bounded hierarchical policy iteration are experimented with, but they do not scale beyond toy problems. Furthermore, Charlin et al. do not demonstrate whether searching in the space of hierarchical controllers can speed up planning. Although it is clear that planning is simplified when a hierarchy is given a priori since the policy space is reduced, it is not clear that hierarchy discovery is beneficial since the policy space remains the same while the parameter space changes. In Sect. 5, we demonstrate that hierarchy discovery can be beneficial when a simple hierarchical policy of high value exists.
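The execution semantics described above can be pictured as a small interpreter that keeps a call stack of active nodes. The sketch below is a loose illustration under the stated semantics, not code from [4]; the controller interface (is_abstract, is_end, sample_child, sample_action, sample_successor) and env.step are hypothetical names.

```python
def run_controller(env, ctrl, start_node, max_steps=1000):
    """Loose sketch of the execution of a (recursive) hierarchical controller.

    The controller object is assumed to expose:
      is_abstract(n), is_end(n)   node-type queries
      sample_child(n)             child node distribution of an abstract node
      sample_action(n)            action selection distribution
      sample_successor(n, o)      successor node distribution given observation o
    and env.step(a) is assumed to return the next observation.
    """
    stack = [(start_node, False)]      # (node, subcontroller already called?)
    for _ in range(max_steps):
        node, called = stack[-1]
        if ctrl.is_end(node):
            stack.pop()                # end node: return control to the parent node
            if not stack:
                break
            continue
        if ctrl.is_abstract(node) and not called:
            stack[-1] = (node, True)                        # remember the call was made
            stack.append((ctrl.sample_child(node), False))  # descend into the subcontroller
            continue
        # concrete node, or abstract node whose subcontroller has returned:
        # execute the selected action, then follow the successor distribution
        a = ctrl.sample_action(node)
        o = env.step(a)
        stack[-1] = (ctrl.sample_successor(node, o), False)
```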
3.2 Hierarchical HMMs

[Figure 2: DBN encoding of a 2-level hierarchical HMM.]

Murphy and Paskin [10] proposed to model hierarchical hidden Markov models (HMMs) as dynamic Bayesian networks (DBNs). The idea is to convert a hierarchical HMM of $L$ levels into a dynamic Bayesian network with $L$ state variables, where each variable encodes abstract states at the corresponding level. Here, abstract states can only call sub-HMMs at the previous level. Fig. 2 illustrates a two-level hierarchical HMM encoded as a DBN. The state variables $S^l_t$ are indexed by the time step $t$ and the level $l$. The $E_t$ variables indicate when a base-level sub-HMM has ended, returning control to the top-level HMM. The top-level abstract state transitions according to the top HMM, but only when the exit variable $E_t$ indicates that the base-level concrete state is an exit state. The base-level concrete state transitions according to the base-level HMM. When an exit state is reached, the next base-level state is determined by the next top-level abstract state. Factored HMMs subsume hierarchical HMMs in the sense that there exists an equivalent factored HMM for every hierarchical HMM. In Sect. 4.1, we will use a similar technique to convert hierarchical controllers into factored controllers.

4 Factored Controllers

We propose to combine the DBN encoding techniques of Murphy and Paskin [10] and Toussaint et al. [18] to convert a POMDP with a hierarchical controller into a mixture of DBNs. The hierarchy and the controller are simultaneously optimized by maximizing the reward likelihood of the DBN. We also consider factored controllers, which subsume hierarchical controllers.

4.1 DBN Encoding

Consider a POMDP defined by the tuple $\langle S, A, O, p_s, p_{s'|as}, p_{o'|s'a}, r_{as}\rangle$ and a three-level hierarchical (non-recursive) controller defined by the tuple $\langle p_{a|n^0}, p_{n^{l-1}|n^l}, p_{n'^l|n^l o'}\rangle\ \forall l$. Fig. 3a illustrates two consecutive slices of one DBN in the mixture (rewards are omitted) for a three-level hierarchical controller. The conditional probability distributions of the mixture of DBNs (denoted by $\hat p$) are:

· transition distribution: $\hat p_{s'|as} = p_{s'|as}$
· observation distribution: $\hat p_{o'|s'a} = p_{o'|s'a}$
· reward distribution: $\hat p_{r|as} = (r_{as} - r_{min})/(r_{max} - r_{min})$
· mixture distribution: $\hat p_t = (1 - \gamma)\gamma^t$
· action distribution: $\hat p_{a|n^0} = p_{a|n^0}$
· base-level node distribution: $\hat p_{n'^0|n^0 n'^1 o' e^0} = p_{n'^0|n'^1}$ if $e^0 = \mathrm{exit}$; $p_{n'^0|o' n^0}$ otherwise
· middle-level node distribution: $\hat p_{n'^1|n^1 n'^2 o' e^0 e^1} = p_{n'^1|n'^2}$ if $e^1 = \mathrm{exit}$; $p_{n'^1|o' n^1}$ if $e^0 = \mathrm{exit}$ and $e^1 \neq \mathrm{exit}$; $\delta_{n'^1 n^1}$ otherwise
· top-level node distribution: $\hat p_{n'^2|o' n^2 e^1} = p_{n'^2|o' n^2}$ if $e^1 = \mathrm{exit}$; $\delta_{n'^2 n^2}$ otherwise
· base-level exit distribution: $\hat p_{e^0 = \mathrm{exit}|n^0} = 1$ if $n^0$ is an end node; $0$ otherwise
· middle-level exit distribution: $\hat p_{e^1 = \mathrm{exit}|n^1 e^0} = 1$ if $e^0 = \mathrm{exit}$ and $n^1$ is an end node; $0$ otherwise

While the $E^l_t$ variables help clarify when the end of a subcontroller is reached, they are not necessary. Eliminating them yields the simpler DBN illustrated in Fig. 3b. The conditional probability distributions of the $N^l_t$ variables become:

· base-level node distribution: $\hat p_{n'^0|n^0 n'^1 o'} = p_{n'^0|n'^1}$ if $n^0$ is an end node; $p_{n'^0|o' n^0}$ otherwise
· middle-level node distribution: $\hat p_{n'^1|n^1 n'^2 o'} = p_{n'^1|n'^2}$ if $n^1$ and $n^0$ are end nodes; $p_{n'^1|o' n^1}$ if $n^0$ is an end node but not $n^1$; $\delta_{n'^1 n^1}$ otherwise
· top-level node distribution: $\hat p_{n'^2|n^2 o'} = p_{n'^2|n^2 o'}$ if $n^1$ and $n^0$ are end nodes; $\delta_{n'^2 n^2}$ otherwise

Note that ignoring the above constraints in the conditional distributions yields a factored controller that is more flexible than a hierarchical controller, since the conditional probability distributions of the $N^l_t$ variables do not have to follow the structure imposed by a hierarchy.
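As an illustration of how these constraints translate into concrete conditional probability tables, the sketch below builds the node CPTs of a two-level analogue of the Fig. 3b encoding (the construction listed above is three-level); the array names, shapes and the two-level reduction are assumptions made for illustration only.

```python
import numpy as np

def hierarchical_node_cpts(p_child, p_succ0, p_succ1, is_end):
    """Build constrained node-transition CPTs for a 2-level hierarchical controller
    in the exit-variable-free style of Fig. 3b (illustrative, not the paper's code).

    p_child : (N1, N0)     p(n0' | n1')      child node distribution
    p_succ0 : (N0, O, N0)  p(n0' | n0, o')   base-level successor distribution
    p_succ1 : (N1, O, N1)  p(n1' | n1, o')   top-level successor distribution
    is_end  : (N0,) bool   marks base-level end nodes
    Returns
    cpt0 : (N0, N1, O, N0)  p_hat(n0' | n0, n1', o')
    cpt1 : (N1, N0, O, N1)  p_hat(n1' | n1, n0, o')
    """
    N1, N0 = p_child.shape
    nO = p_succ0.shape[1]
    cpt0 = np.empty((N0, N1, nO, N0))
    cpt1 = np.empty((N1, N0, nO, N1))
    eye1 = np.eye(N1)
    for n0 in range(N0):
        if is_end[n0]:
            # base level: restart in a child of the (new) top-level node n1'
            cpt0[n0] = p_child[:, None, :] * np.ones((N1, nO, N0))
            # top level: free to transition on the observation
            cpt1[:, n0] = p_succ1
        else:
            # base level: follow the base-level successor distribution
            cpt0[n0] = p_succ0[n0][None, :, :] * np.ones((N1, nO, N0))
            # top level: frozen until the base-level subcontroller ends
            cpt1[:, n0] = eye1[:, None, :] * np.ones((N1, nO, N1))
    return cpt0, cpt1
```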
[Figure 3: (a) Two slices of the DBN encoding the hierarchical POMDP controller. (b) A version where exit variables are eliminated. (c) Variables O and A are eliminated. (d) The corresponding junction tree (or rather chain) for inference.]

4.2 Maximum Likelihood Estimation

Following Toussaint et al.'s technique [18], we optimize a factored controller by maximizing the reward likelihood. Since the policy parameters are conditional probability distributions of the DBN, the EM algorithm can be used to optimize them. Computation alternates between the E and M steps below. We denote by $n^{top}$ and $n^{base}$ the top and base nodes in a given time slice. We also denote by $\pi(V)$ and $\pi(v)$ the parents of $V$ and a configuration of the parents of $V$.

E-step: expected frequencies of the hidden variables
$E_{n^{top}} = \Pr(N_0^{top} = n^{top} \mid \tilde R = 1)$
$E_{a n^{base}} = \sum_t \Pr(A_t = a, N_t^{base} = n^{base} \mid \tilde R = 1)$
$E_{n'^l \pi(n'^l)} = \sum_t \Pr(N_{t+1}^l = n'^l, \pi(N_{t+1}^l) = \pi(n'^l) \mid \tilde R = 1)$

M-step: relative frequency computation
$p_{n^{top}} = E_{n^{top}} / \sum_{n^{top}} E_{n^{top}}$
$p_{a|n^{base}} = E_{a n^{base}} / \sum_a E_{a n^{base}}$
$p_{n'^l|\pi(n'^l)} = E_{n'^l \pi(n'^l)} / \sum_{n'^l} E_{n'^l \pi(n'^l)}$

4.2.1 Parameter initialization

W.l.o.g. we initialize the start node $N_0^{top}$ of the top layer to be the first node (i.e., $\Pr(N_0^{top} = 1) = 1$). The node conditional distributions $p_{n'^l|\pi(n'^l)}$ are initialized randomly as a mixture of three distributions:

$p_{n'^l|\pi(n'^l)} \propto c_1 + c_2\, U_{n'^l \pi(n'^l)} + c_3\, \delta_{n'^l n^l}$

The mixture components are a uniform distribution, a random distribution $U_{n'^l \pi(n'^l)}$ (an array of uniform random numbers in $[0,1]$), and a term enforcing $n^l$ to stay unchanged. For the node distributions at the base level we choose $c_1 = 1$, $c_2 = 1$, $c_3 = 0$, and for all other levels we choose $c_1 = 1$, $c_2 = 1$, $c_3 = 10$. Similarly, we initialize the action probabilities as $p_{a|n^{base}} \propto c_1 + c_2\, U_{a n^{base}} + c_3\, \delta_{a,(n^{base} \bmod |A|)}$ with $c_1 = 1$, $c_2 = 1$, $c_3 = 100$, where the last term enforces each node $n^{base} = i$ to be associated with action $a = i \bmod |A|$.

4.2.2 E-step

To speed up the computation of the inference queries in the E-step, we compute intermediate terms using a forward-backward procedure. Let $t_{max}$ be the largest value of $T$; then a simple scheme that answers each query separately takes $O(t_{max}^2)$ time, since there are $O(t_{max})$ queries and each query takes $O(t_{max})$ time to run over the entire network. However, since part of the computation is duplicated in several queries, it is possible to compute intermediate terms $\alpha$ and $\beta$ in $O(t_{max})$ time from which each expectation can be computed in constant time (w.r.t. $t_{max}$). To simplify notation, $N$ and $n$ denote all the nodes and their joint configuration in a given time slice.

Forward term: $\alpha^t_{ns} = \Pr(N_t = n, S_t = s)$
$\alpha^0_{ns} = p_n\, p_s$
$\alpha^t_{n's'} = \sum_{n,s} \alpha^{t-1}_{ns}\, p_{n's'|ns}$

Backward term: $\beta^\tau_{ns} = \Pr(\tilde R = 1 \mid N_{t-\tau} = n, S_{t-\tau} = s, T = t)$
$\beta^0_{ns} = \sum_a \hat p_{r|as}\, p_{a|n}$
$\beta^\tau_{ns} = \sum_{n',s'} p_{n's'|ns}\, \beta^{\tau-1}_{n's'}$

To fully take advantage of the structure of the DBN, we first marginalize the DBN w.r.t. the observations and actions to get the DBN in Fig. 3c. This 2-slice DBN corresponds to the joint transition distribution $p_{n's'|ns}$ used in the above equations. Then we compile this 2-slice DBN into the junction tree (actually a junction chain) given in Fig. 3d.
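A minimal sketch of the forward-backward pass above, assuming the joint (node, state) transition $p_{n's'|ns}$ has already been marginalized into a single matrix as described; the variable names are illustrative and the dense matrix ignores the junction-chain factorization used in the paper.

```python
import numpy as np

def forward_backward(p_ns0, P_joint, r_hat_ns, gamma, t_max):
    """Forward-backward recursions over the joint (node, state) chain.

    p_ns0    : (K,)   initial distribution over joint (n, s) configurations
    P_joint  : (K, K) joint transition p(n's' | ns), rows indexed by (n, s)
    r_hat_ns : (K,)   expected normalized reward sum_a p_hat(r=1|a,s) p(a|n)
    Returns the time-prior-weighted terms
      alpha_tilde = sum_t Pr(T=t) alpha^t   and   beta_tilde = sum_t Pr(T=t) beta^t.
    """
    K = p_ns0.shape[0]
    alpha = p_ns0.copy()                  # alpha^0
    beta = r_hat_ns.copy()                # beta^0
    alpha_tilde = np.zeros(K)
    beta_tilde = np.zeros(K)
    for t in range(t_max + 1):
        p_T = (1.0 - gamma) * gamma ** t  # Pr(T = t) = gamma^t (1 - gamma)
        alpha_tilde += p_T * alpha
        beta_tilde += p_T * beta
        alpha = alpha @ P_joint           # alpha^{t+1}_{n's'} = sum_{ns} alpha^t_{ns} p(n's'|ns)
        beta = P_joint @ beta             # beta^{tau+1}_{ns} = sum_{n's'} p(n's'|ns) beta^tau_{n's'}
    return alpha_tilde, beta_tilde
```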
Let $\tilde\alpha_{ns} = \sum_\tau \Pr(T = \tau)\,\alpha^\tau_{ns}$ and $\tilde\beta_{ns} = \sum_t \Pr(T = t)\,\beta^t_{ns}$; then the last two expectations of the E-step can be computed as follows:²

$E_{a n^{base}} \propto \sum_{s,\, n\setminus\{n^{base}\}} \tilde\alpha_{ns}\, p_{a|n^{base}} \big[\hat p_{r|as} + \sum_{s',o',n'} p_{s'|as}\, p_{o'|s'a}\, p_{n'|o'n}\, \tilde\beta_{n's'}\big]$

$E_{n'^l \pi(n'^l)} \propto \sum_{s,a,s',o',\, n\setminus\pi(n'^l),\, n'^{-l}} \tilde\alpha_{ns}\, p_{a|n^{base}}\, p_{s'|as}\, p_{o'|s'a}\, p_{n'|o'n} \big[\hat p_{r|as} + \tilde\beta_{n's'}\big]$

² The first expectation of the E-step does not need to be computed since $\Pr(N_0^{top} = 1) = 1$.

4.2.3 M-step

The standard M-step adjusts each parameter $p_{v|\pi(v)}$ by normalizing the expectations computed in the E-step, i.e., $p^{new}_{v|\pi(v)} \propto E_{v\pi(v)}$. To speed up convergence, we instead use a variant that performs a softened greedy M-step. In the greedy M-step, each parameter $p^{new}_{v|\pi(v)}$ is greedily set to 1 when $v = \mathrm{argmax}_{\bar v} f_{\bar v\pi(v)}$ and 0 otherwise, where $f_{v\pi(v)} = E_{v\pi(v)}/p^{old}_{v|\pi(v)}$. The greedy M-step can be thought of as the limit of an infinite sequence of alternating partial E-steps and standard M-steps where the partial E-step keeps $f$ fixed. The combination of a standard M-step with this specific partial E-step updates $p_{v|\pi(v)}$ by a multiplicative factor proportional to $f_{v\pi(v)}$. In the limit, the largest $f_{v\pi(v)}$ ends up giving all the probability to the corresponding $p_{v|\pi(v)}$. EM variants with certain types of partial E-steps ensure monotonic improvement of the likelihood when the hidden variables are independent [11]. This is not the case here; however, by softening the greedy M-step we can still obtain monotonic improvement most of the time while speeding up convergence. We update $p_{v|\pi(v)}$ as follows:

$v^* = \mathrm{argmax}_v f_{v\pi(v)}, \qquad p^{new}_{v|\pi(v)} \propto p^{old}_{v|\pi(v)}\, [\delta_{vv^*} + c + \epsilon]$

For $c = 0$ and $\epsilon = 0$ this is the greedy M-step. We use $c = 3$, which softens (shortens) the step and improves convergence. Furthermore, adding small Gaussian noise $\epsilon \sim N(0, 10^{-3})$ helps to escape local minima.

4.2.4 Complexity

For a flat controller, the number of parameters (neglecting normalization) is $|O||N|^2$ for $p_{n'|o'n}$ and $|A||N|$ for $p_{a|n}$. The complexity of the forward (backward) procedure is $O(t_{max}(|N||S|^2 + |N|^2|S|))$, where the two terms correspond to the sizes of the two cliques for inference in the 2-slice DBN after $O$ and $A$ are eliminated. The complexity of computing the expectations from $\tilde\alpha$ and $\tilde\beta$ is $O(|N||A|(|S|^2 + |S||O|) + |N|^2|S||O|)$, which corresponds to the clique sizes of the 2-slice DBN including $O$ and $A$. In comparison, 2-level hierarchical and factored controllers with $|N^{top}| = |N^{base}| = |N|^{0.5}$ nodes at each level have fewer parameters and a smaller complexity, but also a smaller policy space due to the structure imposed by the hierarchy/factorization. While there is a tradeoff between policy space and complexity, hierarchical and factored controllers are often advantageous in practice since they can more quickly find a good hierarchical/factored policy when one exists. A 2-level factored controller with $|N|^{0.5}$ nodes at each level has $2|O||N|^{1.5}$ parameters for $p_{n'^{top}|o' n^{base} n^{top}}$ and $p_{n'^{base}|n'^{top} o' n^{base}}$, and $|A||N|^{0.5}$ parameters for $p_{a|n^{base}}$. The complexity of the forward (backward) procedure is $O(t_{max}(|N||S|^2 + |N|^{1.5}|S|))$ and the complexity of computing the expectations is $O(|N||A|(|S|^2 + |S||O|) + |N|^{1.5}|O||S| + |N|^2|O|)$. A 2-level hierarchical controller is further restricted and therefore has fewer parameters, but the same time complexity. Table 1 summarizes these counts.

Table 1: Number of parameters and computational complexity for a flat controller with |N| nodes and a 2-layer factored controller with |N^top| = |N^base| = |N|^0.5 nodes.

  # parameters:                 flat: |O||N|^2 + |A||N|;   factored: 2|O||N|^1.5 + |A||N|^0.5
  forward-backward complexity:  flat: O(tmax(|N||S|^2 + |N|^2|S|));   factored: O(tmax(|N||S|^2 + |N|^1.5|S|))
  expectation complexity:       flat: O(|N||A|(|S|^2 + |S||O|) + |N|^2|S||O|);   factored: O(|N||A|(|S|^2 + |S||O|) + |N|^1.5|O||S| + |N|^2|O|)
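For concreteness, here is a small sketch of the softened greedy M-step of Sect. 4.2.3 applied to one conditional probability table. It follows the update $p^{new} \propto p^{old}[\delta_{vv^*} + c + \epsilon]$ with the constants quoted in the text; the array layout (and reading $N(0, 10^{-3})$ as the noise scale) is an assumption.

```python
import numpy as np

def softened_greedy_m_step(E, p_old, c=3.0, noise=1e-3, rng=np.random):
    """Softened greedy M-step for one conditional distribution p(v | pa(v)).

    E     : (V, P) expected counts E_{v, pa(v)} from the E-step
    p_old : (V, P) current parameters p(v | pa(v)), columns sum to 1
    Returns updated parameters whose columns again sum to 1.
    """
    f = E / np.maximum(p_old, 1e-12)           # f_{v, pa(v)} = E_{v, pa(v)} / p_old
    v_star = np.argmax(f, axis=0)              # greedy winner per parent configuration
    delta = np.zeros_like(p_old)
    delta[v_star, np.arange(p_old.shape[1])] = 1.0
    eps = rng.normal(0.0, noise, size=p_old.shape)   # small Gaussian noise to escape local optima
    p_new = p_old * (delta + c + eps)
    p_new = np.maximum(p_new, 0.0)             # guard against rare negative entries from the noise
    return p_new / p_new.sum(axis=0, keepdims=True)
```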
5 Experiments

We first compared the performance of the maximum likelihood (ML) approach to the previous optimization-based approaches from [4]. Table 2 summarizes the results for 2-layer controllers with certain combinations of $|N^{base}|$ and $|N^{top}|$. The problems include paint, shuttle and 4x4 maze (previously used in [4]) and three additional problems: chain-of-chains (described below), handwashing (reduced version from [8]) and cheese-taxi (variant from [12]). On the first three problems, ML reaches the same values as the previous optimization-based approaches, but with larger controllers. We attribute this to EM's weaker ability to avoid local optima compared to the optimization-based approaches. However, the optimization-based approaches run out of memory on the last three problems (memory needs exceed 2 GB of RAM), while ML scales gracefully (as analyzed in Sect. 4.2.4). The ML approach thus demonstrates that hierarchy discovery can be tackled with tractable algorithms. We also report the values reached with a state-of-the-art point-based value iteration method (HSVI2 [15]).

Table 2: V* denotes optimal values (with truncated trajectories) [3], except for handwashing and cheese-taxi where we show the optimal value of the equivalent fully observable problem. HSVI2 found a solution in less than 1 s for every problem except handwashing, where the algorithm was halted after 12 hours of computation. The ML approach optimizes a factored controller for 200 EM iterations with a planning horizon of tmax = 100. (5,3) nodes means |N^base| = 5 and |N^top| = 3. For cheese-taxi, we get a maximum value of 2.25. N/A indicates that the solver did not complete successfully. All tests are done on a dual-core x64 processor @ 2.2 GHz. Columns: problem; |S|,|A|,|O|; V*; HSVI2 value; best results from [4] (nodes, time, value); ML approach averaged over 10 runs (nodes, time, value).

  Problem           |S|,|A|,|O|   V*      HSVI2 V        [4] nodes  [4] t(s)  [4] V    ML nodes  ML t(s)      ML V
  paint             4, 4, 2       3.28    3.29 ± 0.04    (1,3)      <1        3.29     (5,3)     0.96 ± 0.3   3.26 ± 0.004
  shuttle           8, 3, 5       32.7    32.9 ± 0.8     (1,3)      2         31.87    (5,3)     2.81 ± 0.2   31.6 ± 0.5
  4x4 maze          16, 4, 2      3.7     3.75 ± 0.1     (1,2)      30        3.73     (3,3)     2.8 ± 0.8    3.72 ± 8e-5
  chain-of-chains   10, 4, 1      157.1   157.1 ± 0      (3,3)      10        0.0      (10,3)    6.4 ± 0.2    151.6 ± 2.6
  handwashing       84, 7, 12     1052    N/A            N/A        N/A       N/A      (10,5)    655 ± 2      984 ± 1
  cheese-taxi       33, 7, 10     5.3     2.53 ± 0.3     N/A        N/A       N/A      (10,3)    311 ± 14     -9 ± 11 (2.25)

The next question is whether there are computational savings when automatically discovering a hierarchy. Recall that previous work has shown that policy optimization is simplified when a hierarchy is known a priori, since the space of policies is restricted. The next experiment demonstrates that policy optimization while discovering a hierarchy can be done faster and/or yield higher value when good hierarchical policies exist. Table 3 compares the performance when optimizing flat, hierarchical and factored controllers on chain-of-chains, handwashing and cheese-taxi. Here, the factored and hierarchical controllers have two levels and correspond respectively to the DBNs in Fig. 3(a) and 3(b).³ The x-axis is the number of nodes for flat controllers and the product of the number of nodes at each level for hierarchical and factored controllers. Taking the product is justified by the fact that the equivalent flat controllers of some hierarchical/factored controllers require that many nodes.
Table 3: Left: the values reached as a function of the number of nodes in the controller. For the factored and hierarchical controllers we indicate the number of nodes in both layers (e.g., (5,3) means |N^base| = 5 and |N^top| = 3) and plot the data point at |N^base||N^top| on the x-axis. For instance, in the case of handwashing we see how the performance depends critically on |N^base|. Right: the optimization time. In all cases, 200 EM iterations are performed with a planning horizon of tmax = 100. The results for each controller are the average of 10 runs with error bars of ±1 standard deviation. [Plots: value (left) and optimization time in seconds (right) versus number of nodes, for flat, factored and hierarchical controllers on chain-of-chains, handwashing and cheese-taxi.]

The graphs in the right column of Table 3 demonstrate that hierarchical and factored controllers can be optimized faster, confirming the analysis done in Sect. 4.2.4. There is no difference in computational complexity between the strictly hierarchical and the unconstrained factored architectures. Recall, however, that the efficiency gains of the hierarchical and factored controllers are obtained at the cost of a restricted policy space. Nevertheless, the graphs in the left column of Table 3 suggest that hierarchical/factored controllers can still find equally good policies when one exists. Factored controllers are generally the most robust. With a sufficient number of nodes, they find the best policies on all three problems. Note that factored and hierarchical controllers need at least as many nodes in the base layer as there are actions in order to represent a policy that uses all actions. This explains why hierarchical and factored controllers with fewer than 4 base nodes (for chain-of-chains) and 7 base nodes (for handwashing and cheese-taxi) do poorly. The optimization of flat controllers tends to get stuck in local optima if too many nodes are used. Comparing the unconstrained factored architecture with the hierarchical one, we find that the additional constraints in the hierarchical controller make the optimization problem harder although there are fewer parameters to optimize. As a result, EM gets stuck more often in local optima.

³ Factored controllers are hierarchical controllers where the restrictions imposed by the $E_t$ variables are removed.

[Figure 4: Hierarchical controller learnt for the chain-of-chains. The diamond indicates an exit node, for which $\hat p_{e^0|n^0} = 1$.]

We also examine whether the learnt hierarchies make intuitive sense. Good policies for the cheese-taxi and handwashing problems can often be represented hierarchically; however, the hierarchical policies found did not match the hierarchies expected by the authors. Since these are non-trivial problems for which there may be many ways to represent good policies in a hierarchical fashion that is not intuitive, we designed the chain-of-chains problem, which is much simpler to analyze. The optimal policy of this problem consists of executing the same chain of n actions n times, followed by a submit action to earn the only reward. The optimal policy requires n² + 1 nodes for a flat controller and n + 1 nodes at each level for a hierarchical controller. For n = 3, ML found a hierarchical controller with 4 nodes at each level, illustrated in Fig. 4. The controller starts in node 0. Nodes at level 1 are abstract and descend into concrete nodes at level 0 by following the dashed
edges. Control is returned to level 1 when an end node (denoted by a diamond) is reached. Here, the optimal policy is to do A-B-C three times followed by D. Hence a natural hierarchy would abstract A-B-C and D into separate subcontrollers. While the controller in Fig. 4 is not completely optimal (the vertical transition from abstract node 0 should have probability 1 of reaching node A), it found an equivalent, but less intuitive, abstraction by having subcontrollers that do A-B-C and D-A-B-C. This suggests that for real-world problems there will be many valid abstractions that are not easily interpretable by humans, and the odds that an automated procedure finds an intuitive hierarchy without any additional guidance are slim.

6 Conclusion

The key advantage of maximum likelihood is that it can exploit the factored structure of a controller architecture. This facilitates hierarchy discovery when the hierarchical structure of the controller is encoded into a corresponding dynamic Bayesian network (DBN). Our complexity analysis and the empirical run time analysis confirm the favorable scaling. In particular, we solved problems like handwashing and cheese-taxi that could not be solved with the previous approaches in [4]. Compared to flat controllers, factored controllers are faster to optimize and less sensitive to local optima when they have many nodes. Our current implementation does not exploit any factored structure in the state, action and observation spaces; however, we envision that a factored implementation would naturally scale to large factored POMDPs.

For the chain-of-chains problem, maximum likelihood finds a valid hierarchy. For other problems like handwashing, there might be many hierarchies and the one found by our algorithm is usually hard to interpret. We cannot expect our method to find a hierarchy that is human readable. Interestingly, although the strictly hierarchical architectures have fewer parameters to optimize, they seem to be more susceptible to local optima than a factored but otherwise unconstrained controller. Future work will investigate various heuristics to escape local optima during optimization.
In this paper we made explicit assumptions about the structure: we fixed the structure of the DBN in advance to mimic a strict hierarchy or a level-wise factorization, and we fixed the number of nodes in each level. However, the DBN framework allows us to build on existing methods for structure learning of graphical models. A promising extension would be to use such structure learning techniques to optimize the factored structure of the controller. Since the computational complexity of evaluating (training) a single structure is reasonable, techniques like MCMC could sample and evaluate a variety of structures. This variety might also help to circumvent local optima, which are currently the dominant limitation of our approach.

Acknowledgments

Part of this work was completed while Charlin was at the University of Waterloo. Toussaint acknowledges support by the German Research Foundation (DFG), Emmy Noether fellowship TO 409/1-3. Poupart and Charlin were supported by grants from the Natural Sciences and Engineering Research Council of Canada, the Canada Foundation for Innovation and the Ontario Innovation Trust.

References

[1] C. Amato, D. Bernstein, and S. Zilberstein. Solving POMDPs using quadratically constrained linear programs. In IJCAI, pages 2418-2424, 2007.
[2] D. Braziunas and C. Boutilier. Stochastic local search for POMDP controllers. In AAAI, pages 690-696, 2004.
[3] A. Cassandra. Exact and approximate algorithms for partially observable Markov decision processes. PhD thesis, Brown University, Dept. of Computer Science, 1998.
[4] L. Charlin, P. Poupart, and R. Shioda. Automated hierarchy discovery for planning in partially observable environments. In NIPS, pages 225-232, 2006.
[5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[6] E. Hansen. An improved policy iteration algorithm for partially observable MDPs. In NIPS, 1998.
[7] E. Hansen and R. Zhou. Synthesis of hierarchical finite-state controllers for POMDPs. In ICAPS, pages 113-122, 2003.
[8] J. Hoey, A. von Bertoldi, P. Poupart, and A. Mihailidis. Assisting persons with dementia during handwashing using a partially observable Markov decision process. In ICVS, 2007.
[9] N. Meuleau, L. Peshkin, K.-E. Kim, and L. Kaelbling. Learning finite-state controllers for partially observable environments. In UAI, pages 427-436, 1999.
[10] K. Murphy and M. Paskin. Linear time inference in hierarchical HMMs. In NIPS, 2001.
[11] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.
[12] J. Pineau. Tractable Planning Under Uncertainty: Exploiting Structure. PhD thesis, Robotics Institute, Carnegie Mellon University, 2004.
[13] J. Pineau, G. Gordon, and S. Thrun. Policy-contingent abstraction for robust robot control. In UAI, pages 477-484, 2003.
[14] P. Poupart and C. Boutilier. Bounded finite state controllers. In NIPS, 2003.
[15] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In UAI, 2004.
[16] P. Stone and M. Veloso. A layered approach to learning client behaviors in the RoboCup soccer server. Applied Artificial Intelligence, 12:165-188, 1998.
[17] G. Theocharous, K. Murphy, and L. Pack Kaelbling. Representing hierarchical POMDPs as DBNs for multi-scale robot localization. In ICRA, pages 1045-1051. IEEE, 2004.
[18] M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic inference for solving (PO)MDPs. Technical Report EDI-INF-RR-0934, School of Informatics, University of Edinburgh, 2006.