Automatic Discovery and Transfer of MAXQ Hierarchies

Neville Mehta (mehtane@eecs.oregonstate.edu), Soumya Ray (sray@eecs.oregonstate.edu), Prasad Tadepalli (tadepall@eecs.oregonstate.edu), Thomas Dietterich (tgd@eecs.oregonstate.edu)
Oregon State University, Corvallis, OR 97331, USA

Abstract

We present an algorithm, HI-MAT (Hierarchy Induction via Models And Trajectories), that discovers MAXQ task hierarchies by applying dynamic Bayesian network models to a successful trajectory from a source reinforcement learning task. HI-MAT discovers subtasks by analyzing the causal and temporal relationships among the actions in the trajectory. Under appropriate assumptions, HI-MAT induces hierarchies that are consistent with the observed trajectory and have compact value-function tables employing safe state abstractions. We demonstrate empirically that HI-MAT constructs compact hierarchies that are comparable to manually-engineered hierarchies and facilitate significant speedup in learning when transferred to a target task.

1. Introduction

Scaling up reinforcement learning (RL) to large domains requires leveraging the structure in these domains. Hierarchical reinforcement learning (HRL) provides mechanisms through which domain structure can be exploited to constrain the value function and policy space of the learner, and hence speed up learning (Sutton et al., 1999; Dietterich, 2000; Andre & Russell, 2002).

In the MAXQ framework, a task hierarchy is defined (along with relevant state variables) for representing the value function of the overall task. This allows for decomposed subtask-specific value functions that are easier to learn than the global value function. Automated discovery of such task hierarchies is compelling for at least two reasons. First, it avoids the significant human effort in engineering the task-subtask structural decomposition, along with the associated state abstractions and subtask goals. Second, if the same hierarchy is useful in multiple domains, it leads to significant transfer of learned structural knowledge from one domain to the other, so the cost of learning can be amortized over several domains.

Several researchers have focused on the problem of automatically inducing temporally extended actions and task hierarchies (Thrun & Schwartz, 1995; McGovern & Barto, 2001; Menache et al., 2001; Pickett & Barto, 2002; Hengst, 2002; Şimşek & Barto, 2004; Jonsson & Barto, 2006).

In this paper, we focus on the asymmetric knowledge transfer setting where we are given access to solved source RL problems. The objective is to derive useful biases from these solutions that could speed up learning in target problems. We present and evaluate our approach, HI-MAT, for learning MAXQ hierarchies from a solved RL problem. HI-MAT applies dynamic Bayesian network (DBN) models to a single successful trajectory from the source problem to construct a causally annotated trajectory (CAT). Guided by the causal and temporal associations between actions in the CAT, HI-MAT recursively parses it and defines MAXQ subtasks based on each discovered partition of the CAT. We analyze our approach both theoretically and empirically. Our theoretical results show that, under appropriate conditions, the task hierarchies induced by HI-MAT are consistent with the observed trajectory, and possess compact value-function tables that are safe with respect to state abstraction. Empirically, we show that (1) using a successful trajectory can result in more compact task decompositions than when using only DBNs, (2) our induced hierarchies are comparable to manually-engineered hierarchies on target RL tasks, and MAXQ-learning converges significantly faster than flat Q-learning on those tasks, and (3) transferring hierarchical structure from a source task can speed up learning in target RL tasks where transferring value functions cannot.
2. Background and Related Work

We briefly review the MAXQ framework (Dietterich, 2000). This framework facilitates learning separate value functions for subtasks, which can be composed to compute the value function for the overall semi-Markov decision process (SMDP) with state space S and action space A. The task hierarchy H is represented as a directed acyclic graph called the task graph, and reflects the task-subtask relationships. Leaf nodes are the primitive subtasks corresponding to A. Each composite subtask Ti defines an SMDP with parameters ⟨Xi, Si, Gi, Ci⟩, where Xi is the set of relevant state variables, Si ⊆ S is the set of admissible states, Gi is the termination/goal predicate, and Ci is the set of child tasks of Ti. T0 represents the root task. Ti can be invoked in any state s ∈ Si; it terminates when s ∈ Gi; and (s, a) is called an exit if Pr(s′ | s, a) > 0 for some s′ ∈ Gi. The set Si is defined using a projection function that maps a world state to an abstract state defined by a subset of the state variables. A safe abstraction function only merges world states that have identical values.

The local policy for a subtask Ti is a mapping πi : Si → Ci. A hierarchical policy for the overall task is an assignment of a local policy to each Ti. A hierarchically optimal policy for a given MAXQ graph is a hierarchical policy that has the best possible expected total reward. A hierarchical policy is recursively optimal if the local policy for each subtask is optimal given that all its child tasks are in turn recursively optimal.
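For concreteness, the subtask tuple ⟨Xi, Si, Gi, Ci⟩ can be represented roughly as follows. This is a minimal sketch with illustrative names; it is not code from any MAXQ or HI-MAT implementation.

```python
# Minimal sketch of a MAXQ subtask tuple <X_i, S_i, G_i, C_i>; names are illustrative only.
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Union

State = Dict[str, object]  # a world state as a variable -> value mapping

@dataclass
class Task:
    relevant_vars: FrozenSet[str]          # X_i: variables the subtask's value function may depend on
    admissible: Callable[[State], bool]    # membership test for S_i, the admissible states
    goal: Callable[[State], bool]          # G_i: termination/goal predicate
    children: List[Union["Task", str]] = field(default_factory=list)  # C_i: subtasks or primitive actions
```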
HEXQ (Hengst, 2002) and VISA (Jonsson & Barto, 2006) are two existing approaches to learning task hierarchies. These methods define subtasks based on the changing values of state variables. HEXQ employs a heuristic that orders state variables based on the frequencies of change in their values to induce an exit-option hierarchy. The most frequently-changing variable is associated with the lowest-level subtask, and the least frequently-changing variable with the root. VISA uses DBNs to analyze the influence of state variables on one another. The variables are partitioned such that there is an acyclic influence relationship between the variables in different clusters (strongly connected components). Here, state variables that influence others are associated with lower-level subtasks. VISA provides a more principled rationale for HEXQ's heuristic: a variable used to satisfy a precondition for setting another variable through an action typically changes more frequently than the other variable. A key difference between VISA and HI-MAT is the use of a successful trajectory in addition to the DBNs. In Section 5.1, we provide empirical evidence that this allows HI-MAT to learn hierarchies that are exponentially more compact than those of VISA.

The algorithm developed by Marthi et al. (2007) takes a search-based approach to generating hierarchies. Flat Q-value functions are learned for the source domain and are used to sample trajectories. A greedy top-down search is conducted for the best-scoring hierarchy that fits the trajectories. The set of relevant state variables for each task is determined through statistical tests on the Q-values of different states with differing values of the variables. In contrast to this approach, HI-MAT relies less on direct search through the hierarchy space, and more on causal analysis of a trajectory based on DBN models.

3. Discovering MAXQ Hierarchies

In this work, we consider MDPs where the agent is solving a known conjunctive goal. This is a subset of the class of stochastic shortest-path MDPs. In such MDPs, there is a goal state (or a set of goal states), and the optimal policy for the agent is to reach such a state as quickly as possible. We assume that we are given factored DBN models for the source MDP where the conditional probability distributions are represented as trees (CPTs). Further, we are given a successful trajectory that reaches the goal in the source MDP. With this in hand, our objective is to automatically induce a MAXQ hierarchy that can suitably constrain the policy space when solving a related target problem, and therefore achieve faster convergence in the target problem. This is achieved via recursive partitioning of the given trajectory into subtasks using a top-down parse guided by backward chaining from the goal. We use the DBNs along with the trajectory to define the termination predicate, the set of subtasks, and the relevant abstraction for each MAXQ subtask.

We use the Taxi domain (Dietterich, 2000) to illustrate our procedure. Here, a taxi has to transport a passenger from a source location to a destination location within a 5 × 5 grid-world. The pass.dest variable is restricted to one of four special locations on the grid, denoted by R, G, B, and Y; pass.loc can be R, G, B, Y, or in-taxi; and taxi.loc can be any one of the 25 cells. The goal of pass.loc = pass.dest is achieved by taking the passenger to its intended destination. Besides the four navigation actions, a successful Pickup changes pass.loc to in-taxi, and a successful Putdown changes pass.loc from in-taxi to the value of pass.dest.

Figure 1. A sample CAT for the Taxi domain.

3.1. Definitions and Notation

We say that a variable v is relevant to an action a if the reward and transition dynamics for a either check or change v; it is irrelevant otherwise. The set of trajectory-relevant (t-relevant) variables of a, a subset of the relevant variables, are the variables that were actually checked or changed when a was executed in the trajectory. A causal edge, written a →v b, connects a to another action b (b following a in the trajectory) iff v is t-relevant to both a and b, and irrelevant to all actions in between. A sink edge, a →v End, connects a with a dummy End action iff v is relevant to a and irrelevant to all subsequent actions before the final goal state; this holds analogously for a source edge Start →v a. A causally annotated trajectory (CAT) is the original trajectory annotated with all the causal, source, and sink edges.
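As a concrete sketch of how a CAT might be assembled, the fragment below assumes each action in the trajectory comes annotated with its t-relevant variables; the `trelevant` list is hypothetical, and, as a simplification of the definition above, "in between" is judged by t-relevance rather than full relevance.

```python
# Sketch of CAT construction. Index 0 denotes the dummy Start action and
# index len(actions)+1 the dummy End action; trelevant[t] is the (hypothetical)
# set of variables actually checked or changed by the (t+1)-th action.
def build_cat(actions, trelevant, variables):
    edges = []  # tuples (source index, target index, variable label)
    n = len(actions)
    for v in variables:
        last = 0  # most recent index (Start = 0) at which v was t-relevant
        for t in range(1, n + 1):
            if v in trelevant[t - 1]:
                edges.append((last, t, v))   # causal edge (a source edge when last == 0)
                last = t
        edges.append((last, n + 1, v))       # sink edge into End
    return edges
```

In the Taxi CAT of Figure 1, for instance, taxi.loc would label the edges between consecutive navigation actions, while pass.loc would link Start, Pickup, and Putdown.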
Moreover, the CAT is preprocessed to remove any cycles present in the original trajectory (failed actions, such as an unsuccessful Pickup, introduce cycles of unit length). A sample CAT for Taxi is shown in Figure 1. Given a →v b, the phrase "literal on a causal edge" refers to a formula of the form v = V, where V is the value taken by v in the state before b is executed.

We define DBN-closure(v) as the set of variables that influence v, computed recursively as follows. From the action DBNs, add all variables that appear in internal nodes in the CPTs for the dynamics of v. Next, for each added variable u, union DBN-closure(u) with this set, repeating until no new variables are added. Similarly, the set DBN-closure(reward) contains all variables that influence the reward function of the MDP. The set DBN-closure(fluent) is the union of the DBN-closures of all variables in the fluent. For example, DBN-closure(goal) is the set of all variables that influence the goal fluent. The CAT ignores all variables v ∉ DBN-closure(goal), namely, those variables that never affect the goal conjunction.
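A minimal sketch of this closure computation, assuming a hypothetical `parents(action, var)` accessor that returns the variables appearing in the internal nodes of var's CPT for that action:

```python
# Sketch of DBN-closure(v): all variables that (recursively) influence v across the action DBNs.
def dbn_closure(var, actions, parents):
    closure, frontier = set(), [var]
    while frontier:
        v = frontier.pop()
        for a in actions:
            for p in parents(a, v):
                if p not in closure:
                    closure.add(p)
                    frontier.append(p)
    return closure

def dbn_closure_fluent(fluent_vars, actions, parents):
    # Union of the closures of all variables mentioned in the fluent (e.g., the goal).
    # We also keep the fluent's own variables so that goal variables are never pruned
    # from the CAT (an assumption about the intended filtering, not stated explicitly).
    result = set(fluent_vars)
    for v in fluent_vars:
        result |= dbn_closure(v, actions, parents)
    return result
```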
3.2. The HI-MAT Algorithm

Given a CAT and the MDP's goal predicate (or, recursively, the current subtask's goal predicate), the main loop of the hierarchy-induction procedure is shown in Algorithm 1. The algorithm first checks whether either of two stopping criteria is satisfied (lines 2 & 4): the trajectory contains only a single primitive action, or it consists of actions whose relevances are identical. (In the latter case, any further partitioning would yield subtasks with the same abstraction as the parent.) Otherwise, it initializes the set of "unsolved" goals to the set of literals in the goal conjunction (line 9). It then selects any unsolved goal u and finds the corresponding subtask (line 12). Algorithm 2 returns indices i, j marking the boundaries of the subtask in the CAT. If this CAT segment is nontrivial (neither just the initial state nor the whole trajectory), it is stored (line 17), and the literals on causal edges that enter it from earlier in the trajectory are added to the unsolved goals (line 18). This ensures that the algorithm parses the entire trajectory, barring redundant actions. If the trajectory segment is equal to the entire trajectory, this implies that the trajectory achieves only the literal u after the ultimate action. In this case, the trajectory is split into two segments: one segment contains the prefix of the ultimate action an, with the preconditions of an forming the goal literals for this segment (line 14); the other segment contains only the ultimate action an (line 15). CAT scanning is repeated until all subgoal literals are accounted for. The only way trajectory segments can overlap is if they have identical boundaries and the ultimate action achieves the literals of all these segments. In this case, the segments are merged (line 23); merging replaces the duplicative segments with one that is assigned the conjunction of the subgoal literals. The HI-MAT algorithm thus partitions the CAT into unique segments, each achieving a single literal or a conjunction of literals due to merging. It is called recursively on each element of the partition (line 27). It can be proved that the set of subtasks output by the algorithm is independent of the order in which the literal u is picked (line 11).

Algorithm 1 HI-MAT
Input: CAT Ω, goal predicate G.
Output: Task ⟨X, S, G, C⟩; X is the set of relevant variables, S is the set of non-terminal states, G is the goal predicate, C is the set of child actions.
1: n ← number of actions in Ω, excluding Start and End
2: if n = 1 then  // Single action
3:    return ⟨RelVars(Ω), S, true, a1⟩
4: else if CheckRelVars(Ω) then  // Same relevance
5:    S ← all states that reach G via Actions(Ω)
6:    return ⟨RelVars(Ω), S, G, Actions(Ω)⟩
7: end if
8: Π ← ∅  // Trajectory segments
9: U ← Literals(G)
10: while U ≠ ∅ do
11:    Pick u ∈ U
12:    (i, j, u) ← CAT-Scan(Ω, u)
13:    if i = 1 ∧ j = n then
14:       Π ← Π ∪ {(1, n−1, v) : v ∈ Precondition(an)}
15:       Π ← Π ∪ {(n, n, u)}
16:    else if j > 0 then  // Last segment action ≠ Start
17:       Π ← Π ∪ {(i, j, u)}
18:       U ← U ∪ {v : ∃k < i, ∃l, ak →v al, i ≤ l ≤ j}
19:    end if
20:    U ← U − {u}
21: end while
22: while ∃(i, j, u1), (i, j, u2) ∈ Π do
23:    Π ← (Π − {(i, j, u1), (i, j, u2)}) ∪ {(i, j, u1 ∧ u2)}
24: end while
25: C ← ∅
26: for t ∈ Π do
27:    ⟨Xt, St, Gt, Ct⟩ ← HI-MAT(Extract(Ω, ti, tj), tu)
28:    C ← C ∪ {⟨Xt, St, Gt, Ct⟩}
29: end for
30: X ← RelVars(Ω) ∪ Variables(G)
31: S ← all states that reach G via C
32: return ⟨X, S, G, C⟩

3.2.1. Subtask Detection

Given a literal, a subtask is determined by finding the set of temporally contiguous actions that are closed with respect to the causal edges in the CAT such that the final action achieves the literal. The idea is to group all actions that contribute to achieving the specific literal being considered. This procedure is shown in Algorithm 2.

Algorithm 2 CAT-Scan
Input: CAT Ω, literal u.
Output: (i, j, u); i is the start index, j is the end index.
1: Set j such that aj →u End
2: i ← j − 1
3: while i > 0 and ∀v ∀k (ai →v ak ⇒ k ≤ j) do
4:    i ← i − 1
5: end while
6: return (i + 1, j, u)

As before, when considering causal edges in line 3, we can ignore all causal edges that are labeled with variables not in the DBN-closure of any variable in the current unsolved goal list. Because of the way we construct the CAT, we can show that this procedure will always stop before adding an action which has a relevant variable that is not relevant to the last action in the partition. Note that the temporal contiguity of the actions we assign to a subtask is required by the MAXQ-style execution of a policy: a hierarchical MAXQ policy cannot interrupt an unterminated subtask, start executing a sibling subtask, and then return to executing the interrupted subtask.

3.2.2. Termination Predicate

After finding the partition that constitutes a subtask, we assign a set of child tasks and a termination predicate to it. To assign the termination condition to a subtask, we consider the relational test(s) tu in the action and reward DBNs involving the variable u on the causal edge leaving the subtask (line 27 of Algorithm 1). When a subtask's relational termination condition involves other variables not already in the abstraction, these variables are added to the state abstraction (line 30), effectively creating a parameterized subtask. For example, consider the navigation subtask that terminates when taxi.loc = pass.dest in the Taxi domain. The abstraction for this subtask already involves taxi.loc. However, pass.dest appearing in the relational test implies that pass.dest behaves like a parameter for this subtask.

3.2.3. Action Generalization

To determine whether the set of primitive actions available to a subtask should be expanded, we follow a bottom-up procedure (not shown in Algorithm 1). We start with subtasks that have only primitive actions as children. We create a merged DBN structure for such a subtask T using the incorporated primitive actions. The merged DBN represents the possible variable effects after any sequence of these primitive actions. Next, for each primitive action that we did not see in this trajectory, we consider the subgraph of its DBN that only involves the variables relevant to T. If this is a subgraph of the merged DBN of T, we add this action to the set of actions available to T. The rationale here is that the added action has effects similar to the actions we observed in the trajectory, and it does not increase the set of relevant variables for T. For example, if the navigation actions used in the observed trajectory consisted only of North and East, this procedure would also add South and West to the available actions for this subtask. When considering subtasks that have non-primitive children, we only consider adding actions that have not been added to any of the non-primitive children.

Given the termination predicate and the generalized set of actions, the set of relevant variables for a subtask is the union of the set of relevant variables of the merged DBN (described above) and the variables appearing in the termination predicate (line 30). Computing the relevant variables is similar to explanation-based reinforcement learning (Tadepalli & Dietterich, 1997), except that here we care only about the set of relevant variables and not their values. Moreover, the relevant variables are computed over a set rather than a sequence of actions.
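The bottom-up generalization test of Section 3.2.3 can be sketched as follows, representing each action's DBN (hypothetically) as a set of (parent, child) edges over state variables; this is an illustration under those assumptions, not the paper's implementation.

```python
# Sketch of action generalization for a subtask with only primitive children.
# `seen` and `unseen` map action names to sets of DBN edges (parent, child).
def generalize_actions(relevant_vars, seen, unseen):
    def restrict(edges):
        # Subgraph over the subtask's relevant variables only.
        return {(p, c) for (p, c) in edges if p in relevant_vars and c in relevant_vars}
    merged = set()                      # merged DBN of the actions observed in the trajectory
    for edges in seen.values():
        merged |= restrict(edges)
    # An unobserved action is added if its restricted DBN is a subgraph of the merged DBN,
    # i.e., it has no relevant effects beyond those already representable in the subtask.
    return [name for name, edges in unseen.items() if restrict(edges) <= merged]
```

Under these assumptions, adding South and West to a navigation subtask learned from a North/East-only trajectory falls out of exactly this containment check.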
4. Theoretical Analysis

In this section, we establish certain theoretical properties of the hierarchies induced by the HI-MAT algorithm. We consider a factored SMDP with state space S = Dx1 × ... × Dxk, where each Dxi is the domain of variable xi. We assume that our DBN models have the following property.

Definition 1. A DBN model is maximally sparse if, for any y ∈ Y, where Y is the set of parents of some node x (which represents either a state variable or the reward node), and Y′ = Y − {y}, there exist y1, y2 ∈ Dy such that Pr(x | Y′, y = y1) ≠ Pr(x | Y′, y = y2).

Maximal sparseness implies that the parents of a variable have non-trivial influences on it; no parent can be removed without affecting the next-state distribution.

A task hierarchy H = ⟨V, E⟩ is a directed acyclic graph, where V is a set of task nodes and E represents the task-subtask edges of the graph. Each task node Ti ∈ V is defined as in Section 2. A trajectory-task pair ⟨τ, Ti⟩, where τ = s1, a1, ..., sn, an, sn+1 and Ti = ⟨Xi, Si, Gi, Ci⟩, is consistent with H if Ti ∈ V and {s1, ..., sn} ⊆ Si. If Ti is a primitive subtask, then n = 1 and Ci = a1. If Ti is not primitive, then {s1, ..., sn} ∩ Gi = ∅, sn+1 ∈ Gi, and there exist trajectory-task pairs ⟨τj, Tj⟩ consistent with H, where τ is a concatenation of τ1, ..., τp and T1, ..., Tp ∈ Ci. A trajectory τ is consistent with a hierarchy H if ⟨τ, T0⟩ is consistent with H.

Definition 2. A trajectory s1, a1, ..., sn, an, sn+1 is non-redundant if no subsequence of the action sequence in the trajectory, a1, ..., an, can be removed such that the remaining sequence still achieves the goal starting from s1.

Theorem 1. If a trajectory τ is non-redundant, then HI-MAT produces a task hierarchy H such that τ is consistent with H.

Proof sketch: Let τ = s1, a1, ..., sn, an, sn+1 be the trajectory. The algorithm extracts the conjunction of literals that are true in sn+1 (and not before), and assigns it to the goal, Gi. Such literals must exist since, otherwise, some suffix of the trajectory could be removed while the rest still achieves the goal, violating the property of non-redundancy. Since the set Si is set to all states that do not satisfy Gi, the condition that all states s1, ..., sn are in Si is satisfied. Whenever the trajectory is partitioned into a sequence of sub-trajectories, each sub-trajectory is associated with a conjunction of goal literals achieved by that sub-trajectory. Hence, the above argument applies recursively to each such sub-trajectory.

Definition 3. A hierarchy H is safe with respect to the DBN models M if, for any trajectory-task pair ⟨τ, Ti⟩ consistent with H, where Ti = ⟨Xi, Si, Gi, Ci⟩, the total expected reward during the trajectory τ is only a function of the values of x ∈ Xi in the starting state of τ.

The above definition says that the state variables in each task are sufficient to capture the value of any trajectory consistent with the sub-hierarchy rooted at that task node.

Theorem 2. If the procedure HI-MAT produces a task hierarchy H from τ and the DBN models M, then H is safe with respect to M. Further, if the DBN models are maximally sparse, then for any hierarchy H′ which is consistent with τ and safe with respect to M, and any T′i = ⟨X′i, S′i, G′i, C′i⟩ in H′, there exists Ti = ⟨Xi, Si, Gi, Ci⟩ in H such that Xi ⊆ X′i.

Proof sketch: By the construction procedure, in any segment of the trajectory composed of primitive actions under a subtask Ti, all primitive actions check or set only the variables in Xi. Thus, changing any other variables in the initial state s of the segment, yielding s′, does not change the effects of these actions according to the DBN models. Similarly, all immediate rewards in the trajectory are also functions of the variables in Xi. Hence, the total accumulated reward and the probability of the trajectory only depend on Xi, and the hierarchy produced is safe with respect to M.
Suppose that H′ is a consistent hierarchy which is safe with respect to M. Let ai be the last action in the trajectory τi corresponding to the subtask Ti in H. By consistency, there must be some task T′i in H′ that matches up with ai. Recall that Xi includes only those variables checked and set by ai to achieve the goal Gi. We claim that the abstraction variables X′i of T′i must include Xi. If this is not the case, then, by maximal sparseness, there is a variable y in Xi − X′i and some values y1 and y2 such that the probabilities of the next state or reward differ based on whether y = y1 or y = y2. Hence, H′ would not be safe, leading to a contradiction.

Corollary 1. If the DBN models are maximally sparse, then the maximum size of the value-function table for any task in the hierarchy produced by HI-MAT is the smallest over all safe hierarchies which are consistent with the trajectory.

Proof: Direct consequence of the second part of the previous theorem.

The significance of the above corollary lies in the fact that the size of the value-function table is exponential in the number of variables ni = |Xi| in the abstraction of task Ti. If all features are binary and there are t tasks, then the total number of values for the value-function tables is O(t 2^nmax). Since the hierarchy is a tree with the primitive actions at the leaves, the number of subtasks is bounded by 2l, where l is the length of the trajectory. Hence, we can claim that the number of parameters needed to fully specify the value-function tables in our hierarchy is at most O(l) times that of the best possible.
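As a purely illustrative instance of this bound (the numbers are hypothetical, not measured in our experiments), with binary features the total table size satisfies

\sum_{i=1}^{t} 2^{n_i} \;\le\; t \cdot 2^{n_{\max}}, \qquad \text{e.g., } t = 12,\; n_{\max} = 4 \;\Rightarrow\; 12 \times 2^{4} = 192 \text{ entries,}

compared with 2^10 = 1024 abstract states for a flat value function over, say, ten binary variables.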
Our analysis does not address state abstractions arising from the so-called funnel property of subtasks, where many starting states result in a few terminal states. Funnel abstractions permit the parent task to ignore variables that, while relevant inside the child task, do not affect the terminal state. Nevertheless, our analysis captures some of the key properties of our algorithm, including consistency with the trajectory, safety, and minimality, and sheds some light on its effectiveness.

Figure 2. Task hierarchies for the modified Bitflip domain: (a) the VISA hierarchy, where task labels are the exit conditions and dash-dot arrows indicate exit options; (b) the HI-MAT hierarchy, where task labels are the termination conditions.

Figure 3. Performance of Q, VISA, and HI-MAT in the 7-bit modified Bitflip domain (averaged over 20 runs).

5. Empirical Evaluation

We test three hypotheses. First, we expect that employing a successful trajectory along with the action models will allow the HI-MAT algorithm to induce task hierarchies that are much more compact than (or at least as compact as) those induced from the action models alone. Second, in a transfer setting, we expect that the hierarchies induced by HI-MAT will speed up convergence to the optimal policy in related target problems. Finally, we expect that the HI-MAT hierarchies will be applicable to, and speed up learning in, RL problems that differ enough from the source problems that value functions either do not transfer or lead to poor transfer.

5.1. Contribution of the Trajectory

To highlight our first hypothesis, we design a modified Bitflip domain (Diuk et al., 2006) as follows. The state is represented by n bits, b0 b1 ... bn-1. There are n actions, denoted by Flip(i). Flip(i) toggles bi if bi-1 is set and the parity across bits b0, ..., bi-1 is even when i is even (odd otherwise); if not, it resets the bits b0, ..., bi. All bits are reset in the initial state, and the goal is to set all bits.
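To make the dynamics concrete, here is a minimal sketch of Flip(i) as described above; the behavior of Flip(0), whose precondition is not spelled out in the text, is assumed to be an unconditional toggle.

```python
# Sketch of the modified Bitflip dynamics. bits[k] holds b_k (True = set).
# Assumption (not stated in the text): Flip(0) always toggles b_0.
def flip(bits, i):
    b = list(bits)
    parity_even = sum(b[:i]) % 2 == 0          # parity across b_0, ..., b_{i-1}
    ok = (i == 0) or (b[i - 1] and (parity_even if i % 2 == 0 else not parity_even))
    if ok:
        b[i] = not b[i]                        # toggle b_i
    else:
        for k in range(i + 1):                 # reset b_0, ..., b_i
            b[k] = False
    return b

state = [False] * 7                            # initial state: all bits reset; goal: all(state) is True
```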
We ran both VISA and HI-MAT in this domain with n = 7, and compared the induced hierarchies (Figure 2). We observe that VISA constructs an exponentially sized hierarchy even with subtask merging activated within VISA. There are two reasons for this. First, VISA relies on the full action set to construct its causal graph, and does not take advantage of any context-specific independence among its variables that may arise when the agent acts according to certain policies. Specifically, for this domain, the causal graph constructed from DBN analysis has only two strongly connected components (SCCs): one partition has {b0, ..., bn-2}, and the other has {bn-1}. This SCC cannot be further decomposed using only information from the DBNs. Second, VISA creates exit options for all strongly connected components that transitively influence the reward function, whereas only a few of these may actually be necessary to solve the problem. Specifically, for this problem, VISA creates an exit condition for any instantiation that satisfies parity(b0, ..., bn-2) ∧ bn-2 = 1, resulting in the exponential number of subtasks shown in Figure 2(a). The successful trajectory provided to HI-MAT achieves the goal by setting the bits from left to right, and results in the hierarchy in Figure 2(b). The performance results are shown in Figure 3. VISA's hierarchy converges even more slowly than the basic Q-learner because the root has O(2^n) children as opposed to O(n). This domain has been engineered to highlight the case where access to a successful trajectory allows for significantly more compact hierarchies than without; we expect that access to a solved instance will usually improve the compactness of the resulting hierarchy.

5.2. Transfer of the Task Hierarchy

To test our remaining hypotheses, we apply the transfer setting to two domains: Taxi and the real-time strategy game Wargus. The Taxi domain has been described in Section 3. The source and target problems in Taxi differ only in the wall configurations; the passenger sources and destinations are the same. This is engineered to allow value-function transfer to occur. For Wargus, we consider the resource collection problem. Here, the agent has units called peasants that can harvest gold and wood from goldmines and forests, respectively, and deposit them at a townhall. The goal is to reach a predetermined quota of gold and wood. Since the HI-MAT approach does not currently generalize to termination conditions involving numeric predicates, the state representation of the domain replaces the actual quota variables with Boolean variables that are set when the requisite quotas of gold and wood are met. We consider target problems whose specifications are scaled up from those of the source problems, including the number of peasants, goldmines, and forests, and the size of the map. In this domain, coordination does not affect the policy significantly. Thus, in the target maps, we learn a hierarchical policy for the peasants using a shared hierarchy structure without coordination (Mehta & Tadepalli, 2005).

In each case, we report the total reward received as a function of the number of episodes, averaged over multiple trials. We compare three basic approaches: (1) non-hierarchical Q-learning (Q), (2) MAXQ-learning applied to a hierarchy manually engineered for each domain (Manual), and (3) MAXQ-learning applied to the HI-MAT hierarchy induced for each domain (HI-MAT). The HI-MAT algorithm first solves the source problem using flat Q-learning, and generates a successful trajectory from it. In Taxi, we also show the performance of initializing the value-function tables with values learned from the source problem; these curves are suffixed with the phrase "with value". In Wargus, we include the performance of VISA. The results of these experiments are shown in Figures 4 and 5.

Figure 4. Performance in the Taxi domain (averaged over 20 runs). Source and target problems differ only in the configuration of the grid walls.

Figure 5. Performance in the Wargus domain (averaged over 10 runs). Source: 25 × 25 grid, 1 peasant, 2 goldmines, 2 forests, 1 townhall, 100 units of gold, 100 units of wood. Target: 50 × 50 grid, 3 peasants, 3 goldmines, 3 forests, 1 townhall, 300 units of gold, 300 units of wood.
Although the target problems in Taxi allow value-function transfer to occur, they are still different enough that the agent has to "unlearn" the old policy. This leads to negative transfer, evidenced by the fact that transferring value functions leads to worse rates of convergence to the optimal policy than transferring just the hierarchy structure with uninitialized policies. This indicates that transferring structural knowledge via the task-subtask decomposition can be superior to transferring value functions, especially when the target problem differs significantly in terms of its optimal policy. In Wargus, the difference between the source and target problems renders direct value-function transfer impossible, even though the hierarchy structure still transfers. In Taxi, we observe that MAXQ-learning on HI-MAT's hierarchy converges to the optimal policy at a rate comparable to that of the manually-engineered hierarchy. However, in Wargus, HI-MAT's hierarchy converges faster than the manually-engineered one because, by analyzing the solved source problem, it is able to find stricter termination conditions for each subtask. Consequently, reducing the policy space in the target problem leads to a greater speed-up in learning than reducing the number of value parameters via subtask sharing, as in the manually-engineered hierarchy. This improved rate of convergence is in spite of the fact that HI-MAT does not currently merge subtly different instantiations of the same subtask, so there is room for further improvement. VISA's performance suffers initially due to a large branching factor at the root option (which directly includes all the navigation actions).
6. Conclusion

We have presented an approach to automatically inducing MAXQ hierarchies from solved RL problems. Given DBN models and an observed successful trajectory, our method analyzes the causal and temporal relationships between actions, and partitions the trajectory recursively into subtasks as long as the state abstraction improves. We show that the learned hierarchies are consistent, safe, and have compact value-function tables. Our empirical results indicate that using the observed trajectory can allow us to learn more compact hierarchies. Further, in a transfer setting, our hierarchies perform comparably to manually-engineered hierarchies, and improve the rate of convergence where direct policy transfer does not help.

We are currently working on extending the approach to handle disjunctive goals. Further, in order to ensure hierarchical optimality, we may need to deal with non-zero pseudo-rewards. In related work, we are also investigating methods that learn compact DBN models of the MDP. Finally, an important challenge for the future is to investigate the transfer scenario where the induced hierarchy may need to be altered structurally in order to apply effectively to the target problem.

Acknowledgments

We thank Mike Wynkoop and the anonymous reviewers for their input. We gratefully acknowledge the support of the Defense Advanced Research Projects Agency under DARPA grant FA8750-05-2-0249.

References

Andre, D., & Russell, S. (2002). State Abstraction for Programmable Reinforcement Learning Agents. AAAI (pp. 119–125).

Dietterich, T. (2000). Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Artificial Intelligence Research, 13, 227–303.

Diuk, C., Littman, M., & Strehl, A. (2006). A Hierarchical Approach to Efficient Reinforcement Learning in Deterministic Domains. AAMAS (pp. 313–319).

Hengst, B. (2002). Discovering Hierarchy in Reinforcement Learning with HEXQ. ICML (pp. 243–250).

Jonsson, A., & Barto, A. (2006). Causal Graph Based Decomposition of Factored MDPs. Journal of Machine Learning Research, 7, 2259–2301.

Marthi, B., Kaelbling, L., & Lozano-Perez, T. (2007). Learning Hierarchical Structure in Policies. NIPS Hierarchical Organization of Behavior Workshop.

McGovern, A., & Barto, A. (2001). Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density. ICML (pp. 361–368).

Mehta, N., & Tadepalli, P. (2005). Multi-Agent Shared Hierarchy Reinforcement Learning. ICML Rich Representations in Reinforcement Learning Workshop.

Menache, I., Mannor, S., & Shimkin, N. (2001). Q-Cut - Dynamic Discovery of Sub-Goals in Reinforcement Learning. ECML (pp. 295–306).

Pickett, M., & Barto, A. (2002). PolicyBlocks: An Algorithm for Creating Useful Macro-Actions in Reinforcement Learning. ICML (pp. 506–513).

Şimşek, Ö., & Barto, A. (2004). Using Relative Novelty to Identify Useful Temporal Abstractions in Reinforcement Learning. ICML (pp. 751–758).

Sutton, R., Precup, D., & Singh, S. (1999). Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, 112, 181–211.

Tadepalli, P., & Dietterich, T. (1997). Hierarchical Explanation-Based Reinforcement Learning. ICML (pp. 358–366).

Thrun, S., & Schwartz, A. (1995). Finding Structure in Reinforcement Learning. NIPS (pp. 385–392).