Apprenticeship Learning Using Linear Programming

Umar Syed (usyed@cs.princeton.edu)
Princeton University, Department of Computer Science, 35 Olden Street, Princeton, NJ 08540

Michael Bowling (bowling@cs.ualberta.ca)
University of Alberta, Department of Computing Science, Edmonton, Alberta, T6G 2E8 Canada

Robert E. Schapire (schapire@cs.princeton.edu)
Princeton University, Department of Computer Science, 35 Olden Street, Princeton, NJ 08540

Abstract

In apprenticeship learning, the goal is to learn a policy in a Markov decision process that is at least as good as a policy demonstrated by an expert. The difficulty arises in that the MDP's true reward function is assumed to be unknown. We show how to frame apprenticeship learning as a linear programming problem, and show that using an off-the-shelf LP solver to solve this problem results in a substantial improvement in running time over existing methods -- up to two orders of magnitude faster in our experiments. Additionally, our approach produces stationary policies, while all existing methods for apprenticeship learning output policies that are "mixed", i.e. randomized combinations of stationary policies. The technique used is general enough to convert any mixed policy to a stationary policy.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).
1. Introduction

In apprenticeship learning, as with policy learning for Markov decision processes (MDPs), the objective is to find a good policy for an autonomous agent, called the "apprentice", in a stochastic environment. While the setup of an apprenticeship learning problem is almost identical to that of policy learning in an MDP, there are a few key differences. In apprenticeship learning the true reward function is unknown to the apprentice, but is assumed to be a weighted combination of several known functions. The apprentice is also assumed to have access to demonstrations from another agent, called the "expert", executing a policy in the same environment. The goal of the apprentice is to find a policy that is at least as good as the expert's policy with respect to the true reward function. This is distinct from policy learning, where the goal is to find an optimal policy with respect to the true reward function (which cannot be done in this case because it is unknown).

The apprenticeship learning framework, introduced by Abbeel & Ng (2004), is motivated by a couple of observations about real applications. The first is that reward functions are often difficult to describe exactly, and yet at the same time it is usually easy to specify what the rewards must depend on. A typical example, investigated by Abbeel & Ng (2004), is driving a car. When a person drives a car, it is plausible that her behavior can be viewed as maximizing some reward function, and that this reward function depends on just a few key properties of each environment state: the speed of the car, the position of other cars, the terrain, etc. The second observation is that demonstrations of good policies by experts are often plentiful. This is certainly true in the car driving example, as it is in many other applications.

Abbeel & Ng (2004) assumed that the true reward function could be written as a linear combination of k known functions, and described an iterative algorithm that, given a small set of demonstrations of the expert policy, output an apprentice policy within O(k log k) iterations that was nearly as good as the expert's policy. Syed & Schapire (2008) gave an algorithm that achieved the same guarantee in O(log k) iterations. They also showed that by assuming that the linear combination is also a convex one, their algorithm can sometimes find an apprentice policy that is substantially better than the expert's policy. Essentially, the assumption of positive weights implies that the apprentice has some prior knowledge about which policies are better than others, and their algorithm leverages this knowledge.

Existing algorithms for apprenticeship learning share a couple of properties. One is that they each use an algorithm for finding an MDP's optimal policy (e.g. value iteration or policy iteration) as a subroutine. Another is that they output apprentice policies that are "mixtures", i.e. randomized combinations of stationary policies. A stationary policy is a function of just the current environment state.

Our first contribution in this paper is to show that, if one uses the linear programming approach for finding an MDP's optimal policy (Puterman, 1994) as a subroutine, then one can modify Syed & Schapire's (2008) algorithm so that it outputs a stationary policy instead of a mixed policy. Stationary policies are desirable for a number of reasons, e.g. they are simpler to describe, and are more natural and intuitive in terms of the behavior that they prescribe. Moreover, this technique can be straightforwardly applied to any mixed policy, such as the ones output by Abbeel & Ng's (2004) algorithms, to convert it to a stationary policy that earns the same expected cumulative reward.

Our technique leads naturally to the second contribution of this paper, which is the formulation of the apprenticeship learning problem as a linear program. We prove that the solution to this LP corresponds to an apprentice policy that has the same performance guarantees as those produced by existing algorithms, and that the efficiency of modern LP solvers results in a very substantial improvement in running time compared to Syed & Schapire's (2008) algorithm -- up to two orders of magnitude in our experiments.

In work closely related to apprenticeship learning, Ratliff, Bagnell & Zinkevich (2006) described an algorithm for learning the true reward function by assuming that the expert's policy is not very different from the optimal policy. They took this approach because they wanted to learn policies that were similar to the expert's policy. In apprenticeship learning, by contrast, the learned apprentice policy can be very different from the expert's policy.
2. Preliminaries

Formally, an apprenticeship learning problem $(S, A, \theta, \alpha, \gamma, R^1, \ldots, R^k, D)$ closely resembles a Markov decision process. At each time step t, an autonomous agent occupies a state $s_t$ from a finite set $S$, and can take an action $a_t$ from a finite set $A$. When the agent is in state s, taking action a leads to state s' with transition probability $\theta_{sas'} \triangleq \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$. Initial state probabilities are given by $\alpha_s \triangleq \Pr(s_0 = s)$. The agent decides which actions to take based on its policy $\pi$, where $\pi_{sa} \triangleq \Pr(a_t = a \mid s_t = s)$. The value of a policy $\pi$ is given by

$$V(\pi) \triangleq E\left[\sum_{t=0}^{\infty} \gamma^t R_{s_t a_t} \,\Big|\, \pi, \theta, \alpha\right],$$

where $R_{sa} \in [-1, 1]$ is the reward associated with the state-action pair (s, a), and $\gamma \in [0, 1)$ is a discount factor. An optimal policy $\pi^*$ is one that satisfies $\pi^* = \arg\max_\pi V(\pi)$. We say a policy $\pi$ is $\epsilon$-optimal if $V(\pi^*) - V(\pi) \le \epsilon$. A policy $\pi$ has occupancy measure $x^\pi$ if

$$x^\pi_{sa} = E\left[\sum_{t=0}^{\infty} \gamma^t \mathbf{1}(s_t = s \wedge a_t = a) \,\Big|\, \pi, \theta, \alpha\right] \tag{1}$$

for all s, a. In other words, $x^\pi_{sa}$ is the expected (discounted) number of visits to state-action pair (s, a) when following policy $\pi$.

Unlike an MDP, in apprenticeship learning the true reward function R is unknown. Instead, we are given basis reward functions^1 $R^1, \ldots, R^k$, where $R^i_{sa}$ is the reward of state-action pair (s, a) with respect to the ith basis reward function. We assume that the true reward function R is an unknown convex combination $w^*$ of the basis reward functions, i.e., for all s, a,

$$R_{sa} = \sum_i w^*_i R^i_{sa},$$

where the unknown weights satisfy $w^*_i \ge 0$ and $\sum_i w^*_i = 1$. Each basis reward function $R^i$ has a corresponding basis value function $V^i(\cdot)$ given by

$$V^i(\pi) \triangleq E\left[\sum_{t=0}^{\infty} \gamma^t R^i_{s_t a_t} \,\Big|\, \pi, \theta, \alpha\right].$$

^1 In (Abbeel & Ng, 2004) and (Syed & Schapire, 2008) these functions were called features, but we believe that the present terminology is better suited for conveying these functions' role.

Given the assumption of positive weights, the value of k can be viewed as a measure of how much the apprentice knows about the true reward function. If k = 1, the (only) basis value of a policy is equal to its true value, and the situation reduces to a traditional MDP. At the other extreme, if the ith basis reward function is just an indicator function for the ith state-action pair, then $k = |S \times A|$, and the basis values of a policy are equal to its occupancy measure. In this situation, the apprentice knows essentially nothing about which policies are better than others.

The positive weight assumption also implies that if for state-action pairs (s, a) and (s', a') we have $R^i_{sa} \ge R^i_{s'a'}$ for all i, then $R_{sa} \ge R_{s'a'}$. So the basis rewards themselves can encode prior knowledge about the true rewards. If we wish not to assert any such prior knowledge, we can simply add the negative of each basis reward function to the original set, thereby at most doubling the number of basis reward functions.

We also assume that we are given a data set D of M i.i.d. sample trajectories from an expert policy $\pi_E$ executing in the environment, where the mth trajectory is a sequence of state-action pairs visited by the expert, i.e., $(s^m_0, a^m_0, s^m_1, a^m_1, \ldots, s^m_H, a^m_H)$. For simplicity, we assume that all sample trajectories are truncated to the same length H.

The goal of apprenticeship learning (Abbeel & Ng, 2004) is to find an apprentice policy $\pi_A$ such that

$$V(\pi_A) \ge V(\pi_E) \tag{2}$$

even though the true value function $V(\cdot)$ is unknown (since the true reward function is unknown).

2.1. A More Refined Goal

By our assumptions about the reward functions (and the linearity of expectation), we have

$$V(\pi) = \sum_i w^*_i V^i(\pi).$$

Consequently, for any policy $\pi$, the smallest possible difference between $V(\pi)$ and $V(\pi_E)$ is $\min_i V^i(\pi) - V^i(\pi_E)$, because in the worst case $w^*_i = 1$ for the minimizing i. Based on this observation, Syed & Schapire (2008) proposed finding an apprentice policy $\pi_A$ that solves the maximin objective

$$v^* = \max_\pi \min_i V^i(\pi) - V^i(\pi_E). \tag{3}$$

Note that if $\pi_A$ is a solution to (3), then $V(\pi_A) \ge V(\pi_E) + v^*$ (because $v^* = \min_i V^i(\pi_A) - V^i(\pi_E) \le V(\pi_A) - V(\pi_E)$). We also have $v^* \ge 0$ (because $\pi = \pi_E$ is available in (3)). Therefore $\pi_A$ satisfies the goal of apprenticeship learning given in (2). Syed & Schapire (2008) showed that in some cases where $V(\pi_E)$ is small, $v^*$ is large, and so adding $v^*$ to the lower bound in (2) serves as a kind of insurance against bad experts. Our algorithms also produce apprentice policies that achieve this more refined goal.
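To make these definitions concrete, the following minimal Python sketch computes the basis values $V^i(\pi)$ of a fixed stationary policy exactly, by policy evaluation on a small randomly generated MDP. The array names and shapes (theta, alpha, pi, R_basis) are illustrative conventions of this sketch, not notation prescribed by the paper.

```python
import numpy as np

def basis_values(pi, theta, alpha, R_basis, gamma):
    """Exact discounted basis values V^i(pi) of a stationary policy pi.

    pi:      (S, A) array, pi[s, a] = probability of action a in state s
    theta:   (S, A, S) array, theta[s, a, s'] = transition probability
    alpha:   (S,) initial state distribution
    R_basis: (k, S, A) array of basis reward functions R^i
    gamma:   discount factor in [0, 1)
    """
    S, A = pi.shape
    # State-to-state transition matrix under pi: P[s, s'] = sum_a pi[s,a] * theta[s,a,s']
    P = np.einsum('sa,sat->st', pi, theta)
    # Expected basis reward per state under pi: r[i, s] = sum_a pi[s,a] * R^i[s,a]
    r = np.einsum('sa,isa->is', pi, R_basis)
    # Solve (I - gamma P) v_i = r_i for each basis reward, then average over s0 ~ alpha
    v = np.linalg.solve(np.eye(S) - gamma * P, r.T)   # shape (S, k)
    return alpha @ v                                   # entry i is V^i(pi)

# Tiny random example (illustrative only)
rng = np.random.default_rng(0)
S, A, k, gamma = 5, 3, 2, 0.9
theta = rng.random((S, A, S)); theta /= theta.sum(axis=2, keepdims=True)
alpha = np.ones(S) / S
pi = np.ones((S, A)) / A
R_basis = rng.uniform(-1, 1, (k, S, A))
w = np.array([0.3, 0.7])                               # some convex weights w*
V_basis = basis_values(pi, theta, alpha, R_basis, gamma)
print(V_basis, w @ V_basis)                            # V(pi) = sum_i w*_i V^i(pi)
```

The last line illustrates the identity $V(\pi) = \sum_i w^*_i V^i(\pi)$ for an arbitrary choice of convex weights.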
2.2. Estimating the Expert Policy's Values

Our algorithms require knowledge of the basis values of the expert's policy. From the expert's sample trajectories D, we can form an estimate $\hat{V}^{i,E}$ of $V^i(\pi_E)$ as follows:

$$\hat{V}^{i,E} = \frac{1}{M} \sum_{m=1}^{M} \sum_{t=0}^{H} \gamma^t R^i_{s^m_t a^m_t}.$$

Clearly, as the number of sample trajectories M and the truncation length H increase, the error of this estimate will decrease. Thus the issue of accurately estimating $V^i(\pi_E)$ is related to sample complexity, while in this work we are primarily concerned with computational complexity. To make our presentation cleaner, we will assume that D is large enough to yield an estimate $\hat{V}^{i,E}$ of $V^i(\pi_E)$ such that $|\hat{V}^{i,E} - V^i(\pi_E)| \le \epsilon$, for all i. We call such an estimate $\epsilon$-good. The sample complexity of apprenticeship learning is treated in (Syed & Schapire, 2008).

2.3. Policy Types

Unless otherwise noted, a policy $\pi$ is presumed to be stationary, i.e., $\pi_{sa}$ is the probability of taking action a in state s. One exception is a mixed policy. A mixed policy $\tilde{\pi}$ is defined by a set of ordered pairs $\{(\pi^j, \lambda^j)\}_{j=1}^{N}$. The policy $\tilde{\pi}$ is followed by choosing at time 0 one of the stationary policies $\pi^j$, each with probability $\lambda^j$, and then following that policy exclusively thereafter. The value of a mixed policy is the expected value of the stationary policies it comprises, i.e.,

$$V(\tilde{\pi}) = \sum_{j=1}^{N} \lambda^j V(\pi^j), \qquad V^i(\tilde{\pi}) = \sum_{j=1}^{N} \lambda^j V^i(\pi^j).$$

3. Multiplicative Weights Algorithm for Apprenticeship Learning

Syed & Schapire (2008) observed that solving the objective in (3) is equivalent to finding an optimal strategy in a certain two-player zero-sum game. Because the size of this game's matrix is exponential in the number of states |S|, they adapted a multiplicative weights method for solving extremely large games. The resulting MWAL (Multiplicative Weights Apprenticeship Learning) algorithm is described in Algorithm 1 below.

Algorithm 1 MWAL algorithm
1: Given: $(S, A, \theta, \alpha, \gamma, R^1, \ldots, R^k, D)$.
2: Using the expert's sample trajectories D, compute an $\epsilon$-good estimate $\hat{V}^{i,E}$ of $V^i(\pi_E)$, for all i.
3: Let $\beta = \left(1 + \sqrt{\tfrac{2 \log k}{T}}\right)^{-1} \in (0, 1]$.
4: Initialize $w^1_i = \tfrac{1}{k}$, for $i = 1 \ldots k$.
5: for $t = 1 \ldots T$ do
6:   Compute an $\epsilon$-optimal policy $\pi^t$ for the reward function $R_{sa} = \sum_i w^t_i R^i_{sa}$.
7:   Compute an $\epsilon$-good estimate $\hat{V}^{i,t}$ of $V^i(\pi^t)$, for $i = 1 \ldots k$.
8:   Let $w^{t+1}_i = w^t_i \, \beta^{\hat{V}^{i,t} - \hat{V}^{i,E}}$, for $i = 1 \ldots k$.
9:   Renormalize w.
10: end for
11: Return: Let apprentice policy $\pi_A$ be the mixed policy defined by $\{(\pi^t, \tfrac{1}{T})\}_{t=1}^{T}$.

In each iteration of the MWAL algorithm, an optimal policy $\pi^t$ is computed with respect to the current weight vector $w^t$. Then the weights are updated so that $w_i$ is increased/decreased if $\pi^t$ is a bad/good policy (relative to $\pi_E$) with respect to the ith basis reward function. The next theorem bounds the number of iterations T required for the MWAL algorithm to produce a good apprentice policy. The computational complexity of each iteration is discussed in Section 3.1.

Theorem 1 (Syed & Schapire (2008)). Let $\pi_A$ be the mixed policy returned by the MWAL algorithm. If

$$T \ge O\!\left(\frac{\log k}{(\epsilon(1-\gamma))^2}\right)$$

then

$$V(\pi_A) \ge V(\pi_E) + v^* - O(\epsilon),$$

where $v^* = \max_\pi \min_i V^i(\pi) - V^i(\pi_E)$.
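Before turning to concrete instantiations of Steps 6 and 7, here is a minimal Python sketch of the MWAL outer loop, with the MDP solver and basis-value estimator left as abstract callables. The helper names solve_mdp and basis_values_of are hypothetical placeholders, not part of the algorithm's specification.

```python
import numpy as np

def mwal(V_hat_E, solve_mdp, basis_values_of, k, T):
    """Sketch of the MWAL outer loop (Algorithm 1); the MDP subroutines are left abstract.

    V_hat_E:            length-k array of estimates of V^i(pi_E)
    solve_mdp(w):       returns an (approximately) optimal policy for the reward
                        R_sa = sum_i w_i R^i_sa                       (Step 6)
    basis_values_of(p): returns the length-k basis-value estimates of policy p  (Step 7)
    """
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(k) / T))   # Step 3
    w = np.ones(k) / k                                   # Step 4
    policies = []
    for t in range(T):
        pi_t = solve_mdp(w)                              # Step 6
        policies.append(pi_t)
        V_hat_t = basis_values_of(pi_t)                  # Step 7
        w = w * beta ** (V_hat_t - V_hat_E)              # Step 8: shrink weight i when pi_t beats the expert on R^i
        w /= w.sum()                                     # Step 9
    # Step 11: the mixed policy plays each pi_t with probability 1/T
    return policies
```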
3.1. MWAL-VI and MWAL-PI

The specification of the MWAL algorithm is somewhat open-ended. Step 6 requires finding an $\epsilon$-optimal policy in an MDP, and Step 7 requires computing $\epsilon$-good estimates of the basis values of that policy. There are several procedures available for accomplishing each of these steps, with each option leading to a different variant of the basic algorithm. We briefly describe some natural options, and remark on their implications for the overall computational complexity of the MWAL algorithm.

In Step 6, we can find the optimal policy using value iteration (Puterman, 1994), which has a worst-case running time of $O\!\left(\log\!\big(\tfrac{1}{\epsilon(1-\gamma)}\big)\, |S|^2 |A|\right)$. We can also use value iteration to compute the k basis values in Step 7 (this is sometimes called "policy evaluation"), which implies a worst-case running time of $O\!\left(k \log\!\big(\tfrac{1}{\epsilon(1-\gamma)}\big)\, |S|^2\right)$. We call this variant the MWAL-VI algorithm.

Another choice for Step 6 is to find the optimal policy using policy iteration (Puterman, 1994). No polynomial time bound for policy iteration is known; however, in practice it has often been observed to be faster than value iteration. We call this variant the MWAL-PI algorithm.

In Section 8, we present experiments comparing these algorithms to the ones described later in the paper.

4. Dual Methods for MDPs

As we previously observed, the MWAL algorithm must repeatedly find the optimal policy in an MDP, and this task is usually accomplished via classic iterative techniques such as value iteration and policy iteration. However, there are other techniques available for solving MDPs, and in this work we show that they can lead to better algorithms for apprenticeship learning. Consider the following linear program:

$$\max_x \sum_{s,a} R_{sa} x_{sa} \tag{4}$$

such that

$$\sum_a x_{sa} = \alpha_s + \gamma \sum_{s',a'} x_{s'a'}\, \theta_{s'a's} \quad \forall s \tag{5}$$

$$x_{sa} \ge 0 \quad \forall s, a. \tag{6}$$

It is well-known (Puterman, 1994) that if $x^*$ is a solution to (4)-(6), then $\pi^*_{sa} = \frac{x^*_{sa}}{\sum_{a'} x^*_{sa'}}$ is an optimal policy, and $x^*$ is the occupancy measure of $\pi^*$. Often (5)-(6) are called the Bellman flow constraints. The linear program in (4)-(6) is actually the dual of the linear program that is typically used to find an optimal policy in an MDP. Accordingly, solving (4)-(6) is often called the Dual LP method of solving MDPs.

Having found an optimal policy by the Dual LP method, computing its values is straightforward. The next lemma follows immediately from the definitions of the occupancy measure and value of a policy.

Lemma 1. If policy $\pi$ has occupancy measure $x^\pi$, then $V(\pi) = \sum_{s,a} R_{sa} x^\pi_{sa}$ and $V^i(\pi) = \sum_{s,a} R^i_{sa} x^\pi_{sa}$.
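The Dual LP method is easy to implement with an off-the-shelf solver. The sketch below uses scipy.optimize.linprog as one possible choice; it returns the occupancy measure, the induced stationary policy, and the value given by Lemma 1. The array conventions follow the earlier illustrative sketch and are assumptions of ours.

```python
import numpy as np
from scipy.optimize import linprog

def dual_lp_policy(R, theta, alpha, gamma):
    """Dual LP (4)-(6): maximize sum_{s,a} R_sa x_sa over the Bellman flow polytope.

    R:     (S, A) reward array
    theta: (S, A, S) transition probabilities, theta[s, a, s'] = Pr(s' | s, a)
    alpha: (S,) initial state distribution
    """
    S, A = R.shape
    n = S * A                                    # one variable x_sa per state-action pair
    # Constraint (5): sum_a x_sa - gamma * sum_{s',a'} x_{s'a'} theta_{s'a's} = alpha_s
    A_eq = np.zeros((S, n))
    for s in range(S):
        A_eq[s, s * A:(s + 1) * A] += 1.0
        A_eq[s, :] -= gamma * theta[:, :, s].reshape(n)
    res = linprog(c=-R.reshape(n), A_eq=A_eq, b_eq=alpha,
                  bounds=[(0, None)] * n, method="highs")   # (6): x >= 0
    x = res.x.reshape(S, A)                      # occupancy measure of an optimal policy
    denom = x.sum(axis=1, keepdims=True)
    # pi_sa = x_sa / sum_a' x_sa'; at states with zero occupancy the policy is arbitrary
    pi = np.divide(x, denom, out=np.full_like(x, 1.0 / A), where=denom > 0)
    V = float(np.sum(R * x))                     # Lemma 1: V(pi) = sum_{s,a} R_sa x_sa
    return pi, x, V
```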
5. Main Theoretical Tools

Recall that the MWAL algorithm produces mixed policies. In Sections 6 and 7, we will present algorithms that achieve the same theoretical guarantees as the MWAL algorithm, but produce stationary policies (and are also faster). To prove the correctness of these algorithms, we need to show that every mixed policy has an equivalent stationary policy.

In Section 4, we said that the Dual LP method of solving an MDP outputs the occupancy measure of an optimal policy. In fact, all x that satisfy the Bellman flow constraints (5)-(6) are the occupancy measure of some stationary policy, as the next theorem shows.

Theorem 2. Let x satisfy the Bellman flow constraints (5)-(6), and let $\pi_{sa} = \frac{x_{sa}}{\sum_{a'} x_{sa'}}$ be a stationary policy. Then x is the occupancy measure for $\pi$. Conversely, if $\pi$ is a stationary policy such that $x^\pi$ is its occupancy measure, then $\pi_{sa} = \frac{x^\pi_{sa}}{\sum_{a'} x^\pi_{sa'}}$ and $x^\pi$ satisfies the Bellman flow constraints.

An equivalent result to Theorem 2 is given in (Feinberg & Schwartz, 2002), p. 178. For completeness, a simple and direct proof is contained in the Appendix.

The Bellman flow constraints make it very easy to show that, for every mixed policy, there is a stationary policy that has the same value.

Theorem 3. Let $\tilde{\pi}$ be a mixed policy defined by $\{(\pi^j, \lambda^j)\}_{j=1}^{N}$, and let $x^j$ be the occupancy measure of $\pi^j$, for all j. Let $\hat{\pi}$ be a stationary policy where

$$\hat{\pi}_{sa} = \frac{\sum_j \lambda^j x^j_{sa}}{\sum_j \lambda^j \sum_{a'} x^j_{sa'}}.$$

Then $V(\hat{\pi}) = V(\tilde{\pi})$.

Proof. By Theorem 2, $x^j$ satisfies the Bellman flow constraints (5)-(6) for all j. Let $\hat{x}_{sa} = \sum_j \lambda^j x^j_{sa}$. By linearity, $\hat{x}$ also satisfies the Bellman flow constraints. Hence, by Theorem 2, the stationary policy defined by $\hat{\pi}_{sa} = \frac{\hat{x}_{sa}}{\sum_{a'} \hat{x}_{sa'}}$ has occupancy measure $\hat{x}$. Therefore,

$$V(\hat{\pi}) = \sum_{s,a} R_{sa}\, \hat{x}_{sa} = \sum_j \lambda^j \sum_{s,a} R_{sa}\, x^j_{sa} = \sum_j \lambda^j V(\pi^j) = V(\tilde{\pi}),$$

where these equalities use, in order: Lemma 1; the definition of $\hat{x}$; Lemma 1; the definition of a mixed policy.

6. MWAL-Dual Algorithm

In this section, we will make a minor modification to the MWAL algorithm so that it outputs a stationary policy instead of a mixed policy. Recall that the MWAL algorithm requires, in Steps 6 and 7, a way to compute an optimal policy and its basis values, but that no particular methods are prescribed.

Our proposal is to use the Dual LP method in Step 6 to find the occupancy measure $x^t$ of a policy $\pi^t$ that is $\epsilon$-optimal for reward function $R_{sa} = \sum_i w^t_i R^i_{sa}$. Then in Step 7 we let $\hat{V}^{i,t} = \sum_{s,a} R^i_{sa}\, x^t_{sa}$, for $i = 1 \ldots k$. Note that Lemma 1 implies $\hat{V}^{i,t} = V^i(\pi^t)$.

Now we can apply Theorem 3 to combine all the policies computed during the MWAL algorithm into a single stationary apprentice policy. This amounts to changing Step 11 to the following:

11: Return: Let apprentice policy $\pi_A$ be the stationary policy defined by

$$\pi^A_{sa} = \frac{\sum_t \frac{1}{T}\, x^t_{sa}}{\sum_t \frac{1}{T} \sum_{a'} x^t_{sa'}}.$$

We call this modified algorithm the MWAL-Dual algorithm, after the method it uses to compute optimal policies. It is straightforward to show that these changes to the MWAL algorithm do not affect its performance guarantee.

Theorem 4. Let $\pi_A$ be the stationary policy returned by the MWAL-Dual algorithm. If

$$T \ge O\!\left(\frac{\log k}{(\epsilon(1-\gamma))^2}\right)$$

then

$$V(\pi_A) \ge V(\pi_E) + v^* - O(\epsilon),$$

where $v^* = \max_\pi \min_i V^i(\pi) - V^i(\pi_E)$.

Proof. By Theorem 3, the stationary policy returned by the MWAL-Dual algorithm has the same value as the mixed policy returned by the original MWAL algorithm. Hence the guarantee in Theorem 1 applies to the MWAL-Dual algorithm as well.

Of course, the trick used here to convert a mixed policy to a stationary one is completely general, provided that the occupancy measures of the component policies can be computed. For example, this technique could be applied to the mixed policy output by the algorithms due to Abbeel & Ng (2004).

Let T(n) be the worst-case running time of an LP solver on a problem with at most n constraints and variables.^2 For a typical LP solver, $T(n) = O(n^{3.5})$ (Shu-Cherng & Puthenpura, 1993), although such solvers tend to be much faster in practice. Using this notation, we can bound the running time of Steps 6 and 7 in the MWAL-Dual algorithm. Finding an optimal policy using the Dual LP method takes T(|S||A|) time. And by Lemma 1, given the occupancy measure of a policy, we can compute its basis values in O(k|S||A|) time.

^2 Technically, the time complexity of a typical LP solver also depends on the number of bits in the problem representation.
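The conversion used in the modified Step 11, which is simply Theorem 3 applied to $\pi^1, \ldots, \pi^T$ with weights $1/T$, is straightforward to implement once occupancy measures are available. The sketch below computes occupancy measures by solving the linear system implicit in definition (1); as before, the array conventions are illustrative assumptions of ours.

```python
import numpy as np

def occupancy_measure(pi, theta, alpha, gamma):
    """Occupancy measure x^pi of a stationary policy, per definition (1)."""
    S, A = pi.shape
    P = np.einsum('sa,sat->st', pi, theta)                 # P[s, s'] = sum_a pi_sa theta_sas'
    d = np.linalg.solve(np.eye(S) - gamma * P.T, alpha)    # discounted state visitation
    return pi * d[:, None]                                 # x_sa = pi_sa * d_s

def mixed_to_stationary(policies, lambdas, theta, alpha, gamma):
    """Theorem 3: collapse a mixed policy {(pi^j, lambda^j)} into one stationary policy."""
    x_hat = sum(lam * occupancy_measure(p, theta, alpha, gamma)
                for p, lam in zip(policies, lambdas))      # hat{x}_sa = sum_j lambda^j x^j_sa
    denom = x_hat.sum(axis=1, keepdims=True)
    # hat{pi}_sa = hat{x}_sa / sum_a' hat{x}_sa'; arbitrary (uniform) at zero-occupancy states
    return np.divide(x_hat, denom,
                     out=np.full_like(x_hat, 1.0 / x_hat.shape[1]), where=denom > 0)
```

For the MWAL-Dual return step, policies would be $\pi^1, \ldots, \pi^T$ and lambdas would all equal $1/T$ (and the occupancy measures $x^t$ are already available from the Dual LP, so the first helper would not be needed in that case).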
7. LPAL Algorithm

We now describe a way to use the Bellman flow constraints to find a good apprentice policy in a much more direct fashion than the MWAL algorithm. Recall the objective function proposed in (Syed & Schapire, 2008) for solving apprenticeship learning:

$$v^* = \max_\pi \min_i V^i(\pi) - V^i(\pi_E). \tag{7}$$

We observed earlier that, if $\pi_A$ is a solution to (7), then $V(\pi_A) \ge V(\pi_E) + v^*$, and that $v^* \ge 0$. In this section, we describe a linear program that solves (7). In Section 8, we describe experiments that show that this approach is much faster than the MWAL algorithm, although it does have some disadvantages, which we also illustrate in Section 8.

Our LPAL (Linear Programming Apprenticeship Learning) algorithm is given in Algorithm 2. The basic idea is to use the Bellman flow constraints (5)-(6) and Lemma 1 to define a feasible set containing all (occupancy measures of) stationary policies whose basis values are above a certain lower bound, and then maximize this bound.

Algorithm 2 LPAL algorithm
1: Given: $(S, A, \theta, \alpha, \gamma, R^1, \ldots, R^k, D)$.
2: Using the expert's sample trajectories D, compute an $\epsilon$-good estimate $\hat{V}^{i,E}$ of $V^i(\pi_E)$, for all i.
3: Find a solution $(B^*, x^*)$ to this linear program:

$$\max_{B, x} B \tag{8}$$

such that

$$B \le \sum_{s,a} R^i_{sa}\, x_{sa} - \hat{V}^{i,E} \quad \forall i \tag{9}$$

$$\sum_a x_{sa} = \alpha_s + \gamma \sum_{s',a'} x_{s'a'}\, \theta_{s'a's} \quad \forall s \tag{10}$$

$$x_{sa} \ge 0 \quad \forall s, a. \tag{11}$$

4: Return: Let apprentice policy $\pi_A$ be the stationary policy defined by $\pi^A_{sa} = \frac{x^*_{sa}}{\sum_{a'} x^*_{sa'}}$.

Theorem 5. Let $\pi_A$ be the stationary policy returned by the LPAL algorithm. Then

$$V(\pi_A) \ge V(\pi_E) + v^* - O(\epsilon),$$

where $v^* = \max_\pi \min_i V^i(\pi) - V^i(\pi_E)$.

Proof. By Theorem 2, the Bellman flow constraints (10)-(11) imply that all feasible x correspond to the occupancy measure of some stationary policy $\pi$. Using this fact and Lemma 1, we conclude that solving the linear program is equivalent to finding $(B^*, \pi_A)$ such that

$$B^* = \min_i V^i(\pi_A) - \hat{V}^{i,E}$$

and $B^*$ is as large as possible. Since $|\hat{V}^{i,E} - V^i(\pi_E)| \le \epsilon$ for all i, we know that $B^* \ge v^* - \epsilon$. Together with (9) and Lemma 1 this implies

$$V^i(\pi_A) = \sum_{s,a} R^i_{sa}\, x^*_{sa} \ge \hat{V}^{i,E} + B^* \ge V^i(\pi_E) + v^* - 2\epsilon.$$

Note that the overall worst-case running time of the LPAL algorithm is T(|S||A| + k), where T(n) is the complexity of an LP solver.
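A minimal sketch of the LPAL linear program (8)-(11), again using scipy.optimize.linprog as one possible off-the-shelf solver, is given below. The decision variables are the bound B and the flattened occupancy measure x; all names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def lpal(R_basis, V_hat_E, theta, alpha, gamma):
    """Sketch of the LPAL linear program (8)-(11).

    R_basis: (k, S, A) basis reward functions
    V_hat_E: length-k estimates of the expert's basis values
    """
    k, S, A = R_basis.shape
    n = S * A
    # Decision variables: z = [B, x_00, ..., x_{S-1,A-1}]; objective (8) maximizes B
    c = np.zeros(n + 1); c[0] = -1.0
    # (9): B - sum_{s,a} R^i_sa x_sa <= -V_hat^{i,E}, one row per basis reward function
    A_ub = np.hstack([np.ones((k, 1)), -R_basis.reshape(k, n)])
    b_ub = -np.asarray(V_hat_E)
    # (10): Bellman flow equalities, exactly as in the Dual LP
    A_eq = np.zeros((S, n + 1))
    for s in range(S):
        A_eq[s, 1 + s * A:1 + (s + 1) * A] += 1.0
        A_eq[s, 1:] -= gamma * theta[:, :, s].reshape(n)
    # (11): x >= 0, while B is unrestricted in sign
    bounds = [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=alpha,
                  bounds=bounds, method="highs")
    B_star, x = res.x[0], res.x[1:].reshape(S, A)
    denom = x.sum(axis=1, keepdims=True)
    pi_A = np.divide(x, denom, out=np.full_like(x, 1.0 / A), where=denom > 0)
    return pi_A, B_star
```

The solver is called exactly once, which is the source of the speedups reported in Section 8.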
8. Experiments

8.1. Gridworld

We tested each algorithm in gridworld environments that closely resemble those in the experiments of Abbeel & Ng (2004). Each gridworld is an N × N square of states. Movement is possible in the four compass directions, and each action has a 30% chance of causing a transition to a random state. Each gridworld is partitioned into several square regions, each of size M × M. We always choose M so that it evenly divides N, so that each gridworld has $k = (N/M)^2$ regions. Each gridworld also has k basis reward functions, where the ith basis reward function $R^i$ is a 0-1 indicator function for the ith region.

For each gridworld, in each trial, we randomly chose a sparse weight vector $w^*$. Recall that the true reward function has the form $R(s) = \sum_i w^*_i R^i(s)$, so in these experiments the true reward function just encodes that some regions are more desirable than others. In each trial, we let the expert policy $\pi_E$ be the optimal policy with respect to R, and then supplied the basis values $V^i(\pi_E)$, for all i, to the MWAL-VI, MWAL-PI, MWAL-Dual and LPAL algorithms.^3

^3 Typically in practice, $\pi_E$ will be unknown, and so the basis values would need to be estimated from the data set of expert sample trajectories D. However, since we are primarily concerned with computational complexity in this work, and not sample complexity, we sidestep this issue and just compute each $V^i(\pi_E)$ directly.

Our experiments were run on an ordinary desktop computer. We used the Matlab-based cvx package (Grant & Boyd, 2008) for our LP solver. Each of the values in the tables below is the time, in seconds, that the algorithm took to find an apprentice policy $\pi_A$ such that $V(\pi_A) \ge 0.95\, V(\pi_E)$. Each running time is the average of 10 trials.

In the first set of experiments (Table 1), we tested the algorithms in gridworlds of varying sizes, while keeping the number of regions in each gridworld fixed (64 regions). Recall that the number of regions is equal to the number of basis reward functions.

Table 1. Time (sec) to find $\pi_A$ s.t. $V(\pi_A) \ge 0.95\, V(\pi_E)$

Gridworld Size | MWAL-VI (sec) | MWAL-PI (sec) | MWAL-Dual (sec) | LPAL (sec)
16 × 16   | 6.43    | 5.78    | 46.99    | 1.46
24 × 24   | 14.45   | 10.27   | 90.16    | 1.55
32 × 32   | 27.23   | 15.04   | 247.38   | 2.76
48 × 48   | 61.37   | 35.33   | 791.61   | 8.62
64 × 64   | 114.12  | 85.26   | 3651.70  | 30.52
128 × 128 | 406.24  | 307.58  | 4952.74  | 80.21
256 × 256 | 1873.93 | 1469.56 | 29988.85 | 588.60

In the next set of experiments (Table 2), we varied the number of regions while keeping the size of the gridworld fixed.

Table 2. Time (sec) to find $\pi_A$ s.t. $V(\pi_A) \ge 0.95\, V(\pi_E)$

Gridworld Size | Number of Regions | MWAL-VI (sec) | MWAL-PI (sec) | MWAL-Dual (sec) | LPAL (sec)
24 × 24 | 64   | 14.45   | 10.27   | 90.16   | 1.55
24 × 24 | 144  | 32.33   | 20.06   | 97.58   | 2.64
24 × 24 | 576  | 129.87  | 75.81   | 120.82  | 1.86
32 × 32 | 64   | 27.23   | 15.04   | 247.38  | 2.76
32 × 32 | 256  | 107.11  | 60.24   | 270.71  | 8.43
32 × 32 | 1024 | 440.64  | 267.12  | 361.36  | 4.75
48 × 48 | 64   | 61.37   | 35.33   | 791.61  | 8.62
48 × 48 | 144  | 135.83  | 79.88   | 800.23  | 11.42
48 × 48 | 256  | 244.46  | 150.08  | 815.66  | 16.89
48 × 48 | 576  | 575.34  | 352.15  | 847.38  | 16.33
48 × 48 | 2304 | 2320.71 | 1402.10 | 1128.32 | 11.14

Several remarks about these results are in order. For every gridworld size and every number of regions, the LPAL algorithm is substantially faster than the other algorithms -- in some cases two orders of magnitude faster. As we previously noted, LP solvers are often much more efficient than their theoretical guarantees suggest. Interestingly, in Table 2, the running time for LPAL eventually decreases as the number of regions increases. This may be because the number of constraints in the linear program increases with the number of regions, and more constraints often make a linear program easier to solve. Also, the MWAL-Dual algorithm is much slower than the other algorithms. We suspect this is only because the MWAL-Dual algorithm calls the LP solver in each iteration (unlike the LPAL algorithm, which calls it just once), and there is substantial overhead to doing this. Modifying MWAL-Dual so that it uses the LP solver as less of a black box may be a way to alleviate this problem.
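For readers who wish to reproduce a setup of this flavor, the sketch below constructs transition probabilities and region-indicator basis rewards for a gridworld of this kind. The 30% random transition is implemented here as a jump to a uniformly random state, and the initial state distribution is taken to be uniform; both are our assumptions about details the text leaves unspecified.

```python
import numpy as np

def gridworld(N, M, slip=0.3):
    """Illustrative N x N gridworld with (N/M)^2 region-indicator basis rewards."""
    S, A = N * N, 4
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]           # four compass directions
    k = (N // M) ** 2
    theta = np.zeros((S, A, S))
    R_basis = np.zeros((k, S, A))
    for r in range(N):
        for c in range(N):
            s = r * N + c
            region = (r // M) * (N // M) + (c // M)
            R_basis[region, s, :] = 1.0                  # 0-1 indicator of the ith region
            for a, (dr, dc) in enumerate(moves):
                nr = min(max(r + dr, 0), N - 1)          # movement is clipped at the walls
                nc = min(max(c + dc, 0), N - 1)
                theta[s, a, :] = slip / S                # 30% chance: jump to a random state
                theta[s, a, nr * N + nc] += 1.0 - slip   # otherwise: the intended move
    alpha = np.ones(S) / S
    return theta, alpha, R_basis
```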
8.2. Car driving

In light of the results from the previous section, one might reasonably wonder whether there is any argument for using an algorithm other than LPAL. Recall that, in those experiments, the expert's policy was an optimal policy for the unknown reward function. In this section we explore the behavior of each algorithm when this is not the case, and find that MWAL produces better apprentice policies than LPAL.

Our experiments were run in a car driving simulator modeled after the environments in (Abbeel & Ng, 2004) and (Syed & Schapire, 2008). The task in our driving simulator is to navigate a car on a busy three-lane highway. The available actions are to move left, move right, drive faster, or drive slower. There are three basis reward functions that map each environment state to a numerical reward: collision (0 if in contact with another car, and 1/2 otherwise), off-road (0 if on the grass, and 1/2 otherwise), and speed (1/2, 3/4 and 1 for each of the three possible speeds, with higher values corresponding to higher speeds). The true reward function is assumed to be some unknown weighted combination $w^*$ of the basis reward functions. Since the weights are assumed to be positive, by examining the basis reward functions we see that the true reward function assigns higher reward to states that are intuitively "better".

We designed three experts for these experiments, described in Table 3. Each expert is optimal for one of the basis reward functions, and mediocre for the other two. Therefore each expert policy $\pi_E$ is an optimal policy if $w^* = w^E$, where $w^E$ is the weight vector that places all weight on the basis reward function for which $\pi_E$ is optimal. At the same time, each $\pi_E$ is very likely to be suboptimal for a randomly chosen $w^*$.

Table 3. Expert types

Expert          | Speed | Collisions (per sec) | Off-roads (per sec)
"Fast" expert   | Fast  | 1.1 | 10
"Avoid" expert  | Slow  | 0   | 10
"Road" expert   | Slow  | 1.1 | 0

We used the MWAL and LPAL algorithms to learn apprentice policies from each of these experts.^4 The results are presented in Table 4. We let $\gamma = 0.9$, so the maximum value of the basis value function corresponding to speed was 10, and for the others it was 5. Each of the reported policy values for randomly chosen $w^*$ was averaged over 10,000 uniformly sampled $w^*$'s. Notice that for each expert, when $w^*$ is chosen randomly, MWAL outputs better apprentice policies than LPAL.

^4 Each of the MWAL variants behaved in exactly the same way in this experiment. The results presented are for the MWAL-PI variant.

Table 4. Driving simulator experiments.

Expert type | Algorithm used | $w^* = w^E$: $V(\pi_A)$ | $w^* = w^E$: $V(\pi_E)$ | $w^*$ random: $V(\pi_A)$ | $w^*$ random: $V(\pi_E)$
"Fast"  | MWAL | 10 | 10 | 9.83 | 8.25
"Fast"  | LPAL | 10 | 10 | 8.84 | 8.25
"Avoid" | MWAL | 5  | 5  | 8.76 | 6.32
"Avoid" | LPAL | 5  | 5  | 7.26 | 6.32
"Road"  | MWAL | 5  | 5  | 9.74 | 7.49
"Road"  | LPAL | 5  | 5  | 8.12 | 7.49

9. Conclusion and Future Work

Each of the algorithms for apprenticeship learning presented here has advantages and disadvantages that make it better suited to different situations. As our experiments showed, the LPAL algorithm is much faster than any of the MWAL variants, and so is most appropriate for problems with large state spaces or many basis reward functions. And unlike the original MWAL algorithm, it produces a stationary policy, which makes it a good choice whenever a simple and easily interpretable apprentice policy is desired. On the other hand, we also presented evidence that LPAL performs poorly when the expert policy is far from an optimal policy for the true reward function. If one suspects in advance that this may be the case, then one of the MWAL variants would be a better choice for a learning algorithm. Among these variants, only MWAL-Dual produces a stationary policy, although it has the drawback of being the slowest algorithm that we tested.

Although the theoretical performance guarantees of the MWAL and LPAL algorithms are identical, the results in Table 4 suggest that the two algorithms are not equally effective. It seems possible that the current theoretical guarantees for the MWAL algorithm are not as strong as they could be. Investigation of this possibility is ongoing work.

One way to describe the poor performance of the LPAL algorithm versus MWAL is to say that, when there are several policies that are better than the expert's policy, the LPAL algorithm fails to optimally break these "ties". This characterization suggests that recent techniques for computing robust strategies in games (Johanson et al., 2008) may be an avenue for improving the LPAL algorithm.

It would also be interesting to examine practically and theoretically how apprenticeship learning can be combined with MDP approximation techniques. In particular, the dual linear programming approach in this work might combine nicely with recent work on stable MDP approximation techniques based on the dual form (Wang et al., 2008).

Acknowledgements

We would like to thank Michael Littman, Warren Powell, Michele Sebag and the anonymous reviewers for their helpful comments. This work was supported by the NSF under grant IIS-0325500.

References

Abbeel, P., & Ng, A. (2004). Apprenticeship learning via inverse reinforcement learning. Proceedings of the International Conference on Machine Learning.

Feinberg, E. A., & Schwartz, A. (2002). Handbook of Markov Decision Processes: Methods and Applications. Springer.

Grant, M., & Boyd, S. (2008). CVX: Matlab software for disciplined convex programming (web page and software). http://stanford.edu/~boyd/cvx.

Horn, R. A., & Johnson, C. R. (1985). Matrix Analysis. Cambridge University Press.

Johanson, M., Zinkevich, M., & Bowling, M. (2008). Computing robust counter-strategies. Advances in Neural Information Processing Systems.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons.

Ratliff, N. D., Bagnell, J. A., & Zinkevich, M. A. (2006). Maximum margin planning. Proceedings of the International Conference on Machine Learning.

Shu-Cherng, & Puthenpura, S. (1993). Linear Optimization and Extensions: Theory and Algorithms. Prentice Hall.

Syed, U., & Schapire, R. E. (2008). A game-theoretic approach to apprenticeship learning. Advances in Neural Information Processing Systems.

Wang, T., Lizotte, D., Bowling, M., & Schuurmans, D. (2008). Stable dual dynamic programming. Advances in Neural Information Processing Systems.
10. Appendix

This is a proof of Theorem 2. Before proceeding, we introduce another linear system. For any stationary policy $\pi$, the $\pi$-specific Bellman flow constraints are given by the following linear system in which the $x_{sa}$ variables are unknown:

$$x_{sa} = \pi_{sa}\, \alpha_s + \gamma\, \pi_{sa} \sum_{s',a'} x_{s'a'}\, \theta_{s'a's} \quad \forall s, a \tag{12}$$

$$x_{sa} \ge 0 \quad \forall s, a. \tag{13}$$

The next lemma shows that the $\pi$-specific Bellman flow constraints have a solution.

Lemma 2. For any stationary policy $\pi$, the occupancy measure $x^\pi$ of $\pi$ satisfies the $\pi$-specific Bellman flow constraints (12)-(13).

Proof. Clearly, $x^\pi_{sa}$ is non-negative for all s, a, and so (13) is satisfied. As for (12), we simply plug in the definition of $x^\pi_{sa}$ from (1). (In the following derivation, all the expectations and probabilities are conditioned on $\pi$, $\theta$, and $\alpha$. They have been omitted from the notation for brevity.)

$$\begin{aligned}
x^\pi_{sa} &= E\left[\sum_{t=0}^{\infty} \gamma^t\, \mathbf{1}(s_t = s \wedge a_t = a)\right] \\
&= \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s, a_t = a) \\
&= \pi_{sa}\, \alpha_s + \sum_{t=0}^{\infty} \gamma^{t+1} \Pr(s_{t+1} = s, a_{t+1} = a) \\
&= \pi_{sa}\, \alpha_s + \sum_{t=0}^{\infty} \gamma^{t+1} \sum_{s',a'} \Pr(s_t = s', a_t = a', s_{t+1} = s, a_{t+1} = a) \\
&= \pi_{sa}\, \alpha_s + \sum_{t=0}^{\infty} \gamma^{t+1} \sum_{s',a'} \Pr(s_t = s', a_t = a')\, \theta_{s'a's}\, \pi_{sa} \\
&= \pi_{sa}\, \alpha_s + \gamma\, \pi_{sa} \sum_{s',a'} \theta_{s'a's}\, E\left[\sum_{t=0}^{\infty} \gamma^t\, \mathbf{1}(s_t = s' \wedge a_t = a')\right] \\
&= \pi_{sa}\, \alpha_s + \gamma\, \pi_{sa} \sum_{s',a'} x^\pi_{s'a'}\, \theta_{s'a's}.
\end{aligned}$$

Now we show that the solution to the $\pi$-specific Bellman flow constraints given by Lemma 2 is unique.

Lemma 3. For any stationary policy $\pi$, the $\pi$-specific Bellman flow constraints (12)-(13) have at most one solution.

Proof. Define the matrix

$$A_{(sa,\, s'a')} \triangleq \begin{cases} 1 - \gamma\, \pi_{sa}\, \theta_{s'a's} & \text{if } (s, a) = (s', a') \\ -\gamma\, \pi_{sa}\, \theta_{s'a's} & \text{otherwise} \end{cases}$$

and the vector $b_{sa} \triangleq \pi_{sa}\, \alpha_s$. (Note that A and b are indexed by state-action pairs.) We can re-write (12)-(13) equivalently as

$$A x = b \tag{14}$$

$$x \ge 0. \tag{15}$$

The matrix A is column-wise strictly diagonally dominant. This is because $\sum_a \pi_{sa} = 1$ for all s, $\sum_s \theta_{s'a's} = 1$, and $\gamma < 1$, so

$$\sum_{s,a} \gamma\, \pi_{sa}\, \theta_{s'a's} = \gamma < 1,$$

and therefore

$$|A_{(s'a',\, s'a')}| = 1 - \gamma\, \pi_{s'a'}\, \theta_{s'a's'} > \gamma - \gamma\, \pi_{s'a'}\, \theta_{s'a's'} = \sum_{(s,a) \ne (s',a')} \gamma\, \pi_{sa}\, \theta_{s'a's} = \sum_{(s,a) \ne (s',a')} |A_{(sa,\, s'a')}|,$$

which is the definition of column-wise strict diagonal dominance. This implies that A is non-singular (Horn & Johnson, 1985), so (14)-(15) has at most one solution.
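The structure exploited by Lemma 3 is easy to check numerically. The sketch below builds the matrix A and vector b for a small random MDP and policy, verifies column-wise strict diagonal dominance, and solves (14); the setup is illustrative only.

```python
import numpy as np

def pi_specific_system(pi, theta, alpha, gamma):
    """Matrix A and vector b of Lemma 3, indexed by flattened state-action pairs (s, a)."""
    S, nA = pi.shape
    n = S * nA
    A_mat = np.zeros((n, n))
    b = np.zeros(n)
    for s in range(S):
        for a in range(nA):
            row = s * nA + a
            b[row] = pi[s, a] * alpha[s]
            for sp in range(S):
                for ap in range(nA):
                    col = sp * nA + ap
                    # 1 - gamma*pi_sa*theta_{s'a's} on the diagonal, -gamma*pi_sa*theta_{s'a's} off it
                    A_mat[row, col] = (row == col) - gamma * pi[s, a] * theta[sp, ap, s]
    return A_mat, b

# Tiny random MDP and policy (illustrative only)
rng = np.random.default_rng(1)
S, nA, gamma = 4, 2, 0.9
theta = rng.random((S, nA, S)); theta /= theta.sum(axis=2, keepdims=True)
alpha = np.ones(S) / S
pi = rng.random((S, nA)); pi /= pi.sum(axis=1, keepdims=True)

A_mat, b = pi_specific_system(pi, theta, alpha, gamma)
col_diag = np.abs(np.diag(A_mat))
col_off = np.abs(A_mat).sum(axis=0) - col_diag     # off-diagonal column sums
assert np.all(col_diag > col_off)                  # column-wise strict diagonal dominance
x = np.linalg.solve(A_mat, b)                      # the unique solution of (14)
assert np.all(x >= -1e-12)                         # and it is non-negative, as in (15)
```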
We are now ready to prove Theorem 2.

Proof of Theorem 2. For the first direction of the theorem, we assume that x satisfies the Bellman flow constraints (5)-(6), and that $\pi_{sa} = \frac{x_{sa}}{\sum_{a'} x_{sa'}}$. Therefore,

$$x_{sa} = \pi_{sa} \sum_{a'} x_{sa'} = \pi_{sa}\left(\alpha_s + \gamma \sum_{s',a'} x_{s'a'}\, \theta_{s'a's}\right). \tag{16}$$

Clearly x is a solution to the $\pi$-specific Bellman flow constraints (12)-(13), and Lemmas 2 and 3 imply that x is the occupancy measure of $\pi$.

For the other direction of the theorem, we assume that $x^\pi$ is the occupancy measure of $\pi$. Lemmas 2 and 3 imply that $x^\pi$ is the unique solution to the $\pi$-specific Bellman flow constraints (12)-(13). Therefore, $x^\pi$ is given by (16). And since $\sum_a \pi_{sa} = 1$, summing (16) over a gives

$$\sum_a x^\pi_{sa} = \alpha_s + \gamma \sum_{s',a'} x^\pi_{s'a'}\, \theta_{s'a's},$$

which shows that $x^\pi$ satisfies the Bellman flow constraints, and which can also be combined with (16) to show that $\pi_{sa} = \frac{x^\pi_{sa}}{\sum_{a'} x^\pi_{sa'}}$.