A Game-Theoretic Approach to Apprenticeship Learning

Umar Syed
Computer Science Department
Princeton University
35 Olden St
Princeton, NJ 08540-5233
usyed@cs.princeton.edu

Robert E. Schapire
Computer Science Department
Princeton University
35 Olden St
Princeton, NJ 08540-5233
schapire@cs.princeton.edu

Abstract

We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng [1], who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert's, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert's. Moreover, our algorithm is computationally much faster, is easy to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [2] for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we sketch how the method can be applied when the transition function itself is unknown, and we provide an experimental demonstration of the algorithm on a toy video-game environment.

1 Introduction

When an agent is faced with the task of learning how to behave in a stochastic environment, a common approach is to model the situation using a Markov Decision Process (MDP). An MDP consists of states, actions, rewards and a transition function. Once an MDP has been provided, the usual objective is to find a policy (i.e. a mapping from states to actions) that maximizes the expected cumulative reward collected by the agent.

Building the MDP model is usually the most difficult part of this process. One reason is that it is often hard to correctly describe the environment's true reward function, and yet the behavior of the agent is quite sensitive to this description. In practice, reward functions are frequently tweaked and tuned to elicit what is thought to be the desired behavior.

Instead of maximizing reward, another approach often taken is to observe and follow the behavior of an expert in the same environment. Learning how to behave by observing an expert has been called apprenticeship learning, with the agent in the role of the apprentice. Abbeel and Ng [1] proposed a novel and appealing framework for apprenticeship learning. In this framework, the reward function, while unknown to the apprentice, is assumed to be equal to a linear combination of a set of known features. They argued that while it may be difficult to correctly describe the reward function, it is usually much easier to specify the features on which the reward function depends.

With this setting in mind, Abbeel and Ng [1] described an efficient algorithm that, given enough examples of the expert's behavior, produces a policy that does at least as well as the expert with respect to the unknown reward function. The number of examples their algorithm requires from the expert depends only moderately on the number of features. While impressive, a drawback of their results is that the performance of the apprentice is both upper- and lower-bounded by the performance of the expert.
Essentially, their algorithm is an efficient method for mimicking the expert's behavior. If the behavior of the expert is far from optimal, the same will hold for the apprentice.

In this paper, we take a somewhat different approach to apprenticeship learning that addresses this issue, while also significantly improving on other aspects of Abbeel and Ng's [1] results. We pose the problem as learning to play a two-player zero-sum game in which the apprentice chooses a policy, and the environment chooses a reward function. The goal of the apprentice is to maximize performance relative to the expert, even though the reward function may be adversarially selected by the environment with respect to this goal. A key property of our algorithm is that it is able to leverage prior beliefs about the relationship between the features and the reward function. Specifically, if it is known whether a feature is "good" (related to reward) or "bad" (inversely related to reward), then the apprentice can use that knowledge to improve its performance. As a result, our algorithm produces policies that can be significantly better than the expert's policy with respect to the unknown reward function, while at the same time being guaranteed to be no worse.

Our approach is based on a multiplicative weights algorithm for solving two-player zero-sum games due to Freund and Schapire [2]. Their algorithm is especially well-suited to solving zero-sum games in which the "game matrix" is extremely large. It turns out that our apprenticeship learning setting can be viewed as a game with this property.

Our results represent a strict improvement over those of Abbeel and Ng [1] in that our algorithm is considerably simpler, provides the same lower bound on the apprentice's performance relative to the expert, and removes the upper bound on the apprentice's performance. Moreover, our algorithm requires substantially less computational expense -- specifically, we are able to achieve their performance guarantee after only O(ln k) iterations, instead of O(k ln k), where k is the number of features on which the reward function depends. Additionally, our algorithm can be applied to a setting in which no examples are available from the expert. In that case, our algorithm produces a policy that is optimal in a certain conservative sense. We are also able to extend our algorithm to a situation where the MDP's transition function is unknown. We conducted experiments with a small car driving simulation that illustrate some of our theoretical findings.

2 Preliminaries

Our problem setup largely parallels that outlined in Abbeel and Ng [1]. We are given an infinite-horizon Markov Decision Process in which the reward function has been replaced by a set of features. Specifically, we are given an MDP\R M = (S, A, γ, D, θ, φ), consisting of finite state and action sets S and A, discount factor γ, initial state distribution D, transition function θ(s, a, s′) ≜ Pr(s_{t+1} = s′ | s_t = s, a_t = a), and a set of k features defined by the function φ : S → R^k. The true reward function R* is unknown. For ease of exposition, we assume that R*(s) = w* · φ(s) for some w* ∈ R^k, although we also show how our analysis extends to the case when this does not hold.

For any policy π in M, the value of π (with respect to the initial state distribution) is defined by

    V(π) ≜ E[ Σ_{t=0}^∞ γ^t R*(s_t) | π, θ, D ],

where the initial state s_0 is chosen according to D, and the remaining states are chosen according to π and θ. We also define a k-length feature expectations vector,

    µ(π) ≜ E[ Σ_{t=0}^∞ γ^t φ(s_t) | π, θ, D ].

From its definition, it should be clear that "feature expectations" is a (somewhat misleading) abbreviation for "expected, cumulative, discounted feature values." Importantly, since R*(s) = w* · φ(s), we have V(π) = w* · µ(π), by linearity of expectation.
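As a concrete illustration of these definitions (a sketch of ours, not part of the original paper): for a fixed stationary policy, µ(π), and hence V(π), can be computed exactly by solving a single linear system, assuming the policy's transition matrix, the feature matrix, and the initial distribution are available as NumPy arrays. All names below are ours.

```python
import numpy as np

def feature_expectations(P_pi, Phi, D, gamma):
    """Exact discounted feature expectations mu(pi) of a fixed stationary policy.

    P_pi  : (S, S) array with P_pi[s, s2] = theta(s, pi(s), s2)
    Phi   : (S, k) array whose rows are the feature vectors phi(s)
    D     : (S,)   initial state distribution
    gamma : discount factor
    """
    S = P_pi.shape[0]
    # Discounted state-visitation frequencies: d = sum_t gamma^t (P_pi^T)^t D
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, D)
    return Phi.T @ d  # length-k vector mu(pi)

# By linearity, V(pi) = w_star @ feature_expectations(P_pi, Phi, D, gamma)
# whenever R*(s) = w_star . phi(s).
```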
We say that a feature expectations vector µ̂ is an ε-good estimate of µ(π) if ‖µ̂ - µ(π)‖_∞ ≤ ε. Likewise, we say that a policy π̂ is ε-optimal for M if |V(π̂) - V(π*)| ≤ ε, where π* is an optimal policy for M, i.e. π* = arg max_π V(π). (Note that this is weaker than the standard definition of optimality, as the policy only needs to be optimal with respect to the initial state distribution, and not necessarily at every state simultaneously.)

We also assume that there is a policy π_E, called the expert's policy, which we are able to observe executing in M. Following Abbeel and Ng [1], our goal is to find a policy π such that V(π) ≥ V(π_E) - ε, even though the true reward function R* is unknown. We also have the additional goal of finding a policy when no observations from the expert's policy are available. In that case, we find a policy that is optimal in a certain conservative sense.

Like Abbeel and Ng [1], the policy we find will not necessarily be stationary, but will instead be a mixed policy. A mixed policy ψ is a distribution over Π, the set of all deterministic stationary policies in M. Because Π is finite (though extremely large), we can fix a numbering of the policies in Π, which we denote π^1, ..., π^|Π|. This allows us to treat ψ as a vector, where ψ(i) is the probability assigned to π^i. A mixed policy ψ is executed by randomly selecting the policy π^i at time 0 with probability ψ(i), and exclusively following π^i thereafter. It should be noted that the definitions of value and feature expectations apply to mixed policies as well: V(ψ) = E_{i∼ψ}[V(π^i)] and µ(ψ) = E_{i∼ψ}[µ(π^i)]. Also note that mixed policies do not have any advantage over stationary policies in terms of value: an optimal stationary policy for M has the same value as an optimal mixed policy.

The observations from the expert's policy π_E are in the form of m independent trajectories in M, each for simplicity of the same length H. A trajectory is just the sequence of states visited by the expert: s^i_0, s^i_1, ..., s^i_H for the ith trajectory. Let µ_E = µ(π_E) be the expert's feature expectations. We compute an estimate µ̂_E of µ_E by averaging the observed feature values from the trajectories:

    µ̂_E = (1/m) Σ_{i=1}^m Σ_{t=0}^H γ^t φ(s^i_t).
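The averaging above is straightforward to implement. A minimal sketch (ours; the function and argument names are hypothetical), assuming each trajectory is a list of states and phi maps a state to a length-k NumPy array:

```python
import numpy as np

def estimate_expert_mu(trajectories, phi, gamma):
    """Empirical estimate of mu_E from m expert trajectories.

    trajectories : list of m state sequences [s_0, s_1, ..., s_H]
    phi          : function mapping a state to a length-k numpy array
    gamma        : discount factor
    """
    total = None
    for traj in trajectories:
        # Discounted feature sum along one trajectory: sum_t gamma^t phi(s_t)
        disc = sum(gamma ** t * phi(s) for t, s in enumerate(traj))
        total = disc if total is None else total + disc
    return total / len(trajectories)  # average over the m trajectories
```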
3 Review of the Projection Algorithm

We compare our approach to the "projection algorithm" of Abbeel and Ng [1], which finds a policy that is at least as good as the expert's policy with respect to the unknown reward function. (Abbeel and Ng [1] actually presented two algorithms for this task; both have the same theoretical guarantees, but the projection algorithm is simpler and was empirically shown to be slightly faster.)

Abbeel and Ng [1] assume that φ(s) ∈ [0, 1]^k, and that R*(s) = w* · φ(s) for some w* ∈ B_k, where B_k = {w ∈ R^k : ‖w‖_1 ≤ 1}. Given m independent trajectories from the expert's policy, the projection algorithm runs for T iterations. It returns a mixed policy ψ such that ‖µ(ψ) - µ_E‖_2 ≤ ε, as long as T and m are sufficiently large. In other words, their algorithm seeks to "match" the expert's feature expectations. The value of ψ will necessarily be close to that of the expert's policy, since

    |V(ψ) - V(π_E)| = |w* · µ(ψ) - w* · µ_E| ≤ ‖w*‖_2 ‖µ(ψ) - µ_E‖_2 ≤ ε,    (1)

where in Eq. (1) we used the Cauchy-Schwarz inequality and ‖w*‖_2 ≤ ‖w*‖_1 ≤ 1.

The following theorem is the main result in Abbeel and Ng [1]. However, some aspects of their analysis are not covered by this theorem, such as the complexity of each iteration of the projection algorithm, and the sensitivity of the algorithm to various approximations. These are discussed immediately below.

Theorem 1 (Abbeel and Ng [1]). Given an MDP\R, and m independent trajectories from an expert's policy π_E. Suppose we execute the projection algorithm for T iterations. Let ψ be the mixed policy returned by the algorithm. Then in order for

    |V(ψ) - V(π_E)| ≤ ε    (2)

to hold with probability at least 1 - δ, it suffices that

    T ≥ O( (k / ((1 - γ)² ε²)) ln (k / ((1 - γ) ε)) )    and    m ≥ (2k / (ε(1 - γ))²) ln(2k/δ).

(Actually, the analysis can be tightened so that the sample complexity is only O(ln k) instead of O(k ln k) [Abbeel, personal communication].)

We omit the details of the algorithm due to space constraints, but note that each iteration involves only two steps that are computationally expensive:

1. Find an optimal policy with respect to a given reward function.
2. Compute the feature expectations of a given policy.

The algorithm we present in Section 5 performs these same expensive tasks in each iteration, but requires far fewer iterations -- just O(ln k) rather than O(k ln k), a tremendous savings when the number of features k is large. Also, the projection algorithm has a post-processing step that requires invoking a quadratic program (QP) solver. Comparatively, the post-processing step for our algorithm is trivial.

Abbeel and Ng [1] provide several refinements of the analysis in Theorem 1. In particular, suppose that each sample trajectory has length H ≥ (1/(1 - γ)) ln(1/(ε_H(1 - γ))), and that an ε_P-optimal policy is found in each iteration of the projection algorithm (see Step 1 above). Also let ε_R = min_{w∈B_k} max_s |R*(s) - w · φ(s)| be the "representation error" of the features. Abbeel and Ng [1] comment at various points in their paper that ε_H, ε_P, and O(ε_R) should be added to the error bound of Theorem 1. In Section 5 we provide a unified analysis of these error terms in the context of our algorithm, and also incorporate an ε_F term that accounts for computing an ε_F-good feature expectations estimate in Step 2 above. We prove that our algorithm is sensitive to these error terms in a similar way as the projection algorithm.

4 Apprenticeship Learning via Game Playing

Notice the two-sided bound in Theorem 1: the theorem guarantees that the apprentice will do almost as well as the expert, but also almost as badly. This is because the value of a policy is a linear combination of its feature expectations, and the goal of the projection algorithm is to match the expert's feature expectations.

We will take a different approach. We assume that φ(s) ∈ [-1, 1]^k, and that R*(s) = w* · φ(s) for some w* ∈ S_k, where S_k = {w ∈ R^k : ‖w‖_1 = 1 and w ⪰ 0}. (We use ⪰ to denote componentwise inequality; likewise, we use ≻ to denote strict inequality in every component.) The impact of this minor change in the domains of w* and φ is discussed further in Section 5.2. Let Ψ be the set of all mixed policies in M. Now consider the optimization

    v* = max_{ψ∈Ψ} min_{w∈S_k} [w · µ(ψ) - w · µ_E].    (3)

Our goal will be to find (actually, to approximate) the mixed policy ψ* that achieves v*. Since V(ψ) = w* · µ(ψ) for all ψ, we have that ψ* is the policy in Ψ that maximizes V(ψ) - V(π_E) with respect to the worst-case possibility for w*. Since w* is unknown, maximizing for the worst case is appropriate.

We begin by noting that, because w and ψ are both distributions, Eq. (3) is in the form of a two-person zero-sum game. Indeed, this is the motivation for redefining the domain of w* as we did. The quantity v* is typically called the game value.
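To make the inner minimization in Eq. (3) concrete, note that S_k is the probability simplex, so a linear function of w is minimized at one of its vertices; the payoff of a candidate ψ is therefore just its single worst per-feature advantage over the expert. The observation and the sketch below are our own illustration, with hypothetical names:

```python
import numpy as np

def worst_case_advantage(mu_psi, mu_E):
    """min over w in the simplex S_k of w . (mu(psi) - mu_E); attained at a
    vertex of the simplex, i.e. at the single worst feature."""
    return np.min(mu_psi - mu_E)
```

Intuitively, this is why an adversarially chosen reward concentrates all of its weight on the feature where ψ gains the least over the expert.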
In this game, the "min player" specifies a reward Actually, the analysis can be tightened so that the sample complexity is only O(ln k) instead of O(k ln k). [Abbeel, personal communication] 4 We use to denote componentwise inequality. Likewise, we use to denote strict inequality in every component. 3 4 function by choosing w, and the "max player" chooses a mixed policy . The goal of the min player is to cause the max player's policy to perform as poorly as possible relative to the expert, and the max player's goal is just the opposite. A game is defined by its associated game matrix. In our case, the game matrix is the k × || matrix G(i, j ) = µj (i) - µE (i) where µ(i) is the ith component of µ and we have let µj = µ( j ) be the vector of feature expectations for the j th deterministic policy j . Now Eq. (3) can be rewritten in the form v = max min wT G . wSk (4) In Eq. (3) and (4), the max player plays first, suggesting that the min player has an advantage. However, the well-known minmax theorem of von Neumann says that we can swap the min and max operators in Eq. (4) without affecting the game value. In other words, v = max min wT G = min max wT G . wSk wSk (5) Finding will not be useful unless we can establish that v 0, i.e. that will do at least as well as the expert's policy with respect to the worst-case possibility for w . This fact is not immediately clear, since we are restricting ourselves to mixtures of deterministic policies, while we do not assume that the expert's policy is deterministic. However, note that in the rightmost expression in Eq. (5), the maximization over is done after w -- and hence the reward function -- has been fixed. So the maximum is achieved by the best policy in with respect to this fixed reward function. Note that if this is also an optimal policy, then v will be nonnegative. It is well-known that in any MDP there always exists a deterministic optimal policy. Hence v 0. In fact, we may have v > 0. Suppose it happens that µ( ) µ(E ). Then will dominate E , i.e. will have higher value than E regardless of the actual value of w , because we assumed that w 0. Essentially, by assuming that each component of the true weight vector is nonnegative, we are assuming that we have correctly specified the "sign" of each feature. This means that, other things being equal, a larger value for each feature implies a larger reward. So when v > 0, the mixed policy to some extent ignores the expert, and instead exploits prior knowledge about the true reward function encoded by the features. To state things simply, under our approach, if it is possible to acheive much higher average values than the expert for all the features, without needing to trade-off among them, then v will be much larger than zero. However, if one cannot achieve much higher average values than the expert for all the features simultaneously, and instead one is forced to optimize for some of them at the expense of others, then v will be close to zero. In the former case, can be quite different from the expert's policy, while in the latter case will tend to mimic the expert. In all cases, is guaranteed to do no worse than the expert, since v 0. We present experimental results that explore this aspect of our approach in Section 7. 5 The Multiplicative Weights for Apprenticeship Learning (MWAL) Algorithm In the previous section, we motivated the goal of finding the mixed policy that maximizes Eq. (3) (or equivalently, maximizes Eq. (4)). 
5 The Multiplicative Weights for Apprenticeship Learning (MWAL) Algorithm

In the previous section, we motivated the goal of finding the mixed policy ψ* that maximizes Eq. (3) (or equivalently, maximizes Eq. (4)). In this section we present an efficient algorithm for solving this optimization problem.

Recall the game formulated in the previous section. In the terminology of game theory, w and ψ are called strategies for the min and max player respectively, and ψ* is called an optimal strategy for the max player. Also, a strategy w is called pure if w(i) = 1 for some i. Typically, one finds an optimal strategy for a two-player zero-sum game by solving a linear program. However, the complexity of that approach scales with the size of the game matrix. In our case, the game matrix G is huge, since it has as many columns as there are deterministic policies in the MDP\R.

Freund and Schapire [2] described a multiplicative weights algorithm for finding approximately optimal strategies in games with large or even unknown game matrices. To apply their algorithm to a game matrix G, it suffices to be able to efficiently perform the following two steps:

1. Given a min player strategy w, find arg max_{ψ∈Ψ} w^T G ψ.
2. Given a max player strategy ψ, compute w^T G ψ for each pure strategy w.

Observe that these two steps are equivalent to the two steps of the projection algorithm from Section 3. Step 1 amounts to finding the optimal policy in a standard MDP with a known reward function. There is a huge array of techniques available for this, such as value iteration and policy iteration. Step 2 is the same as computing the feature expectations of a given policy. These can be computed exactly by solving k systems of linear equations, or they can be approximated using iterative techniques. Importantly, the complexity of both steps scales with the size of the MDP\R, and not with the size of the game matrix G.

Our Multiplicative Weights for Apprenticeship Learning (MWAL) algorithm is described below. Lines 7 and 8 of the algorithm correspond to Steps 1 and 2 directly above. The algorithm is essentially the MW algorithm of Freund and Schapire [2], applied to a game matrix very similar to G. The differences are that the new game matrix is defined in terms of µ̂_E instead of µ_E, and is also shifted and scaled (to simplify the algorithm's proof of correctness). We have also slightly extended the results from Freund and Schapire [2] to allow the MWAL algorithm, in lines 7 and 8, to estimate the optimal policy and its feature expectations, rather than requiring that they be computed exactly.

Algorithm 1 The MWAL algorithm
 1: Given: An MDP\R M and an estimate µ̂_E of the expert's feature expectations.
 2: Let β = (1 + √(2 ln k / T))^{-1}.
 3: Define G̃(i, µ) ≜ ((1 - γ)(µ(i) - µ̂_E(i)) + 2) / 4, where µ ∈ R^k.
 4: Initialize W^(1)(i) = 1 for i = 1, ..., k.
 5: for t = 1, ..., T do
 6:   Set w^(t)(i) = W^(t)(i) / Σ_i W^(t)(i), for i = 1, ..., k.
 7:   Compute an ε_P-optimal policy π^(t) for M with respect to the reward function R(s) = w^(t) · φ(s).
 8:   Compute an ε_F-good estimate µ̂^(t) of µ^(t) = µ(π^(t)).
 9:   Set W^(t+1)(i) = W^(t)(i) · exp(ln(β) · G̃(i, µ̂^(t))) for i = 1, ..., k.
10: end for
11: Post-processing: Return the mixed policy ψ̄ that assigns probability 1/T to π^(t), for all t ∈ {1, ..., T}.
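For concreteness, the loop above is short when the two expensive steps are treated as black boxes. The following is a minimal Python sketch of ours, not the authors' implementation; solve_mdp and feature_expectations are hypothetical placeholders for whatever (approximate) MDP solver and feature-expectation estimator are used in lines 7 and 8.

```python
import numpy as np

def mwal(solve_mdp, feature_expectations, mu_hat_E, k, gamma, T):
    """Sketch of Algorithm 1 (MWAL).

    solve_mdp(w)              -> (approximately) optimal policy for R(s) = w . phi(s)
    feature_expectations(pol) -> (approximate) length-k feature expectations of pol
    mu_hat_E                  -> estimate of the expert's feature expectations
    """
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(k) / T))              # line 2
    W = np.ones(k)                                                  # line 4
    policies = []
    for t in range(T):
        w = W / W.sum()                                             # line 6
        policy = solve_mdp(w)                                       # line 7
        mu = feature_expectations(policy)                           # line 8
        G_tilde = ((1.0 - gamma) * (mu - mu_hat_E) + 2.0) / 4.0     # line 3
        W = W * np.exp(np.log(beta) * G_tilde)                      # line 9
        policies.append(policy)
    return policies  # line 11: mix uniformly, probability 1/T each
```

Executing the returned mixed policy means sampling one of the T stored policies uniformly at random at time 0 and following it thereafter, exactly as in line 11.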
Theorem 2 below provides a performance guarantee for the mixed policy returned by the MWAL algorithm, relative to the performance of the expert and the game value v*. Its correctness is largely based on the main result in Freund and Schapire [2]. A proof is available in the supplement [5].

Theorem 2. Given an MDP\R M, and m independent trajectories from an expert's policy π_E. Suppose we execute the MWAL algorithm for T iterations. Let ψ̄ be the mixed policy returned by the algorithm. Let ε_P and ε_F be the approximation errors from lines 7 and 8 of the algorithm. Let H ≥ (1/(1 - γ)) ln(1/(ε_H(1 - γ))) be the length of each sample trajectory. Let ε_R = min_{w∈S_k} max_s |R*(s) - w · φ(s)| be the representation error of the features. Let v* = max_{ψ∈Ψ} min_{w∈S_k} [w · µ(ψ) - w · µ_E] be the game value. Then in order for

    V(ψ̄) ≥ V(π_E) + v* - ε    (6)

to hold with probability at least 1 - δ, it suffices that

    T ≥ (9 ln k) / (ε′(1 - γ))²    (7)

and

    m ≥ (1 / (2(ε′(1 - γ))²)) ln(2k/δ),    (8)

where

    ε′ ≜ (ε - (ε_F + ε_P + 2ε_H + (2ε_R)/(1 - γ))) / (3 - γ).    (9)

Note the differences between Theorem 1 and Theorem 2. Because v* ≥ 0, the guarantee of the MWAL algorithm in (6) is at least as strong as the guarantee of the projection algorithm in (2), and has the further benefit of being one-sided. Additionally, the iteration complexity of the MWAL algorithm is much lower. This not only implies a faster run time, but also implies that the mixed policy output by the MWAL algorithm consists of fewer stationary policies. And if a purely stationary policy is desired, it is not hard to show that the guarantee in (6) must hold for at least one of the stationary policies in the mixed policy (this is also true of the projection algorithm [1]). The sample complexity in Theorem 2 is also lower, but we believe that this portion of our analysis applies to the projection algorithm as well [Abbeel, personal communication], so the MWAL algorithm does not represent an improvement in this respect.

5.1 When no expert is available

Our game-playing approach can be very naturally and easily extended to the case where we do not have data from an expert. Instead of finding a policy that maximizes Eq. (3), we find a policy that maximizes

    max_{ψ∈Ψ} min_{w∈S_k} [w · µ(ψ)].    (11)

Here ψ* is the best policy for the worst-case possibility for w*. The MWAL algorithm can be trivially adapted to find this policy just by setting µ̂_E = 0 (compare (11) to (3)). The following corollary follows straightforwardly from the proof of Theorem 2.

Corollary 1. Given an MDP\R M. Suppose we execute the "no expert" version of the MWAL algorithm for T iterations. Let ψ̄ be the mixed policy returned by the algorithm. Let ε_F, ε_P, ε_R be defined as in Theorem 2. Let v* = max_{ψ∈Ψ} min_{w∈S_k} [w · µ(ψ)]. Then

    V(ψ̄) ≥ v* - ε    (12)

if

    T ≥ (9 ln k) / (ε′(1 - γ))²,    (13)

where

    ε′ ≜ (ε - (ε_F + ε_P + (2ε_R)/(1 - γ))) / (3 - γ).    (14)

5.2 Representation error

Although the MWAL algorithm makes different assumptions about the domains of w* and φ than the projection algorithm, these differences are not of any significant consequence. The same class of reward functions can be expressed under either set of assumptions by roughly doubling the number of features. Concretely, consider a feature function φ that satisfies the assumptions of the projection algorithm. Then for each s, if φ(s) = (f_1, ..., f_k), define φ′(s) = (f_1, ..., f_k, -f_1, ..., -f_k, 0). Observe that φ′ satisfies the assumptions of the MWAL algorithm, and that

    min_{w∈B_k} max_s |R*(s) - w · φ(s)| ≥ min_{w∈S_{2k+1}} max_s |R*(s) - w · φ′(s)|.

So by only doubling the number of features, we can ensure that the representation error ε_R does not increase. Notably, employing this reduction forces the game value v* to be zero, ensuring that the MWAL algorithm, like the projection algorithm, will mimic the expert. This observation provides us with some useful guidance for selecting features for the MWAL algorithm: both the original and negated version of a feature should be used if we are uncertain how that feature is correlated with reward.
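A minimal sketch of this reduction (ours; the helper name is hypothetical), assuming phi_s is a 1-D NumPy array:

```python
import numpy as np

def negate_and_pad(phi_s):
    """Map a 1-D feature array (f_1, ..., f_k) satisfying the projection
    algorithm's assumptions to (f_1, ..., f_k, -f_1, ..., -f_k, 0), which
    satisfies the MWAL algorithm's assumptions."""
    return np.concatenate([phi_s, -phi_s, [0.0]])
```

Any w ∈ B_k is then matched by a w′ ∈ S_{2k+1} that places the positive parts of w on the first k coordinates, the magnitudes of its negative parts on the next k, and the leftover mass 1 - ‖w‖_1 on the final dummy coordinate, so that w′ · φ′(s) = w · φ(s); this is one way to see the displayed inequality.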
6 When the transition function is unknown

In the previous sections, as well as in Abbeel and Ng [1], it was assumed that the transition function θ(s, a, ·) was known. In this section we sketch how to remove this assumption. Our approach to applying the MWAL algorithm to this setting can be informally described as follows. Let M = (S, A, γ, θ, φ) be the true MDP\R, for which we are missing θ. Consider the MLE estimate θ̂ of θ that is formed from the expert's sample trajectories. Let Z ⊆ S × A be the set of state-action pairs that are visited "most frequently" by the expert. Then, after observing enough trajectories, θ̂ will be an accurate estimate of θ on Z. We form a pessimistic estimate M_Z of M by using θ̂ to model the transitions in Z, and routing all other transitions to a special "dead state." Following Kearns and Singh [3], who used a very similar idea in their analysis of the E^3 algorithm, we call M_Z the induced MDP\R on Z.

By a straightforward application of several technical lemmas due to Kearns and Singh [3] and Abbeel and Ng [4], it is possible to show that if the number of expert trajectories m is sufficiently large (polynomial in |S|, |A|, 1/ε and ln(1/δ)), and we let Z be the set of state-action pairs visited sufficiently often by the expert (a threshold that is likewise polynomial in these quantities), then using M_Z in place of M in the MWAL algorithm will add only O(ε) to the error bound in Theorem 2. More details are available in the supplement [5], including a precise procedure for constructing M_Z.

7 Experiments

We compare the performance of the MWAL algorithm to the projection algorithm in a car driving simulator, an experimental setup similar to that of Abbeel and Ng [1]. More details of the experiments discussed below, including videos, are available in the supplement [5].

In our simulator, the apprentice must navigate a car through randomly-generated traffic on a three-lane highway. We define three features for this environment: a collision feature (0 if in contact with another car, and 1/2 otherwise), an off-road feature (0 if on the grass, and 1/2 otherwise), and a speed feature (1/2, 3/4 and 1 for each of the three possible speeds, with higher values corresponding to higher speeds). Note that the features encode that, other things being equal, speed is good, and collisions and off-roads are bad.

Table 1: Driving policies learned from a "fast" expert, from a "bad" expert, and with no expert.

                      Speed     Collisions (per sec)   Off-roads (per sec)
    Fast Expert
      Expert          High      1.1                    0
      MWAL            High      0.5                    0
      Proj.           High      1.1                    0
    Bad Expert
      Expert          Slow      2.23                   8.0
      MWAL            Medium    0                      0
      Proj.           Slow      2.23                   8.0
    No Expert
      MWAL            Medium    0                      0

Table 1 displays the results of using the MWAL and projection algorithms to learn a driving policy by observing two kinds of experts: a "fast" expert (drives at the highest speed; indifferent to collisions), and a "bad" expert (drives at the slowest speed; tries to hit cars and go off-road). In both cases, the MWAL algorithm leverages information encoded in the features to produce a policy that is significantly better than the expert's policy. We also applied the MWAL algorithm to the "no expert" setting (see Section 5.1). In that case, it produced a policy that drives as fast as possible without risking any collisions or off-roads. Given our features, this is indeed the best policy for the worst-case choice of reward function.

Acknowledgments

We thank Pieter Abbeel for his helpful explanatory comments regarding his proofs. We also thank the anonymous reviewers for their suggestions for additional experiments and other improvements. This work was supported by the NSF under grant IIS-0325500.
References

[1] P. Abbeel and A. Ng (2004). Apprenticeship Learning via Inverse Reinforcement Learning. In ICML 21.

[2] Y. Freund and R. E. Schapire (1996). Game Theory, On-line Prediction and Boosting. In COLT 9.

[3] M. Kearns and S. Singh (2002). Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning, 49:209-232.

[4] P. Abbeel and A. Ng (2005). Exploration and Apprenticeship Learning in Reinforcement Learning. In ICML 22. (Long version available at http://www.cs.stanford.edu/~pabbeel/)

[5] U. Syed and R. E. Schapire (2007). A Game-Theoretic Approach to Apprenticeship Learning -- Supplement. http://www.cs.princeton.edu/~usyed/nips2007/