Learning All Optimal Policies with Multiple Criteria

Leon Barrett  (barrett@icsi.berkeley.edu)
Srini Narayanan  (snarayan@icsi.berkeley.edu)
1947 Center St. Ste. 600, Berkeley, CA 94704

Abstract

We describe an algorithm for learning in the presence of multiple criteria. Our technique generalizes previous approaches in that it can learn optimal policies for all linear preference assignments over the multiple reward criteria at once. The algorithm can be viewed as an extension of standard reinforcement learning for MDPs in which, instead of repeatedly backing up maximal expected rewards, we back up the set of expected rewards that are maximal for some set of linear preferences (given by a weight vector $\vec{w}$). We present the algorithm along with a proof of correctness showing that our solution gives the optimal policy for any linear preference function. The solution reduces to the standard value iteration algorithm for a specific weight vector $\vec{w}$.

1. Introduction

In reinforcement learning (RL), an agent interacts with the environment to learn optimal behavior (Sutton & Barto, 1998). Most RL techniques are based on a scalar reward, i.e., they aim to optimize an objective that is expressed as a function of a scalar reinforcement. A natural extension to traditional RL techniques is thus the case where there are multiple rewards. In many realistic domains, actions depend on satisfying multiple objectives simultaneously (such as achieving performance while keeping costs low, a robot moving efficiently toward a goal while staying close to a recharging station, or a government funding both military and social programs). Learning optimal policies in many real-world domains thus depends on the ability to learn in the presence of multiple rewards. However, the resulting policies depend heavily on the preferences over these rewards, and they may change swiftly as preferences vary.

We present both an algorithm for the general case of learning all optimal policies under all assignments of linear priorities for the reward components, and a proof showing the correctness of our algorithm. We start with a motivating example of a simple task with multiple rewards in Section 2. The paper then proceeds to the main algorithm in Section 3. We address related work in Section 4, and then Section 5 discusses the complexity of our algorithm, including realistic and tractable specializations. Section 6 describes the application of this algorithm to an example domain, and Section 7 discusses extensions to this technique, such as implementations using other RL methods (for example, temporal difference methods) and applications of our algorithm to infer another agent's preferences from observations of its behavior. Section 8 outlines the proof of the algorithm's correctness.

2. Explanation and Motivating Example

We assume that instead of getting a single reward signal, the agent gets a reward divided up into several components, a reward vector. That is, we decompose the reward signal $r(s, a)$ (where $s$ is a state and $a$ is an action) into a vector $\vec{r}(s, a) = [r_1(s, a), r_2(s, a), \ldots, r_n(s, a)]$. An agent could potentially optimize many different functions of these rewards, but the simplest function is a weighted sum: for every fixed weight vector $\vec{w}$ we obtain a total reward scalar $r_{\vec{w}}(s, a) = \vec{w} \cdot \vec{r}(s, a)$.
There is thus an optimal policy for each weight vector $\vec{w}$.

Consider, for example, a lab guinea pig running a familiar maze, shown in Figure 1. The guinea pig runs through the maze to one of four stashes of food. Once it has reached a stash and eaten the food, the experimenter takes it out of the maze and returns it to its cage, so it can only hope to eat one of the stashes per run of the maze. Assume that there are only 2 types of food provided (hay and carrot), so reward vectors take the form [hay, carrot]. Location 1 contains hay ($\vec{r} = [1, 0]$), location 2 contains carrot ($\vec{r} = [0, 1]$), and locations 3 and 4 contain a little of both ($\vec{r} = [0.6, 0.6]$ and $[0.7, 0.4]$, respectively). Because the maze is familiar, the animal knows where the food is placed and what sort of food is in each location.

Figure 1. An example maze with rewards, split into 2 components, at 4 different locations.

Figure 2. The potential reward vectors in the guinea pig example.

The experimenter has several different guinea pigs and has discovered that each has different tastes. For instance, Chester likes only hay ($\vec{w} = [1, 0]$), and Milo likes only carrot ($\vec{w} = [0, 1]$), but greedy Louis likes both equally ($\vec{w} = [0.5, 0.5]$). (Without loss of generality, assume that all animals' weight vectors satisfy $\sum_i w_i = 1$: they describe relative preferences, not absolute utilities.) So, if Chester goes to location 4 ($\vec{r} = [0.7, 0.4]$), then he gets reward $r = \vec{w} \cdot \vec{r} = 0.7$. Milo would get 0.4, and Louis would get a reward of 0.55.

Looking at the maze, we see that although there are 4 possible strategies (with rewards shown in Figure 2), only 3 of them are optimal for any value of $\vec{w}$. One strategy occurs when the weight vector has $w_0 > 0.6$ (and hence $w_1 = 1 - w_0 < 0.4$): then the guinea pig should head straight for location 1, because the reward elsewhere will be no more than 0.6. By the exact same logic, when the weight vector has $w_1 > 0.6$ (and $w_0 < 0.4$), the animal should go to location 2. In all other cases ($0.4 \le w_0 \le 0.6$), it will optimize its reward by going to location 3. Under no circumstances would an optimal agent go to location 4! No matter what its weight vector, some other location dominates location 4. We would like to determine exactly this: which policies are viable and which are not (even without knowing $\vec{w}$). Our method learns the set of optimal policies for all $\vec{w}$ at the same time. Once the agent has learned all these policies, it can change reward weights at runtime to get a new optimal behavior, without having to do any relearning.
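To make the arithmetic above concrete, here is a small worked example in Python (our own illustrative script, not code from the paper): it scalarizes each stash's reward vector for the three guinea pigs and then sweeps the weight vector to confirm that location 4 is never the best choice.

    # Stash rewards are [hay, carrot] vectors; weights are preference vectors.
    stashes = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.6, 0.6], 4: [0.7, 0.4]}
    pigs = {"Chester": [1.0, 0.0], "Milo": [0.0, 1.0], "Louis": [0.5, 0.5]}

    def scalar_reward(w, r):
        """Weighted total reward w . r for one stash."""
        return sum(wi * ri for wi, ri in zip(w, r))

    for name, w in pigs.items():
        best = max(stashes, key=lambda loc: scalar_reward(w, stashes[loc]))
        print(name, "should go to location", best)

    # Sweep w0 over [0, 1] (with w1 = 1 - w0) and record which stash wins.
    winners = set()
    for i in range(101):
        w = [i / 100.0, 1.0 - i / 100.0]
        winners.add(max(stashes, key=lambda loc: scalar_reward(w, stashes[loc])))
    print("stashes optimal for some w:", sorted(winners))  # [1, 2, 3]; never 4

The sweep is just a brute-force stand-in for the convex hull construction developed below: location 4's reward vector lies inside the hull of the others, so no weighting ever selects it.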
For a fixed priority scheme (a fixed weight vector $\vec{w}$) over the multiple reward components, our algorithm results in the standard recurrence for Q-values, analogous to the equation for the average weighted reward case as in (Natarajan & Tadepalli, 2005):

$Q_{\vec{w}}(s, a) = E\left[\vec{w} \cdot \vec{r}(s, a) + \gamma \max_{a'} Q_{\vec{w}}(s', a') \mid s, a\right]$

In the general case, where we do not know the relative priorities over the reward components, our algorithm exploits the fact that the extrema of the set of Q-value vectors (those Q vectors that are maximal for some weight setting) are exactly the convex hull of the Q-value vectors. (The convex hull is defined as the smallest convex set that contains all of a set of points. In this case, we mean the points that lie on the boundary of this convex set, which are of course the extreme points, the ones that are maximal in some direction. This is somewhat similar to the Pareto curve, since both are maxima over trade-offs in linear domains.) Now we can rewrite the general RL recurrence in terms of operations on the convex hull of Q-values, and we show this recurrence to be correct and to converge to the value iteration algorithm in the fixed weight vector case. Many standard RL algorithms in the literature can be seen as limiting cases of our more general algorithm. While the worst-case complexity of our general algorithm is exponentially higher than that of the fixed-$\vec{w}$ case, it not only solves all the fixed-$\vec{w}$ cases but also determines which cases are worth solving. We also give some constraints and techniques that can help reduce the complexity.

3. Convex Hull Value Iteration

In this section, we introduce the problem definition in the context of a traditional MDP setting, along with our approach and algorithm.

3.1. Preliminaries and Notation

Our approach is based on an MDP, which is a tuple $(S, A, T, \gamma, \vec{r})$, where $S$ is a finite set of $N$ states, $A = \{a_1, \ldots, a_k\}$ is a set of $k$ actions, $T = \{P_{sa}(s')\}$ is the set of state transition probabilities ($P_{sa}(s')$ is the probability of reaching state $s' \in S$ by taking action $a \in A$ from state $s \in S$), $\gamma \in [0, 1)$ is the discount factor, and $\vec{r}: S \times A \to \mathbb{R}^d$ is the reward function giving the $d$-component reward vector $\vec{r}(s, a)$. This differs from the standard formulation only in that reward now comes as a vector. A policy $\pi$ is a map $\pi: S \to A$, and the value function for any policy $\pi$, evaluated at some state $s_i$, is the vector

$V^{\pi}(s_i) = E\left[\vec{r}(s_i, a_i) + \gamma\, \vec{r}(s_{i+1}, a_{i+1}) + \cdots \mid \pi\right]$   (1)

where the expectation is over the distribution of the state and reward sequence $(s_i, \vec{r}_i, s_{i+1}, \vec{r}_{i+1}, \ldots)$ obtained by executing the policy $\pi$ starting from the state $s_i$. The Q-function is the vector

$Q^{\pi}(s, a) = E_{\vec{r}(s,a),\, s' \sim P_{sa}}\left[\vec{r}(s, a) + \gamma\, V^{\pi}(s')\right]$   (2)

where $\vec{r}(s, a), s' \sim P_{sa}$ means that the expectation is taken with respect to $s'$ and $\vec{r}(s, a)$ distributed according to $P_{sa}$. The optimal Q-function for a weight vector $\vec{w}$ is $Q^{*}_{\vec{w}}(s, a) = \sup_{\pi} \vec{w} \cdot Q^{\pi}(s, a)$.

3.2. Approach: Convex Hulls

Given some $\vec{w}$, the resulting reward for taking an action is $r(s, a) = \vec{w} \cdot \vec{r}(s, a)$. This gives us the following recurrence for optimal Q-values, which is exactly equivalent to the equation for a single reward component:

$Q^{*}_{\vec{w}}(s, a) = E\left[\vec{w} \cdot \vec{r}(s, a) + \gamma \max_{a'} Q^{*}_{\vec{w}}(s', a') \mid s, a\right]$

An alternative view is that each possible policy $\pi$ gives a different expected reward $Q^{\pi}(s, a)$, and we simply want to select a policy by maximizing the dot product of this $Q^{\pi}$ with $\vec{w}$. For a fixed $\vec{w}$, only one such $Q^{\pi}(s, a)$ can be optimal, but in general we might care about any $Q^{\pi}$ that is maximal for some $\vec{w}$. But this set of Q-values that are extrema is exactly the convex hull of the Q-values! This allows us to use standard convex hull operations to pare down the set of points we consider and gives rise to the following proposition.

Proposition 1. The convex hull over Q-values contains the optimal policy over the average expected reward $r(s, a) = \vec{w} \cdot \vec{r}(s, a)$ for any $\vec{w}$.

To make this operational and derive an algorithm that maintains all optimal policies for any weight vector $\vec{w}$, we need a few definitions for relevant operations on the convex hull. We write $Q(s, a)$ to represent the vertices of the convex hull of possible Q-value vectors for taking action $a$ at state $s$. We then define the following operations on convex hulls, which will be used to construct our learning algorithm.

Definition 1 (Translation and scaling operations).
$\vec{u} + b\,Q \equiv \{\vec{u} + b\vec{q} : \vec{q} \in Q\}$   (3)

Definition 2 (Summing two convex hulls).
$Q + U \equiv \mathrm{hull}\{\vec{q} + \vec{u} : \vec{q} \in Q,\ \vec{u} \in U\}$   (4)

Definition 3 (Extracting the Q-value). To extract the best Q-value for a given $\vec{w}$, we perform a simple maximum:
$Q_{\vec{w}}(s, a) \equiv \max_{\vec{q} \in Q(s, a)} \vec{w} \cdot \vec{q}$   (5)
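The following minimal Python sketch shows one way Definitions 1-3 could be realized for the two-component (d = 2) case, representing each hull simply by the list of its vertex vectors. It is our illustration under those assumptions, not the authors' implementation; for d > 2 the monotone-chain helper would be replaced by a general convex hull routine (e.g., scipy.spatial.ConvexHull).

    def hull_vertices(points):
        """Vertices of the 2-D convex hull (Andrew's monotone chain)."""
        pts = sorted(set(map(tuple, points)))
        if len(pts) <= 2:
            return [list(p) for p in pts]
        def cross(o, a, b):
            return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
        lower, upper = [], []
        for p in pts:
            while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
                lower.pop()
            lower.append(p)
        for p in reversed(pts):
            while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
                upper.pop()
            upper.append(p)
        return [list(p) for p in lower[:-1] + upper[:-1]]

    def translate_scale(u, b, Q):
        """Definition 1: the set u + b*Q (vertices map to vertices)."""
        return [[ui + b * qi for ui, qi in zip(u, q)] for q in Q]

    def hull_sum(Q, U):
        """Definition 2: hull of all pairwise sums of points of Q and U."""
        return hull_vertices([[qi + ui for qi, ui in zip(q, u)] for q in Q for u in U])

    def best_scalar_q(Q, w):
        """Definition 3: the best scalarized Q-value, max over vertices of w . q."""
        return max(sum(wi * qi for wi, qi in zip(w, q)) for q in Q)

Only the vertices ever need to be stored, since translation and scaling map vertices to vertices, and Definitions 2 and 3 depend only on them.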
Given these definitions, we are now ready to illustrate the basic algorithm.

3.3. Convex Hull Value Iteration Algorithm

Our algorithm extends the single-$\vec{w}$ case (which is the standard expected discounted reward framework (Bellman, 1957)) into the following recurrence:

$Q(s, a) = E\left[\vec{r}(s, a) + \gamma\, \mathrm{hull} \bigcup_{a'} Q(s', a') \mid s, a\right]$   (6)

We can solve this recurrence directly, or we can use it to get converging approximations to the optimal value function; this gives rise to the value iteration method, Q-learning, and so on. That is, instead of repeatedly backing up maximal expected rewards, we back up the set of expected rewards that are maximal for some $\vec{w}$. While the expectation over hulls looks awkward, it is the natural equivalent of an expectation of maxima, and it arises for the same reason. We must take an expectation over $s'$, but once in $s'$, we can choose the best action, no matter what our $\vec{w}$ is. The expectation's computation can be broken down, in the usual way, into the scalings and sums we have already defined.

This leads us to define Algorithm 1, which extends the value iteration algorithm (Bellman, 1957) to learn optimal Q-values for all possible $\vec{w}$. A proof of its correctness is given in Section 8.

Algorithm 1  Value iteration algorithm, modified from that of Bellman (1957)
  Initialize $Q(s, a)$ arbitrarily for all $s, a$
  while not converged do
    for all $s \in S$, $a \in A$ do
      $Q(s, a) \leftarrow E\left[\vec{r}(s, a) + \gamma\, \mathrm{hull} \bigcup_{a'} Q(s', a') \mid s, a\right]$
    end for
  end while
  return $Q$
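As a sketch of how Algorithm 1 could be implemented on top of the d = 2 hull operations above, the code below performs one synchronous backup of the recurrence in Equation 6 and then extracts a greedy action for a weight vector chosen at run time (Definition 3). The mdp interface (states, actions, reward_dim, transitions(s, a) yielding (s', probability) pairs, and a vector-valued reward(s, a)) is assumed here for illustration and is not part of the paper.

    def chvi_sweep(Q, mdp, gamma):
        """One backup of convex hull value iteration (Equation 6):
        Q(s,a) <- r(s,a) + gamma * sum_{s'} P(s'|s,a) * hull(union_{a'} Q(s',a'))."""
        zero = [0.0] * mdp.reward_dim
        new_Q = {}
        for s in mdp.states:
            for a in mdp.actions:
                acc = [zero]                             # hull containing only the origin
                for s2, p in mdp.transitions(s, a):      # (successor, probability) pairs
                    succ = hull_vertices([q for a2 in mdp.actions for q in Q[(s2, a2)]])
                    acc = hull_sum(acc, translate_scale(zero, gamma * p, succ))
                # translate by the expected immediate reward vector r(s, a)
                new_Q[(s, a)] = translate_scale(mdp.reward(s, a), 1.0, acc)
        return new_Q

    def greedy_action(Q, mdp, s, w):
        """After convergence, a run-time weight vector w picks its optimal action
        by scalarizing each action's hull with Definition 3."""
        return max(mdp.actions, key=lambda a: best_scalar_q(Q[(s, a)], w))

Iterating chvi_sweep until the vertex sets stop changing (up to a tolerance) plays the role of the while loop in Algorithm 1.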
4. Related Work

There is now a body of work addressing multi-reward reinforcement learning. There have been algorithms that assume a fixed ordering between different rewards, such as staying alive and not losing food (Gabor et al., 1998), techniques based on formulating the multiple reward problem as optimizing a weighted sum of the discounted total rewards for multiple reward types (Feinberg & Schwartz, 1995), and techniques that decompose the reward function into multiple components which are learned independently (with a single policy) (Russell & Zimdars, 2003). In all these cases, the preference over rewards is assumed to be fixed and time-invariant. In a slightly more flexible formulation, Mannor and Shimkin (2004) take multiple reward components and perform learning that results in expected rewards lying in a particular region of reward space. More recently, Natarajan and Tadepalli (2005) formulate the multiple reward RL problem as we do, using a weighted expected discounted reward framework, and they store both the currently best policy and its Q-values as vectors. When priorities change dynamically (as reflected in changes in the weight vector), the agent can calculate new reward scalars from the vectors and thus start from the Q-values of the best policy learned so far rather than resetting entirely. As far as we are aware, none of the techniques proposed tackles the general case of learning optimal policies for all linear preference assignments over the multiple reward components.

4.1. Relation to POMDPs

Our problem, and hence its solution, is closely related to the standard partially observable Markov decision process (POMDP) formulation. In a POMDP, we have a model of both observed and unobserved variables and use Bayesian reasoning to infer a joint distribution over the hidden variables. Then, we must choose an optimal action based on both the observed state and the continuous beliefs (Kaelbling et al., 1998).

Figure 3. A POMDP formulation of multiple reward components.

Consider the POMDP shown in Figure 3; here, the reward depends on an unobserved multinomial random variable, so $E[r] = \sum_i P(w = i)\, r_i$. If we define $P(w_t \mid w_{t-1})$ to be the identity, the distribution of $w$ will not change with $t$. Then, the expected reward depends linearly on our prior distribution over $w$, and the dual of the usual POMDP maximum-hyperplane algorithm corresponds to a convex hull operation over reward components. It is thus possible to write our multiple-reward problem as a POMDP problem. This suggests a natural route to extend our algorithm to operate on POMDPs. It remains future work, however, to see whether the approximation algorithms used for solving POMDPs can yield useful results in our domain.

5. Complexity

This algorithm relies on four convex hull operations, whose complexity we will analyze in terms of the number of points on the hull, $n$; in the limit, this number converges to the number of optimal policies in the environment. We must both scale (by probabilities and discounts) and translate (by rewards) our convex hulls; these operations only require touching every point once, resulting in a complexity of $O(n)$. We must also merge two or more convex hulls. This takes time at most $O((kn)^{\lfloor d/2 \rfloor})$ if $d > 3$, where $k$ is the number of hulls involved, $n$ is the number of points in each hull, and $d$ is the dimension (number of reward components) (Clarkson & Shor, 1989). Finally, we must add two convex hulls. If done naively by adding all pairs of points and taking a hull, this takes time at most $O(n^{2\lfloor d/2 \rfloor})$. All these operations must be performed whenever we back up Q-values, so we multiply the complexity of ordinary reinforcement learning by $O(n^{2\lfloor d/2 \rfloor})$. (However, in the $d = 2$ and $d = 3$ cases, there are efficient ways to perform these operations.)

In the long run, the number of points on each convex hull, $n$, must converge to a limit as the Q-values converge to their optimal values. Eventually, there will be exactly one point on each convex hull for each optimal policy. However, in the short term, the number of short-range policies we might have to track could be much lower or even higher. Also, the number of optimal policies $n$ depends on the environment in a complicated way, with the worst case being that all policies ($|A|^{|S|}$ of them) may be optimal for some weight vector.
5.1. Reducing the Complexity

The cost of our algorithmic modifications is an exponential blowup with the number of reward components. There are a few main ways of tackling this. The first is to simply restrict the number of reward components; with only, say, 5 or fewer, this additional computation is likely not to be an undue burden. In practice, there are currently very few problems studied with more reward components than this.

When we must handle a high-dimensional problem, we can reduce the complexity by applying constraints on the weight vectors that we might optimize for. Given the geometric nature of our approach, if we have knowledge about the directions of allowable weight vectors, such as $\vec{w} \cdot \vec{a} > 0$, then we can simply take a partial convex hull. Each such constraint will, on average, reduce the complexity of the convex hull computation by half. So, if we know that all $d$ elements of $\vec{w}$ must be positive, then we can write that as $d$ such constraints to divide the convex hull complexity by $2^d$.

In addition, the convergence of Q-values means that we are essentially performing the same convex hull operations again and again; this means that we might be able to reuse information from the last iteration. The idea is to annotate each point with a "witness", or proof of its status: if a point is not on the convex hull, then we note down a set of faces that enclose it, and if it is on the hull, we note down a direction in which it is the extremum. Then, on the next iteration, when these points have moved slightly and we must compute a convex hull again, we can simply check these proofs (in at most $O(n^2)$ time). If all the proofs are correct, then our convex hull remains correct and the locations of the points have moved only slightly. On the other hand, if any proof is violated, we can simply rebuild the convex hull in the ordinary, expensive way. In the limit as the Q-values and policy converge, the policy must stop changing, so this trick may greatly reduce the complexity of refining Q-values.

6. Example Application: Resource Gathering

Figure 4. A resource-collection domain.

In order to demonstrate the application of this method, we have tested it on a resource-collecting problem similar to that of many strategy games. We model this as a resource-collecting agent moving (in the 4 cardinal directions) around the grid environment shown in Figure 4, starting from the home base, labelled H. Its goal is to gather resources and take them back to the home base. If it reaches location R1, it picks up resource 1, and at R2 it gets resource 2; it can carry both at the same time. When the agent returns to H, it receives a reward for each resource it brings back. Also, if it steps on one of the two enemy spaces, labeled E1 and E2, with a 10% probability it will be attacked, receiving a penalty and resetting to the home space, losing all it carries. Its reward space is then [enemy, resource1, resource2], so it can get a penalty of [-1, 0, 0] for being attacked, or a reward of [0, 1, 0], [0, 0, 1], or [0, 1, 1] for bringing back one or both resources. We use a discounting rate of $\gamma = 0.9$.
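For concreteness, the snippet below (our own illustrative encoding; the paper does not give an implementation) writes out the three-component reward vectors of this domain and scalarizes them for one example preference weighting.

    # Reward components are ordered [enemy, resource1, resource2].
    ATTACK_PENALTY = [-1.0, 0.0, 0.0]       # suffered with probability 0.1 on E1 or E2

    def delivery_reward(has_r1, has_r2):
        """Vector reward for stepping onto H while carrying the given resources."""
        return [0.0, 1.0 if has_r1 else 0.0, 1.0 if has_r2 else 0.0]

    def scalarize(w, r):
        """Scalar reward for an agent with preference weights w = [E, R1, R2]."""
        return sum(wi * ri for wi, ri in zip(w, r))

    w = [1.0 / 3, 1.0 / 3, 1.0 / 3]                    # weights safety and resources equally
    print(scalarize(w, delivery_reward(True, True)))   # 2/3 for returning with both resources
    print(scalarize(w, ATTACK_PENALTY))                # -1/3 for being attacked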
Depending on the relative values of the resources and attack, the agent may find different policies to be valuable. The convex hull of values starting at H, $V(H)$, is shown in Figure 5. The points on the hull correspond to optimal policies, described in Table 1; each policy is valid for some range of preferences $\vec{w}$, shown in Figure 6. (We do not show the ranges of policies that are optimal where the values of the rewards are less than 0, i.e., $w_i < 0$; these policies, while sometimes interesting, are not valuable for the task.)

Figure 5. Optimal rewards in the resource-collection domain.

Figure 6. Regions of preference space in which policies are optimal. Axes are reward components R1 and R2; the enemy weight is E = 1 - R1 - R2.

  #  Policy
  1  Go directly to R2, dodging Es
  2  Go to both Rs, through both Es
  3  Go to R1, through E1 both ways
  4  Go to both Rs, dodging E1 but through E2
  5  Go to R1, dodging all Es
  6  Go to R1, going through E1 only once
Table 1. The optimal policies for the example domain.

7. Extensions and Current Work

This same convex-hull technique can be used with other RL algorithms, such as temporal difference learning. The critical thing to recall is that because we are learning more than one policy at once, we can use only off-policy learning algorithms.

Our solution can also be used for inferring the preference function from observation data. This is closely related to the inverse reinforcement learning problem (Ng & Russell, 2000; Abbeel & Ng, 2004). The basic idea behind inverse reinforcement learning is to use observed behavior to infer weights from a user that can then be used to find optimal policies. In our case, the method for learning all policies at once can also be used in reverse, to learn the range of reward weights that an agent must have. If we assume that an agent we observe is rational and uses a policy that is optimal for its reward weights, then we can use our observations of the agent to infer its reward weights. We simply repeatedly observe its choice of action $a$ and use our knowledge of $Q(s, a)$ to identify which values of $\vec{w}$ are consistent with that action. Then, we take the intersection of the constraints, as sketched below.
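A minimal sketch of that intersection step, assuming the hull vertices Q[(s, a)] and the best_scalar_q helper from the earlier sketches plus a candidate set of weight vectors (e.g., a grid over the preference simplex); this is our illustration, not the authors' code.

    def consistent(w, Q, actions, s, a_obs, tol=1e-9):
        """True if weight vector w makes the observed action a_obs (weakly) optimal
        in state s, judged by scalarizing each action's hull (Definition 3)."""
        v_obs = best_scalar_q(Q[(s, a_obs)], w)
        return all(v_obs >= best_scalar_q(Q[(s, a)], w) - tol for a in actions)

    def filter_weights(candidates, Q, actions, observations):
        """Intersect the constraints from (state, action) observations, keeping only
        the candidate weight vectors consistent with every observed choice."""
        return [w for w in candidates
                if all(consistent(w, Q, actions, s, a) for s, a in observations)]

Each observation removes the weight vectors for which some other action would have scored strictly higher, so the surviving set shrinks toward the agent's true preference region.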
The multi-criterion RL approach also allows us to examine reward at different time scales. Instead of having a single discounting factor $\gamma$, we could have a discounting factor $\gamma_i$ for each component. This allows us to use a sum of exponentials with different time constants to approximate non-exponential discounting rates, which are helpful in explaining the preferences of humans (Ainslie, 2001). With our convex hull method, we can find which policies are optimal for a whole range of discounting rates.

8. Appendix: Proof of Correctness

We prove that Algorithm 1 gives the optimal policy by reducing its recurrence to the standard value iteration recurrence for any $\vec{w}$. First, recall the basic recurrence of our algorithm, Equation 6:

$Q(s, a) \leftarrow E\left[\vec{r}(s, a) + \gamma\, \mathrm{hull} \bigcup_{a'} Q(s', a') \mid s, a\right]$

Now apply Equation 5 to both sides (to extract the optimal value for $\vec{w}$):

$Q_{\vec{w}}(s, a) \leftarrow \max\left\{\vec{w} \cdot \vec{q} : \vec{q} \in E\left[\vec{r}(s, a) + \gamma\, \mathrm{hull} \bigcup_{a'} Q(s', a') \mid s, a\right]\right\}$

Next, apply the definition of an expectation,

$\leftarrow \max\left\{\vec{w} \cdot \vec{q} : \vec{q} \in \sum_{s', \vec{r}(s,a)} P(s', \vec{r}(s, a) \mid s, a)\left[\vec{r}(s, a) + \gamma\, \mathrm{hull} \bigcup_{a'} Q(s', a')\right]\right\}$

then use Equations 3 and 4 and rewrite:

$\leftarrow \max\left\{\vec{w} \cdot \vec{q} : \vec{q} \in \mathrm{hull}\left\{\sum_i P(s'_i, \vec{r}(s, a) \mid s, a)\left[\vec{r}(s, a) + \gamma\, \vec{q}_{s'_i}\right] : \vec{q}_{s'_1} \in \mathrm{hull} \bigcup_{a'} Q(s'_1, a'), \ldots\right\}\right\}$

A linear function is maximized over a convex hull at one of the points generating the hull, so we can drop the outer hull and move $\vec{w}$ inside:

$\leftarrow \max\left\{\sum_i P(s'_i, \vec{r}(s, a) \mid s, a)\, \vec{w} \cdot \left[\vec{r}(s, a) + \gamma\, \vec{q}_{s'_i}\right] : \vec{q}_{s'_1} \in \mathrm{hull} \bigcup_{a'} Q(s'_1, a'), \ldots\right\}$

$\leftarrow \max\left\{E[\vec{w} \cdot \vec{r}(s, a) \mid s, a] + \sum_i P(s'_i \mid s, a)\, \gamma\, \vec{w} \cdot \vec{q}_{s'_i} : \vec{q}_{s'_1} \in \mathrm{hull} \bigcup_{a'} Q(s'_1, a'), \ldots\right\}$

Pull $\vec{r}(s, a)$ (added independently to the entire set) and $\gamma$ (non-negative) out of the maximum:

$\leftarrow E[\vec{w} \cdot \vec{r}(s, a) \mid s, a] + \gamma \max\left\{\sum_i P(s'_i \mid s, a)\, \vec{w} \cdot \vec{q}_{s'_i} : \vec{q}_{s'_1} \in \mathrm{hull} \bigcup_{a'} Q(s'_1, a'), \ldots\right\}$

But the max of a sum over different sets is the sum of the sets' maxima, which we simplify:

$\leftarrow E[\vec{w} \cdot \vec{r}(s, a) \mid s, a] + \gamma \sum_i P(s'_i \mid s, a) \max\left\{\vec{w} \cdot \vec{q}_{s'_i} : \vec{q}_{s'_i} \in \mathrm{hull} \bigcup_{a'} Q(s'_i, a')\right\}$

$\leftarrow E[\vec{w} \cdot \vec{r}(s, a) \mid s, a] + \gamma \sum_i P(s'_i \mid s, a) \max\left\{\vec{w} \cdot \vec{q}_{s'_i} : \vec{q}_{s'_i} \in Q(s'_i, a'),\ a' \in A(s'_i)\right\}$

$\leftarrow E[\vec{w} \cdot \vec{r}(s, a) \mid s, a] + \gamma \sum_i P(s'_i \mid s, a) \max_{a'} \max_{\vec{q}_{s'_i} \in Q(s'_i, a')} \vec{w} \cdot \vec{q}_{s'_i}$

But we re-order the maxima and rewrite an expectation, and so we recover our recurrence for a single $\vec{w}$:

$Q_{\vec{w}}(s, a) \leftarrow E\left[\vec{w} \cdot \vec{r}(s, a) + \gamma \max_{a'} Q_{\vec{w}}(s', a') \mid s, a\right]$

Given a $\vec{w}$, at any point in the algorithm, this gives the same Q-value as ordinary value iteration. Therefore, the proof of convergence for the value iteration algorithm applies to our method, and our method converges exactly as quickly as ordinary value iteration (for every $\vec{w}$).

References

Abbeel, P., & Ng, A. (2004). Apprenticeship learning via inverse reinforcement learning. Proc. ICML-04.

Ainslie, G. (2001). Breakdown of will. Cambridge: Cambridge University Press.

Bellman, R. E. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.

Clarkson, K. L., & Shor, P. W. (1989). Applications of random sampling in computational geometry, II. Discrete and Computational Geometry, 4, 387-421.

Feinberg, E., & Schwartz, A. (1995). Constrained Markov decision models with weighted discounted rewards. Mathematics of Operations Research, 20, 302-320.

Gabor, Z., Kalmar, Z., & Szepesvari, C. (1998). Multi-criteria reinforcement learning. Proc. ICML-98.

Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99-134.

Mannor, S., & Shimkin, N. (2004). A geometric approach to multi-criterion reinforcement learning. Journal of Machine Learning Research, 5, 325-360.

Natarajan, S., & Tadepalli, P. (2005). Dynamic preferences in multi-criteria reinforcement learning. Proc. ICML-05, Bonn, Germany.

Ng, A., & Russell, S. (2000). Algorithms for inverse reinforcement learning. Proc. ICML-00.

Russell, S., & Zimdars, A. (2003). Q-decomposition for reinforcement learning agents. Proc. ICML-03, Washington, DC.

Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge, Massachusetts: The MIT Press.