Convergence, Targeted Optimality, and Safety in Multiagent Learning

Doran Chakraborty (chakrado@cs.utexas.edu)
Peter Stone (pstone@cs.utexas.edu)
Department of Computer Science, University of Texas, 1 University Station C0500, Austin, Texas 78712, USA

Abstract

This paper introduces a novel multiagent learning algorithm, Convergence with Model Learning and Safety (CMLeS for short), which achieves convergence, targeted optimality against memory-bounded adversaries, and safety in arbitrary repeated games. The most novel aspect of CMLeS is the manner in which it guarantees (in a PAC sense) targeted optimality against memory-bounded adversaries, via efficient exploration and exploitation. CMLeS is fully implemented, and we present empirical results demonstrating its effectiveness.

1. Introduction

In recent years, great strides have been made towards creating autonomous agents that can learn via interaction with their environment. When considering just an individual agent, it is often appropriate to model the world as being stationary, meaning that the same action from the same state will always yield the same (possibly stochastic) effects. However, in the presence of other independent agents, the environment is not stationary: an action's effects may depend on the actions of the other agents. This non-stationarity poses the primary challenge of multiagent learning (MAL) (Buşoniu et al., 2008) and is the main reason that it is best considered distinctly from single-agent learning.

The simplest, and most often studied, MAL scenario is the stateless scenario in which agents repeatedly interact in the stylized setting of a matrix game (a.k.a. normal form game). In the multiagent literature, various criteria have been proposed to evaluate MAL algorithms, emphasizing what behavior they will converge to against various types of opponents in such settings (although we refer to the other agents as opponents, we mean any other agent, whether cooperative, adversarial, or neither).

The contribution of this paper is that it proposes a novel MAL algorithm, CMLeS, that, for a multi-player, multi-action (arbitrary) repeated game, achieves the following three goals:

1. Convergence: it converges to playing a Nash equilibrium in self-play (i.e., when the other agents are also CMLeS agents);

2. Targeted Optimality: in expectation, it converges to achieving close to the best response with high probability, in a tractable number of steps, against the set of memory-bounded, or adaptive, opponents whose memory size is upper bounded by a known value $K_{max}$ (consistent with the literature (Powers et al., 2005), we refer to memory-bounded opponents as adaptive opponents). The same guarantee also holds for opponents which eventually become memory-bounded.

3. Safety: it converges to playing the maximin strategy against every other opponent (apart from one very specific one) which cannot be approximated as a $K_{max}$ memory-bounded opponent.

The only opponent against which CMLeS provides no guarantee is the one which has exact knowledge of CMLeS's decision function and hence can plan to fool it.

1.1. Related work

Bowling and Veloso (2001) were the first to put forth a set of criteria for evaluating multiagent learning algorithms. In games with two players and two actions per player, their algorithm WoLF-IGA converges to playing the best response against stationary, or memoryless, opponents (rationality), and converges to playing the Nash equilibrium in self-play (convergence). Subsequent approaches extended the rationality and convergence criteria to arbitrary (multi-player, multi-action) repeated games (Banerjee & Peng, 2004; Conitzer & Sandholm, 2006). Amongst them, Awesome (Conitzer & Sandholm, 2006) achieves convergence and rationality in arbitrary repeated games without requiring agents to observe each other's mixed strategies. However, none of the above algorithms provides any guarantee about the payoffs achieved when facing arbitrary non-stationary opponents. More recently, Powers et al. proposed a newer set of evaluation criteria that emphasizes compatibility, targeted optimality and safety (Powers & Shoham, 2005). Compatibility is a stricter criterion than convergence, as it requires the learner to converge to within $\epsilon$ of the payoff achieved by a Pareto-optimal Nash equilibrium. Their proposed algorithm, PCM(A) (Powers et al., 2007), is, to the best of our knowledge, the only known MAL algorithm to date that achieves compatibility, safety and targeted optimality against adaptive opponents in arbitrary repeated games.

1.2. Contributions

CMLeS improves on Awesome by guaranteeing both safety and targeted optimality against adaptive opponents. It improves upon PCM(A) in five ways.

1. The only guarantees of optimality against adaptive opponents that PCM(A) provides are against those drawn from an initially chosen target set. In contrast, CMLeS can model every adaptive opponent whose memory size is bounded by $K_{max}$. Thus it does not require a target set as input: its only input is $K_{max}$, an upper bound on the memory size of the adaptive opponents it is willing to model and exploit.

2. Once convinced that the other agents are not self-play agents, PCM(A) achieves targeted optimality against adaptive opponents by requiring all feasible joint histories of size $K_{max}$ to be visited a sufficient number of times, where $K_{max}$ for PCM(A) is the maximum memory size of any opponent in its target set. CMLeS significantly improves on this by requiring a sufficient number of visits only to all feasible joint histories of size K, the true opponent's memory size.

3. Unlike PCM(A), CMLeS promises targeted optimality against opponents which eventually become memory-bounded with $K \le K_{max}$.

4. PCM(A) can only guarantee convergence to a payoff within $\epsilon$ of the desired Nash equilibrium payoff with high probability, rather than with certainty. In contrast, CMLeS guarantees convergence in self-play with probability 1.

5. CMLeS is relatively simple in its design. It tackles the entire problem of targeted optimality and safety by running a single algorithm that implicitly achieves one or the other, without having to reason separately about adaptive and arbitrary opponents.

The remainder of the paper is organized as follows. Section 2 presents background and definitions, Sections 3 and 4 present our algorithm, Section 5 presents empirical results, and Section 6 concludes.

2. Background and Concepts

This section reviews the definitions and concepts necessary for fully specifying CMLeS. A matrix game is defined as an interaction between n agents. Without loss of generality, we assume that the set of actions available to all agents is the same, i.e., $A_1 = \ldots = A_n = A$.
The payoff received by agent i during each step of interaction is determined by a utility function over the agents' joint action, $u_i : A^n \to \mathbb{R}$. Without loss of generality, we assume that the payoffs are bounded in the range [0,1]. A repeated game is a setting in which the agents play the same matrix game repeatedly and infinitely often. A single-stage Nash equilibrium is a stationary strategy profile $\{\pi_1^*, \ldots, \pi_n^*\}$ such that for every agent i and for every other possible stationary strategy $\pi_i$, the following inequality holds: $E_{(\pi_1^*, \ldots, \pi_i, \ldots, \pi_n^*)}[u_i(\cdot)] \le E_{(\pi_1^*, \ldots, \pi_i^*, \ldots, \pi_n^*)}[u_i(\cdot)]$. It is a strategy profile in which no agent has an incentive to unilaterally deviate from its own share of the strategy. A maximin strategy for an agent is a strategy which maximizes its own minimum payoff. It is often called the safety strategy, because resorting to it guarantees the agent a minimum payoff.

An adaptive opponent strategy looks back at the most recent K joint actions played in the current history of play to determine its next stochastic action profile. K is referred to as the memory size of the opponent; it is the minimum memory size that fully characterizes the opponent's strategy. The strategy of such an opponent is then a mapping $\pi : (A^n)^K \to \Delta A$. If we consider opponents whose future behavior depends on the entire history, we lose the ability to (provably) learn anything about them in a single repeated game, since we see a given history only once. The concept of memory-boundedness limits the opponent's ability to condition on history, thereby giving us a chance of learning its policy.

We now specify what we mean by playing optimally against adaptive opponents. For notational clarity, we denote the other agents collectively as a single agent o. It has been shown previously (Chakraborty & Stone, 2008) that the dynamics of playing against such an o can be modeled as a Markov Decision Process (MDP) whose transition probability function and reward function are determined by the opponent's (joint) strategy $\pi$. As the MDP is induced by an adversary, this setting is called an Adversary Induced MDP, or AIM for short.

An AIM is characterized by the K of the opponent which induces it: the AIM's state space is the set of all feasible joint-action sequences of length K. By way of example, consider the game of Roshambo, or rock-paper-scissors (Figure 1), and assume that o is a single agent with K = 1, meaning that it acts based entirely on the immediately previous joint action. Let the current state be (R, P), meaning that on the previous step, i selected R and o selected P. Assume that from that state, o plays actions R, P and S with probability 0.25, 0.25, and 0.5 respectively. When i chooses to take action S in state (R, P), the probabilities of transitioning to states (S, R), (S, P) and (S, S) are then 0.25, 0.25 and 0.5 respectively. Transitions to states that have a different action for i, such as (R, R), have probability 0. The reward obtained by i when it transitions to state (S, R) is -1, and so on. The optimal policy of the MDP associated with the AIM is the optimal policy for playing against o. A policy that achieves an expected return within $\epsilon$ of the expected return achieved by the optimal policy is called an $\epsilon$-optimal policy (the corresponding return is called an $\epsilon$-optimal return). If $\pi$ were known, then we could compute the optimal policy (and hence an $\epsilon$-optimal policy) by dynamic programming (Sutton & Barto, 1998).
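To make the AIM construction concrete, here is a minimal Python sketch (not from the paper; the routine names, the discount factor, and every opponent row other than the (R, P) row from the example are our own assumptions) that builds the MDP induced by a K = 1 Roshambo opponent and solves it by value iteration.

import itertools

ACTIONS = ["R", "P", "S"]

def payoff(a_i, a_o):
    # rock-paper-scissors payoff to agent i for the joint action (a_i, a_o)
    if a_i == a_o:
        return 0
    wins = {("R", "S"), ("P", "R"), ("S", "P")}
    return 1 if (a_i, a_o) in wins else -1

# Hypothetical K = 1 opponent: previous joint action -> distribution over o's next action.
# Only the (R, P) row comes from the example in the text; the rest is uniform for illustration.
opponent = {s: {"R": 1/3, "P": 1/3, "S": 1/3} for s in itertools.product(ACTIONS, ACTIONS)}
opponent[("R", "P")] = {"R": 0.25, "P": 0.25, "S": 0.5}

GAMMA = 0.95  # discount factor (an arbitrary choice for this sketch)

def value_iteration(opponent, gamma=GAMMA, iters=1000):
    """States of the AIM are the previous joint actions; compute the optimal policy against o."""
    states = list(opponent.keys())
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(sum(p * (payoff(a_i, a_o) + gamma * V[(a_i, a_o)])
                        for a_o, p in opponent[s].items())
                    for a_i in ACTIONS)
             for s in states}
    policy = {s: max(ACTIONS,
                     key=lambda a_i: sum(p * (payoff(a_i, a_o) + gamma * V[(a_i, a_o)])
                                         for a_o, p in opponent[s].items()))
              for s in states}
    return V, policy

V, policy = value_iteration(opponent)
print(policy[("R", "P")])  # computed best response when the previous joint action was (R, P)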
However, we do not assume that $\pi$ or even K is known in advance: they need to be learned through online play. We use the discounted-payoff criterion in our computation of an $\epsilon$-optimal policy, with $\gamma$ denoting the discount factor. Finally, it is important to note that there exist opponents in the literature which do not allow convergence to the optimal policy once a certain set of moves has been played. For example, the grim-trigger opponent in the well-known Prisoner's Dilemma (PD) game, an opponent with memory size 1, plays cooperate at first, but then plays defect forever once the other agent has played defect once. Thus there is no way of detecting its strategy without defecting, after which it is impossible to recover the optimal strategy of mutual cooperation. In our analysis, we constrain the class of adaptive opponents to include only those which do not negate the possibility of convergence to optimal exploitation, given any arbitrary initial sequence of exploratory moves (Powers & Shoham, 2005). Equipped with the required concepts, we are now ready to specify our algorithms.

Figure 1. Example of an AIM: the Roshambo payoff matrix (R = Rock, P = Paper, S = Scissors), the opponent's strategy at state (R, P), and the partial transition function for state (R, P) and action S.

3. Model Learning with Safety (MLeS)

In this section, we introduce a novel algorithm, Model Learning with Safety (MLeS), that ensures targeted optimality against adaptive opponents and safety.

3.1. Overview

MLeS begins with the hypothesis that the opponent is an adaptive opponent (denoted o) with an unknown memory size K that is bounded above by a known value $K_{max}$. MLeS maintains a model for each possible value of o's memory size, from k = 0 to $K_{max}$, plus one additional model for memory size $K_{max}+1$. Each model $\hat{\pi}_k : (A_i \times A_o)^k \to \Delta A_o$ represents a possible strategy for o: $\hat{\pi}_k$ is the maximum-likelihood distribution based on the observed actions played by o for each joint history of size k encountered. Henceforth we refer to a joint history of size k as $s_k$, to the empirical distribution captured by $\hat{\pi}_k$ for $s_k$ as $\hat{\pi}_k(s_k)$, and to the probability it assigns to action $a_o$ as $\hat{\pi}_k(s_k, a_o)$. When a particular $s_k$ is encountered and o's action on the following step is observed, the empirical distribution $\hat{\pi}_k(s_k)$ is updated. Such updates happen for every $\hat{\pi}_k$, on every step. For every $s_k$, MLeS also maintains a count $v(s_k)$ of the number of times $s_k$ has been visited.

The operations performed by MLeS on each step can be summarized as follows:

1. Update all models based on the past step (a sketch of this bookkeeping appears below).
2. Determine $k_{best}$, the best memory size that describes o's behavior, by calling the Find-K algorithm. If Find-K fails to determine a $k_{best}$, it is set to -1.
3. If $k_{best} \neq -1$, take a step towards solving the reinforcement learning (RL) problem for the AIM induced by $k_{best}$. Otherwise, play the maximin strategy.

Of these three steps, step 2 is by far the most complex. We present how MLeS addresses it next.
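The model bookkeeping of step 1 above can be sketched as follows. This is a hypothetical implementation (the class and method names are ours, not the paper's); it keeps, for every candidate memory size k, the counts from which the maximum-likelihood models $\hat{\pi}_k$ and the visit counts $v(s_k)$ are read off.

from collections import defaultdict

class OpponentModels:
    """Maximum-likelihood opponent models pi_hat_k for k = 0 .. Kmax+1 (a sketch, not the authors' code)."""

    def __init__(self, k_max):
        self.k_max = k_max
        # counts[k][s_k][a_o]: times o played a_o immediately after joint history s_k
        self.counts = [defaultdict(lambda: defaultdict(int)) for _ in range(k_max + 2)]
        self.history = []  # joint actions (a_i, a_o), most recent last

    def update(self, a_i, a_o):
        """Step 1 of MLeS: credit the newly observed opponent action to every model
        whose conditioning history is already available."""
        for k in range(self.k_max + 2):
            if len(self.history) >= k:
                s_k = tuple(self.history[-k:]) if k > 0 else ()
                self.counts[k][s_k][a_o] += 1
        self.history.append((a_i, a_o))

    def visits(self, k, s_k):
        """v(s_k): number of times s_k has been visited (with o's reply observed)."""
        return sum(self.counts[k][s_k].values())

    def pi_hat(self, k, s_k, a_o):
        """pi_hat_k(s_k, a_o): empirical probability that o plays a_o after history s_k."""
        total = self.visits(k, s_k)
        return self.counts[k][s_k][a_o] / total if total > 0 else 0.0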
3.2. Find-K algorithm

The objective of Find-K is to return the best estimate of o's memory size ($k_{best}$) on every time step. At a high level, it does so by comparing models of increasing size to determine the point at which larger models cease to become more predictive of o's behavior. We begin by proposing a metric, $\Delta_k$, which is an estimate of how much the models $\hat{\pi}_k$ and $\hat{\pi}_{k+1}$ differ from each other. First, we introduce two pieces of notation that will be instrumental in explaining the metric. We denote by $(a_i, a_o) \cdot s_k$ a joint history of size k+1 that has $s_k$ as its last k joint actions and $(a_i, a_o)$ as its (k+1)'th joint action. For any $s_k$, we define the set $Aug(s_k) = \bigcup_{(a_i, a_o) \in A_i \times A_o} \{(a_i, a_o) \cdot s_k \mid v((a_i, a_o) \cdot s_k) > 0\}$. In other words, $Aug(s_k)$ contains all joint histories of size k+1 which have $s_k$ as their last k joint actions and have been visited at least once. $\Delta_k$ is then defined as $\max_{s_k,\, s_{k+1} \in Aug(s_k),\, a_o \in A} |\hat{\pi}_k(s_k, a_o) - \hat{\pi}_{k+1}(s_{k+1}, a_o)|$; it is thus the maximum difference in prediction between the models $\hat{\pi}_k$ and $\hat{\pi}_{k+1}$.

Based on the concept of $\Delta_k$, we make two crucial observations that will come in handy for the theoretical claims made later in this subsection.

Observation 1. For all $K \le k \le K_{max}$, for any k-sized joint history $s_k$ and any $s_{k+1} \in Aug(s_k)$, $E(\hat{\pi}_k(s_k)) = E(\hat{\pi}_{k+1}(s_{k+1}))$. Hence $E(\Delta_k) = 0$.

Let $s_K$ be the last K joint actions in $s_k$ and $s_{k+1}$. $\hat{\pi}_k(s_k)$ and $\hat{\pi}_{k+1}(s_{k+1})$ represent draws from the same fixed distribution $\pi(s_K)$, so their expectations will always equal $\pi(s_K)$. This is because o looks only at the most recent K joint actions in its history to decide on its next action.

Observation 2. Once every joint history of size K has been visited at least once, $E(\Delta_{K-1}) \ge \mu > 0$, where $\mu$ is a lower bound on the error with which $\hat{\pi}_{K-1}$ can approximate $\pi$.

Intuitively, Observation 2 follows from the reasoning that $\hat{\pi}_{K-1}$ cannot fully represent $\pi$ without losing some information. We illustrate this with a simple example. Assume that o is of memory size 1 (K = 1) and let A = {a, b}, i.e., each player has just two actions, a and b. Let the probabilities assigned by o to action a for the possible 1-step joint histories (a, a), (a, b), (b, a) and (b, b) be 0.2, 0.3, 0.3 and 0.7 respectively. Now the probability assigned to action a by the 0-step model, $\hat{\pi}_0(\emptyset, a)$, can only be a linear combination of these values, where the coefficients come from the number of visits made to each of these joint histories. Once every 1-step joint history has been visited once, $\hat{\pi}_0(\emptyset, a)$ can lie anywhere between 0.2 and 0.7. Since $E(\hat{\pi}_1((a, a), a)) = 0.2, \ldots, E(\hat{\pi}_1((b, b), a)) = 0.7$, it is evident that $E(\Delta_0)$ is lowest when $E(\hat{\pi}_0(\emptyset, a))$ is $\frac{0.2+0.7}{2} = 0.45$. Hence $\mu = 0.25$ in this case.

High-level idea of Find-K: We denote the current values of $\hat{\pi}_k$ and $\Delta_k$ at time t as $\hat{\pi}_k^t$ and $\Delta_k^t$ respectively. The approach taken by Find-K (Alg. 1) can be broadly divided into two steps.

line 2: For each time step t, compute the values $\Delta_k^t$ and $\sigma_k^t$ for all $0 \le k \le K_{max}$. For the time being, assume that the $\sigma_k^t$'s computed always satisfy the condition

$\forall K \le k \le K_{max}: \Pr(\Delta_k^t \ge \sigma_k^t) \le w$   (1)

where $w$ is a very small probability value. In other words, even without knowledge of K, we want the difference between two consecutive models of sizes k and k+1, where $k \ge K$, to be less than $\sigma_k^t$ with a high probability of at least $1 - w$. Note that although we compute a $\sigma_k^t$ for every $0 \le k \le K_{max}$, the guarantee from Inequality 1 only holds for $K \le k \le K_{max}$. We will show shortly how we compute the $\sigma_k^t$'s.

lines 3-12: Then, iterate over values of k starting from 0 to $K_{max}$ and choose the minimum k such that for all $k \le k' \le K_{max}$ the condition $\Delta_{k'}^t < \sigma_{k'}^t$ is satisfied. Finally, return that value as $k_{best}$.

Algorithm 1: Find-K
output: $k_{best}$
1:  $k_{best}$ <- -1
2:  for all $0 \le k \le K_{max}$, compute $\Delta_k^t$ and $\sigma_k^t$
3:  for k = 0 to $K_{max}$ do
4:      flag <- true
5:      for k' = k to $K_{max}$ do
6:          if $\Delta_{k'}^t \ge \sigma_{k'}^t$ then
7:              flag <- false
8:              break
9:      if flag then
10:         $k_{best}$ <- k
11:         break
12: return $k_{best}$
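For illustration, $Aug(s_k)$ and $\Delta_k$ can be computed directly from the counts maintained earlier. The sketch below assumes the hypothetical OpponentModels bookkeeping from the previous sketch; it is not the authors' code.

def aug(models, k, s_k):
    """Aug(s_k): joint histories of size k+1 ending in s_k that have been visited at least once."""
    return [s for s in models.counts[k + 1] if s[1:] == s_k and models.visits(k + 1, s) > 0]

def delta(models, k, action_set):
    """Delta_k: maximum prediction gap between pi_hat_k and pi_hat_{k+1} over visited histories."""
    best = 0.0
    for s_k in list(models.counts[k]):
        for s_k1 in aug(models, k, s_k):
            for a_o in action_set:
                gap = abs(models.pi_hat(k, s_k, a_o) - models.pi_hat(k + 1, s_k1, a_o))
                best = max(best, gap)
    return best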
Next, we show that eventually the $k_{best}$ returned by Find-K is K with high probability. We start by providing an intuitive justification; we prove it formally when we state Lemma 3.1. We begin by showing that eventually Find-K will reject any k < K as a possible value for $k_{best}$, with high probability. With more samples, $\Delta_{K-1}^t$ will tend to a positive value with high probability (from Observation 2). This, coupled with the fact that $\sigma_{K-1}^t$ eventually assumes a value lower than $\mu$ (once the condition stated in Lemma 3.1 is met), makes $\sigma_{K-1}^t < \Delta_{K-1}^t$. This is a sufficient condition for Find-K to keep rejecting all k < K as possible candidates for $k_{best}$ (steps 6-8). Next we show that, once every k < K keeps getting rejected consistently, Find-K will select K as the value for $k_{best}$ with a high probability of at least $1 - (K_{max} - K + 1)w$ (the proof follows from Inequality 1 and a union bound over all $K \le k \le K_{max}$). A k > K is only considered for selection once K gets rejected, which can only happen with probability at most $(K_{max} - K + 1)w$, a very small value. Thus Find-K will eventually converge to selecting K as $k_{best}$ with high probability.

We now address the final part of Find-K that we have yet to specify: setting the $\sigma_k^t$'s (step 2).

Choosing $\sigma_k^t$: In the computation of $\Delta_k^t$, MLeS chooses a specific $s_k^t$ from the set of all possible joint histories of size k, a specific $s_{k+1}^t$ from $Aug(s_k^t)$, and an action $a_o^t$, for which the models $\hat{\pi}_k^t$ and $\hat{\pi}_{k+1}^t$ differ maximally on that particular time step. So,

$\Delta_k^t < \sigma_k^t \iff |\hat{\pi}_k^t(s_k^t, a_o^t) - \hat{\pi}_{k+1}^t(s_{k+1}^t, a_o^t)| < \sigma_k^t$   (2)

The goal will be to select a value for $\sigma_k^t$ such that Inequality 1 is always satisfied; hence the rest of the derivation will focus on the range $K \le k \le K_{max}$. We can then rewrite Inequality 2 as

$|(\hat{\pi}_k^t(s_k^t, a_o^t) - E(\hat{\pi}_k^t(s_k^t, a_o^t))) - (\hat{\pi}_{k+1}^t(s_{k+1}^t, a_o^t) - E(\hat{\pi}_{k+1}^t(s_{k+1}^t, a_o^t)))| < \sigma_k^t$   (3)

The above step follows from $E(\hat{\pi}_k^t(s_k^t, a_o^t)) = E(\hat{\pi}_{k+1}^t(s_{k+1}^t, a_o^t))$ (Observation 1). One way to satisfy Inequality 3 is to have both $|\hat{\pi}_k^t(s_k^t, a_o^t) - E(\hat{\pi}_k^t(s_k^t, a_o^t))|$ and $|\hat{\pi}_{k+1}^t(s_{k+1}^t, a_o^t) - E(\hat{\pi}_{k+1}^t(s_{k+1}^t, a_o^t))|$ be less than $\frac{\sigma_k^t}{2}$. By upper bounding the probability of failure of each of these two events by $\frac{w}{2}$ and then using a union bound, we get

$\Pr(\Delta_k^t < \sigma_k^t) > 1 - w$   (4)

Also, we observe that the following holds:

$\Pr(|\hat{\pi}_{k+1}^t(s_{k+1}^t, a_o^t) - E(\hat{\pi}_{k+1}^t(s_{k+1}^t, a_o^t))| \ge \frac{\sigma_k^t}{2}) \ge \Pr(|\hat{\pi}_k^t(s_k^t, a_o^t) - E(\hat{\pi}_k^t(s_k^t, a_o^t))| \ge \frac{\sigma_k^t}{2})$   (5)

This can be derived by applying Hoeffding's inequality and using $v(s_k^t) \ge v(s_{k+1}^t)$, which holds because the number of visits to a joint history $s_k$ must be at least the number of visits to any member of $Aug(s_k)$. Intuitively, if we are confident that we have learned the bigger model to a reasonable approximation, then we are confident, with at least the same confidence, that we have learned the smaller model to the same approximation. So Inequality 4 holds as long as the deviation probability for the bigger model is at most $\frac{w}{2}$. The problem now boils down to selecting a suitable $\sigma_k^t$ such that Inequality 4 is satisfied. By applying Hoeffding's inequality and solving for $\sigma_k^t$, we get

$\sigma_k^t = \sqrt{\frac{2}{v(s_{k+1}^t)} \ln\frac{4}{w}}$   (6)

So in general, for each $k \in [0, K_{max}]$, the $\sigma_k^t$ value is set as above. Note that $v(s_{k+1}^t)$ is the number of visits to the specific $s_{k+1}^t$ chosen for the computation of $\sigma_k^t$.

Theoretical underpinnings: We now state our main theoretical result regarding Find-K.

Lemma 3.1. After all feasible joint histories of size K have been visited $\frac{8}{\mu^2}\ln(\frac{4}{w})$ times, then with probability at least $1 - (K_{max} + 2)w$, the $k_{best}$ returned by Find-K is K. Here $\mu$ is the lower bound on the degree to which $\hat{\pi}_{K-1}$ can approximate $\pi$, and $w$ is the small probability value from Inequality 1.

Proof Sketch: We have already shown that (i) once the choice boils down to selecting a $k_{best}$ from the range $[K, K_{max}]$, K is selected with a high probability of at least $1 - (K_{max} - K + 1)w$. Now we show that after all feasible joint histories of size K have been visited the number of times specified in the lemma, the probability of rejecting a $k_{best} < K$ is at least $1 - w$. To reject a k < K, it is sufficient to have $\Delta_{K-1}^t \ge \sigma_{K-1}^t$. Using Hoeffding's inequality, we can show that $\Delta_{K-1}^t \ge E(\Delta_{K-1}) - \sigma_{K-1}^t \ge \mu - \sigma_{K-1}^t$, with probability of error at most $w$. Therefore $\Delta_{K-1}^t \ge \sigma_{K-1}^t$ whenever $\sigma_{K-1}^t \le \frac{\mu}{2}$, i.e., whenever $v(s_{k+1}^t) \ge \frac{8}{\mu^2}\ln(\frac{4}{w})$ (the last step follows from Equation 6). Hence (ii) when all joint histories of size K have been visited $\frac{8}{\mu^2}\ln(\frac{4}{w})$ times, the probability of rejecting a k < K is at least $1 - w$. Combining (i) and (ii) gives the proof.

The onus now lies on the action selection mechanism (step 3 of MLeS) to ensure that every feasible K-sized history gets visited the number of times specified in Lemma 3.1, which will enable Find-K to keep returning K consistently.
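Putting the pieces together, a sketch of Find-K with the threshold of Equation 6 might look as follows. It reuses the hypothetical aug and OpponentModels helpers from the earlier sketches, takes the per-test failure probability w of Inequality 1 as a parameter, and, as in the text, pairs each $\Delta_k^t$ with a $\sigma_k^t$ computed from the visit count of the maximizing $s_{k+1}^t$.

import math

def delta_and_sigma(models, k, action_set, w):
    """Delta_k^t together with the matching threshold sigma_k^t of Equation 6."""
    best_gap, best_visits = 0.0, 1  # with no data, everything trivially passes the test
    for s_k in list(models.counts[k]):
        for s_k1 in aug(models, k, s_k):
            for a_o in action_set:
                gap = abs(models.pi_hat(k, s_k, a_o) - models.pi_hat(k + 1, s_k1, a_o))
                if gap >= best_gap:
                    best_gap, best_visits = gap, max(models.visits(k + 1, s_k1), 1)
    sigma = math.sqrt((2.0 / best_visits) * math.log(4.0 / w))
    return best_gap, sigma

def find_k(models, action_set, w):
    """Algorithm 1 (Find-K): smallest k such that Delta_k' < sigma_k' for all k' >= k, else -1."""
    stats = {k: delta_and_sigma(models, k, action_set, w) for k in range(models.k_max + 1)}
    for k in range(models.k_max + 1):
        if all(stats[kk][0] < stats[kk][1] for kk in range(k, models.k_max + 1)):
            return k
    return -1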
3.3. Action selection

On each time step, the action selection mechanism decides which action to take on the ensuing time step. If the $k_{best}$ returned is -1, it plays the maximin strategy. If $k_{best} \neq -1$, the action selection strategy picks the AIM associated with opponent memory $k_{best}$ and takes the next step in the reinforcement learning problem of computing a near-optimal policy for that AIM. To solve this RL problem, we use the model-based RL algorithm R-Max (Brafman & Tennenholtz, 2003). We maintain a separate instantiation of the R-Max algorithm for each of the $K_{max}+1$ possible AIMs pertaining to the possible memory sizes of o, i.e., $M_0, M_1, \ldots, M_{K_{max}}$. On each step, based on the $k_{best}$ returned, the R-Max instance for the AIM $M_{k_{best}}$ is selected to take an action. The two conditions that ensure targeted optimality against adaptive opponents are then:

1. With probability at least $1 - \frac{\delta}{3}$, ensure that all feasible joint histories of size K are visited $\frac{8}{\mu^2}\ln(\frac{12(K_{max}+2)}{\delta})$ times. Once this criterion is satisfied, Find-K keeps returning $k_{best} = K$ with probability at least $1 - \frac{\delta}{3}$ (from Lemma 3.1, by setting $w = \frac{\delta}{3(K_{max}+2)}$).

2. With probability at least $1 - \frac{\delta}{3}$, allow the R-Max instance of $M_K$ to converge to achieving an $\epsilon$-optimal return.

Our ultimate goal is to have our action selection mechanism implicitly satisfy both conditions with sample complexity polynomial in $\frac{1}{\epsilon}$, $\frac{1}{\delta}$, $\frac{1}{\mu}$, $N_K$, $|A|$ and $\frac{1}{1-\gamma}$, where $N_K$ is the number of joint histories of size K and $\frac{1}{1-\gamma}$ is the horizon time for the discounted return. By the sample complexity property of R-Max, condition 2 will always be satisfied with sample complexity polynomial in the above quantities. Hence, all we need to ensure is that condition 1 is satisfied with sample complexity polynomial in the above quantities.
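The per-step control flow of MLeS described in Sections 3.1 and 3.3 can then be sketched as below. RMaxInstance and maximin_action are assumed placeholders (the paper does not specify them at this level of detail), and the Find-K and model helpers are the hypothetical ones from the earlier sketches.

class MLeS:
    """Per-step control flow of MLeS (a sketch; RMaxInstance and maximin_action are assumed)."""

    def __init__(self, k_max, action_set, w):
        self.models = OpponentModels(k_max)
        self.action_set = action_set
        self.w = w
        # one R-Max learner per candidate AIM M_0 .. M_Kmax
        self.rmax = [RMaxInstance(memory=k, actions=action_set) for k in range(k_max + 1)]
        self.last_action = None

    def observe(self, a_o):
        """Step 1: update every model with the opponent's latest action."""
        if self.last_action is not None:
            self.models.update(self.last_action, a_o)

    def act(self):
        """Steps 2-3: pick k_best via Find-K, then either play maximin or advance R-Max on M_{k_best}."""
        k_best = find_k(self.models, self.action_set, self.w)
        if k_best == -1:
            self.last_action = maximin_action(self.action_set)
        else:
            state = tuple(self.models.history[-k_best:]) if k_best > 0 else ()
            self.last_action = self.rmax[k_best].step(state)
        return self.last_action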
It can be shown that our action selection mechanism implicitly satisfies both conditions with sample complexity polynomial in the desired quantities by rerunning each R-Max instance in phases, similar in spirit to running R-Max with an unknown mixing time. In each phase, the value of $\epsilon$ used by an instance is decremented by some constant, starting from an arbitrary initial value (a decrement of 0.1 and an initial value of 0.6, respectively, in our experiments). A phase for an AIM instance $M_k$ ends when the number of times that instance has been chosen by the action selection mechanism to decide on an action reaches some polynomial in the current $\epsilon$ value for that instance, $\frac{1}{\delta}$, $N_k$, $|A|$ and $\frac{1}{1-\gamma}$ (refer to (Brafman & Tennenholtz, 2003) for the exact term). Moreover, when an R-Max instance enters a new phase, all other R-Max instances are also forced to enter a new phase, with their corresponding $\epsilon$ values decremented. This technique facilitates faster exploration in the initial stages, where the job is to satisfy condition 1 as quickly as possible. Once condition 1 has been satisfied, $M_K$ is repeatedly selected by the action selection mechanism with very high probability. Since the other AIMs are selected very rarely from then onwards, their behavior has very little impact on $M_K$. Eventually, at some phase (if not already), the $\epsilon$ for $M_K$ gets assigned a value just below the desired value, and MLeS behaves near-optimally from that phase onwards.
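The synchronized phase scheme described above might be realized as in the following sketch. It is our own construction: the per-phase budget is abstracted as a callable standing in for the R-Max polynomial, while the initial value of 0.6 and the decrement of 0.1 follow the experimental settings mentioned in the text.

class PhaseScheduler:
    """Synchronized phases over the R-Max instances M_0 .. M_Kmax (a sketch, not the authors' code)."""

    def __init__(self, k_max, phase_budget, eps_init=0.6, eps_step=0.1):
        self.eps = [eps_init] * (k_max + 1)       # current epsilon per AIM instance
        self.eps_step = eps_step
        self.steps_in_phase = [0] * (k_max + 1)   # times each instance was chosen this phase
        self.phase_budget = phase_budget          # callable: (k, eps) -> steps allowed before the phase ends

    def record_choice(self, k):
        """Called whenever the action selection mechanism lets instance M_k act."""
        self.steps_in_phase[k] += 1
        if self.steps_in_phase[k] >= self.phase_budget(k, self.eps[k]):
            # one instance ending its phase forces a new phase (and a smaller epsilon) for all
            for j in range(len(self.eps)):
                self.eps[j] = max(self.eps[j] - self.eps_step, 1e-3)
                self.steps_in_phase[j] = 0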
To conclude, what we have shown is that MLeS converges to a policy that ensures an $\epsilon$-optimal return, and plays that policy with a high probability $1-\delta$ on each step. It is important to note that acting in this fashion does not by itself guarantee an expected return greater than $1-\delta$ times the $\epsilon$-optimal return, because a transition taken while acting sub-optimally could lead to a state from which the best achievable return is 0. To compensate for this possibility, we must make $\delta$ small enough that this worst-case outcome is offset by the low probability of its occurrence. Fortunately, we can characterize the maximum possible loss as a function of $\delta$, which is given by $L(\delta) = \frac{\delta}{(1-\gamma)(1-\gamma(1-\delta))}$. Clearly $L(\delta)$ is a decreasing function of $1-\delta$ and assumes low values for low values of $\delta$. This brings us to our main theorem regarding MLeS.

Theorem 3.2. For any arbitrary $\epsilon > 0$ and $\delta > 0$, with probability at least $1-\delta$, MLeS in expectation achieves at least within $\epsilon + L(\delta)$ of the best response against any adaptive opponent, in a number of time steps polynomial in $\frac{1}{\epsilon}$, $\frac{1}{\delta}$, $\frac{1}{\mu}$, $N_K$, $|A|$ and $\frac{1}{1-\gamma}$.

MLeS can model any adaptive opponent of memory size bounded by $K_{max}$. Against an arbitrary o, our claims rely on o not behaving as a $K_{max}$ adaptive opponent in the limit. MLeS will then by default converge to playing the maximin strategy, ensuring safety.

4. Convergence and Model Learning with Safety (CMLeS)

In this section, we build on MLeS to introduce a novel MAL algorithm for arbitrary repeated games which achieves safety, targeted optimality, and convergence, as defined in Section 1. We call our algorithm Convergence with Model Learning and Safety (CMLeS). CMLeS begins by testing the opponents to see whether they are also running CMLeS (self-play); when they are not, it uses MLeS as a subroutine.

4.1. Overview

CMLeS (Alg. 2) can be tuned to converge to any Nash equilibrium of the repeated game in self-play. Here, for the sake of clarity, we present a variant which converges to the single-stage Nash equilibrium. This equilibrium also has the advantage of being the easiest of all Nash equilibria to compute and hence has historically been the preferred solution concept in multiagent learning (Bowling & Veloso, 2001; Conitzer & Sandholm, 2006). The extension of CMLeS to allow convergence to other Nash equilibria is straightforward, requiring only that we keep track of the probability distribution for every conditional strategy present in the specification of the equilibrium.

Algorithm 2: CMLeS
input: n, $\tau$ = 0
1:  for $j \in \{1, 2, \ldots, n\}$ do
2:      $\pi_j^*$ <- ComputeNashEquilibriumStrategy()
3:  AAPE <- true
4:  while AAPE do
5:      for N rounds do
6:          Play $\pi_i^*$ (own share of the equilibrium)
7:          for each agent j, update $\phi_j$
8:      recompute AAPE using the $\phi_j$'s and $\pi_j^*$'s
9:      if AAPE is false then
10:         if $\tau$ = 0 then
11:             Play $a_o$, $K_{max}$+1 times
12:         else if $\tau$ = 1 then
13:             Play $a_o$, $K_{max}$ times, followed by a random action other than $a_o$
14:         else
15:             Play $a_o$, $K_{max}$+1 times
16:         if any other agent plays differently then
17:             AAPE <- false
18:         else
19:             AAPE <- true
20:         $\tau$ <- $\tau$ + 1
21: end while
22: Play MLeS

Steps 1-2: As in Awesome, we assume that all agents have access to a Nash equilibrium solver and that they compute the same Nash equilibrium profile.

Steps 3-4: The algorithm maintains the null hypothesis that all agents are playing the equilibrium (AAPE). The hypothesis is not rejected unless the algorithm is certain, with probability 1, that the other agents are not playing CMLeS. $\tau$ keeps count of the number of times the algorithm reaches step 4.

Steps 5-8 (same as Awesome): Whenever the algorithm reaches step 5, it plays the equilibrium strategy for a fixed number of episodes, N. It keeps a running estimate $\phi_j$ of the empirical distribution of actions played by each agent j, including itself, during this run. At step 8, if for any agent j the empirical distribution $\phi_j$ differs from $\pi_j^*$ by at least $\epsilon_e$, AAPE is set to false: the CMLeS agent has reason to believe that j may not be playing the same algorithm. The $\epsilon_e$ and N values for each $\tau$ are assigned in a fashion similar to Awesome (Definition 4 of (Conitzer & Sandholm, 2006)).

Steps 10-20: Once AAPE is set to false, the algorithm goes through a series of steps in which it checks whether the other agents really are CMLeS agents. The details are explained below when we describe the convergence properties of CMLeS (Theorem 4.1).

Step 22: When the algorithm reaches here, it is sure (with probability 1) that the other agents are not CMLeS agents. Hence it switches to playing MLeS.
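Anticipating the description in the proof of Theorem 4.1, the probing behavior of steps 10-20 can be sketched as follows. This is a hypothetical rendering: play and observe_others are assumed callbacks for interacting with the game, and the comparison of responses across probes is only summarized in the trailing comment.

import random

def probe(tau, a_o, k_max, action_set, play, observe_others):
    """One probing round of CMLeS (Algorithm 2, steps 10-20), entered whenever AAPE becomes false.
    play(a) emits our action; observe_others() returns the tuple of the other agents' actions.
    Returns the sequence of observed responses, to be compared against earlier probes."""
    if tau == 1:
        # second probe: K_max plays of a_o followed by a deliberately different action
        sequence = [a_o] * k_max + [random.choice([a for a in action_set if a != a_o])]
    else:
        # first and all later probes: a_o played K_max + 1 times in a row
        sequence = [a_o] * (k_max + 1)
    responses = []
    for a in sequence:
        play(a)
        responses.append(observe_others())
    return responses

# A CMLeS agent rejects the hypothesis "all agents play CMLeS" (and switches to MLeS, step 22)
# as soon as the others' responses differ across probes (cf. steps 16-19 of Algorithm 2).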
4.2. Theoretical underpinnings

We now state our main convergence theorems.

Theorem 4.1. CMLeS satisfies both the criteria of targeted optimality and safety.

Proof. To prove the theorem, we need to show:

1. For opponents not themselves playing CMLeS, CMLeS always reaches step 22 with some probability. The only opponent against which CMLeS provides no guarantee is the one which has exact knowledge of CMLeS's decision function and hence can plan to fool it;

2. There exists a value of $\tau$ at and above which the above probability is at least $1-\delta$.

Proof of 1. We utilize the property that a K adaptive opponent is also a $K_{max}$ adaptive opponent (see Observation 1). The first time AAPE is set to false, CMLeS selects a random action $a_o$ and then plays it $K_{max}+1$ times in a row. The second time AAPE is set to false, it plays $a_o$ $K_{max}$ times, followed by a different action. If the other agents have behaved identically in both of the above situations, then CMLeS knows that either 1) the rest of the agents are playing CMLeS, or 2) if they are adaptive, they play stochastically for the $K_{max}$-bounded memory in which all agents play $a_o$. The latter observation comes in handy below. Henceforth, whenever AAPE is set to false, CMLeS always plays $a_o$, $K_{max}+1$ times in a row. Since a non-CMLeS opponent must be stochastic (from the above observation), at some point in time it will play a different action on the $(K_{max}+1)$'th step with non-zero probability. CMLeS then rejects the null hypothesis that all other agents are CMLeS agents and jumps to step 22.

Proof of 2. This part of the proof follows from Hoeffding's inequality. CMLeS reaches step 22 with probability at least $1-\delta$ within a number of probes polynomial in $\frac{1}{\rho}$ and $\ln(\frac{1}{\delta})$, where $\rho$ is the maximum probability that every other agent assigns to any action other than $a_o$ for a recent $K_{max}$ joint history in which all agents play $a_o$.

Theorem 4.2. In self-play, CMLeS converges to playing the Nash equilibrium of the repeated game with probability 1.

The proof follows from the corresponding proof for Awesome (Theorem 3 of Conitzer & Sandholm, 2006).

5. Results

We now present empirical results that supplement the theoretical claims. We focus on how efficiently CMLeS models adaptive opponents in comparison to the existing algorithms PCM(A) and Awesome. For CMLeS, we set $\epsilon = 0.1$, $\delta = 0.01$ and $K_{max} = 10$. To make the comparison with PCM(A) fair, we use the same values of $\epsilon$ and $\delta$ and always include the respective opponent in the target set of PCM(A). We also add an adaptive strategy with K = 10 to the target set of PCM(A), so that it needs to explore joint histories of size 10.

We use the 3-player Prisoner's Dilemma (PD) game as our representative matrix game. The game is a 3-player version of the N-player PD available in GAMUT (http://gamut.stanford.edu/userdoc.pdf). The adaptive opponent strategies we test against are as follows (a sketch of both appears at the end of this section):

1. Type 1: every other player plays defect if CMLeS played defect even once in the last 5 steps; otherwise, they play cooperate. The opponents are thus deterministic adaptive strategies with K = 5.

2. Type 2: every other player behaves as Type 1 with probability 0.5, and otherwise plays completely randomly. In this case, the opponents are stochastic with K = 5.

The total number of joint histories of size 10 in this setting is $8^{10}$, which makes PCM(A) highly inefficient. However, CMLeS quickly figures out the true K and converges to optimal behavior in a tractable number of steps.

Figure 2 shows our results against these two types of opponents. The Y-axis shows the payoff of each algorithm as a fraction of the optimal payoff achievable against the respective opponent. Each plot has been averaged over 30 runs to increase robustness. Against Type-1 opponents (Figure 2(i)), CMLeS figures out the true memory size in about 2000 steps and converges to playing near-optimally by 16000 episodes. Against Type-2 opponents (Figure 2(ii)), it takes a little longer to converge to playing near-optimally (about 30000 episodes), because in this case the number of feasible joint histories of size 5 is much larger. Both Awesome and PCM(A) perform much worse. PCM(A) plays a random exploration strategy until it has visited every possible joint history of size $K_{max}$, hence it keeps receiving a constant payoff during this whole exploration phase. Due to space constraints, we omit the results for convergence and safety; the convergence part of CMLeS uses the framework of Awesome, and those results are essentially identical.

Figure 2. Performance against adaptive opponents: payoff as a fraction of the optimal payoff versus episode for CMLeS, PCM(A), and Awesome, (i) against the trigger strategy and (ii) against the 50% random / 50% trigger strategy.
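For reference, the two opponent types can be written down compactly. The sketch below is our own rendering of their description (actions "C" and "D", a 5-step window over CMLeS's actions), not the authors' experimental code.

import random

def type1_action(cmles_actions):
    """Type 1 (deterministic, K = 5): defect if CMLeS defected in any of the last 5 steps."""
    return "D" if "D" in cmles_actions[-5:] else "C"

def type2_action(cmles_actions):
    """Type 2 (stochastic, K = 5): each step, with probability 0.5 behave as Type 1, else act randomly."""
    if random.random() < 0.5:
        return type1_action(cmles_actions)
    return random.choice(["C", "D"])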
6. Conclusion and Future Work

In this paper, we introduced a novel MAL algorithm, CMLeS, which in an arbitrary repeated game achieves convergence, targeted optimality against adaptive opponents, and safety. One key contribution of CMLeS is the manner in which it handles adaptive opponents: it requires only a loose upper bound on the opponent's memory size. Second, and more importantly, CMLeS improves upon the state-of-the-art algorithm by promising targeted optimality against adaptive opponents while requiring a sufficient number of visits only to all feasible joint histories of size K, where K is the opponent's memory size. At present, the guarantees of CMLeS hold only in self-play or when all other agents are adaptive. Our ongoing research agenda includes improving CMLeS to have better performance guarantees against arbitrary mixes of agents, i.e., some adaptive, some self-play, and the rest arbitrary.

Acknowledgments: This work took place at the Learning Agents Research Group (LARG) at UT Austin. LARG research is supported in part by grants from the National Science Foundation (IIS-0917122), ONR (N00014-09-1-0658), DARPA (FA8650-08-C-7812), and the Federal Highway Administration (DTFH61-07-H-00030).

References

Banerjee, Bikramjit and Peng, Jing. Performance bounded reinforcement learning in strategic interactions. In AAAI, pp. 2-7, 2004.

Bowling, Michael and Veloso, Manuela. Convergence of gradient dynamics with a variable learning rate. In ICML, pp. 27-34, 2001.

Brafman, Ronen I. and Tennenholtz, Moshe. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, pp. 213-231, 2003.

Buşoniu, L., Babuška, R., and De Schutter, B. A comprehensive survey of multi-agent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, pp. 156-172, 2008.

Chakraborty, Doran and Stone, Peter. Online multiagent learning against memory bounded adversaries. In ECML, pp. 211-226, 2008.

Conitzer, Vincent and Sandholm, Tuomas. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Journal of Machine Learning Research, pp. 23-43, 2006.

Powers, Rob and Shoham, Yoav. Learning against opponents with bounded memory. In IJCAI, pp. 817-822, 2005.

Powers, Rob, Shoham, Yoav, and Vu, Thuc. A general criterion and an algorithmic framework for learning in multi-agent systems. Machine Learning, pp. 45-76, 2007.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, 1998.