Learning When to Take Advice: A Statistical Test for Achieving A Correlated Equilibrium

Greg Hines, Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada, ggdhines@cs.uwaterloo.ca
Kate Larson, Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada, klarson@cs.uwaterloo.ca

Abstract

We study a multiagent learning problem where agents can either learn via repeated interactions, or can follow the advice of a mediator who suggests possible actions to take. We present an algorithm that each agent can use so that, with high probability, they can verify whether or not the mediator's advice is useful. In particular, if the mediator's advice is useful then agents will reach a correlated equilibrium, but if the mediator's advice is not useful, then agents are not harmed by using our test, and can fall back to their original learning algorithm. We then generalize our algorithm and show that in the limit it always correctly verifies the mediator's advice.

1 Introduction

In settings where agents repeatedly interact with each other (for example, through a repeated game), there are great opportunities for learning, since agents are able to adapt their strategies given the history of play. This problem has garnered a lot of attention from several research communities, including the AI community and the game theory community. While many criteria have been proposed for measuring the success of learning approaches, one commonly used measure is whether the agents learn how to best-respond to the strategies being played by the others; that is, whether the learning process converges to an equilibrium.

In this paper we study the problem of agents interacting with each other in a repeated game setting, but we introduce a third-party mediator or advisor who makes strategy suggestions to the agents. Ideally, by following the suggestions of the mediator, agents will be able to learn how to play against each other, possibly even reaching mutually beneficial outcomes which would not have been possible without the mediation. That is, our goal is for the agents to learn and adapt so that they find a correlated equilibrium [1].

However, a mediator is only useful if it can make good suggestions. Even if a mediator tries to make good suggestions, it may be prevented by coding errors, memory limitations, etc. For an agent to accept a mediator's suggestions, there must be some way for the agent to verify that the suggestions are reasonable. A mediator might not be willing to share its code with the agents, or be aware of its own limitations. Therefore, for a truly robust system, the agents themselves must have a way of checking the mediator's suggestions. Thus, this paper introduces a statistical test based on hypothesis testing that, with high probability, can verify the mediator's suggestions. While hypothesis testing has been proposed in the multiagent learning literature as a tool that agents might use to learn how to play Nash equilibria [5], to the best of our knowledge it has never been applied to validating a mediator's advice. Based on our test, we propose an algorithm that allows agents to converge to the mediator's suggestion if it is a correlated equilibrium and otherwise, in the limit, be no worse off for having used our algorithm. We then generalize this algorithm to a more theoretical setting where we show that, with probability one, in the limit, our test will always be able to correctly verify the mediator's suggestions. This provides a method for achieving convergence to a specific correlated equilibrium.
2 Background

In this section we introduce the key concepts and assumptions used in this paper.

An $n$-agent stage game is a tuple $G = \langle N, A = A_1 \times \dots \times A_n, u_1, \dots, u_n \rangle$, where $N = \{1, \dots, n\}$ is the set of agents, $A_i$ is the set of possible actions for agent $i$ and $A$ is the set of possible joint actions, and $u_i : A \to \mathbb{R}$ is the utility function for agent $i$. Without loss of generality, we assume that all utilities are greater than or equal to 0. A specific action for agent $i$ is $a_i \in A_i$, and a joint action is $a = (a_1, \dots, a_n)$. We assume that $A$ is public knowledge but the agents' utility functions are private.

Each agent chooses its actions according to some strategy. A strategy for agent $i$, $\sigma_i$, is a probability distribution over $A_i$, stating with what probability the agent will play each possible action. The set of all possible strategies for agent $i$ is $\Sigma_i$. The vector $\sigma = (\sigma_1, \dots, \sigma_n)$ is a strategy profile which specifies a strategy for each agent, and $\Sigma$ is the set of all possible strategy profiles. We use $\sigma_{-i}$ to denote $(\sigma_1, \dots, \sigma_{i-1}, \sigma_{i+1}, \dots, \sigma_n)$. Given a strategy profile $\sigma$, we define the expected utility for agent $i$ as

$u_i(\sigma) = \sum_{a = (a_1, \dots, a_n) \in A} u_i(a) \prod_{j=1}^{n} \sigma_j(a_j).$   (1)

Figure 1: A graphical representation of our setting with 2 agents at time $t$: the mediator sends private signals $s_1^t$ and $s_2^t$, the agents choose actions $a_1^t$ and $a_2^t$, and the game returns utilities $u_1^t$ and $u_2^t$.

In this paper, we are interested in a setting where agents have the ability to learn and adapt to the actions taken by others. Thus, we study repeated games. A repeated game $G_r = (G^1, G^2, \dots)$ is an infinite sequence of the stage game $G$ played repeatedly. Agent $i$'s action at time $t$ is $a_i^t$ and the joint action at time $t$ is $a^t$. The history of joint actions, $hist(t) = \{a^1, \dots, a^{t-1}\}$, is a record of the joint action taken at each iteration until time $t$. The empirical, or observed, percentage of play of joint actions, $\sigma_A^{hist(t)}$, is the percentage of time each joint action has been played as of time $t$. Agents may learn from previous iterations of the game to try and improve their strategy. Specifically, we assume that agent $i$ has a learning algorithm $L_i : hist(t) \to \Sigma_i$ that helps agent $i$ select a strategy for time $t$.

Each agent's utility is dependent not just on its own actions, but also on the actions taken by all other agents. We assume agents are rational, i.e., given $\sigma_{-i}$, agent $i$ will choose a strategy which maximizes its expected utility.

In our model we introduce a third-party mediator, $M$. The mediator knows the utility functions for all agents, but is not affected by the game's outcome. Instead, $M$ makes suggestions to each agent as to what action it should take, where these suggestions are instantiations of a correlated strategy.

Definition 1. A correlated strategy, $\sigma_A$, is a probability distribution over $A$. We let $s \sim \sigma_A$ denote an instantiation of $\sigma_A$. The conditional correlated strategy $\sigma_{A_{-i}}(s_{-i} \mid s_i)$ is the conditional probability of the joint signal $(s_i, s_{-i})$ given the signal $s_i$, and $\sigma_{A_{-i}}(s_i)$ is the set of all conditional probabilities given $s_i$.
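To make these definitions concrete, the following minimal Python sketch evaluates Equation 1 for a profile of independent strategies and computes a conditional correlated strategy as in Definition 1. The payoff matrices, the correlated strategy, and the function names below are our own illustrative choices, not values taken from the paper.

```python
import numpy as np

# A two-agent stage game: u[i][a1, a2] is agent i's utility for joint action (a1, a2).
# These payoff numbers are illustrative assumptions only.
u = [np.array([[3.0, 0.0],
               [0.0, 1.0]]),   # agent 1
     np.array([[1.0, 0.0],
               [0.0, 3.0]])]   # agent 2

def expected_utility(i, sigma):
    """Equation 1: u_i(sigma) = sum_a u_i(a) * prod_j sigma_j(a_j),
    for a profile of independent strategies sigma = (sigma_1, sigma_2)."""
    joint = np.outer(sigma[0], sigma[1])   # product distribution over joint actions
    return float(np.sum(u[i] * joint))

# A correlated strategy (Definition 1) is a single distribution over joint actions;
# rows index agent 1's action and columns index agent 2's action.
sigma_A = np.array([[0.4, 0.1],
                    [0.1, 0.4]])

def conditional(sigma_A, s1):
    """The conditional correlated strategy sigma_{A_{-1}}(s_2 | s_1): the distribution
    over agent 2's signal given that agent 1 received the signal s1."""
    row = sigma_A[s1]
    return row / row.sum()

print(expected_utility(0, [np.array([0.5, 0.5]), np.array([0.5, 0.5])]))  # 1.0
print(conditional(sigma_A, 0))                                            # [0.8 0.2]
```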
Note that $\sigma_i$ is a probability distribution over $A_i$, while $\sigma_A$ is a probability distribution over $A$. We assume that $M$'s correlated strategy is public knowledge, but the actual instantiation, $s$, is not. In particular, we assume that $M$ sends each agent $i$ a private signal, $s_i$, based on $s$. The agents are under no obligation to follow the mediator's signals. It is up to the mediator to pick a correlated strategy that a rational agent would be willing to follow. Note that our type of mediator is different than Monderer and Tennenholtz's, where agents must agree to follow the mediator's suggested actions before knowing what they are [11].

Definition 2. A correlated strategy $\sigma_A = \{\sigma_A(a) \mid a \in A\}$ is a correlated equilibrium if for every agent $i$ and every $s_i \in A_i$,

$\sum_{s_{-i} \in A_{-i}} \sigma_{A_{-i}}(s_{-i} \mid s_i)\, u_i(s_i, s_{-i}) \;\ge\; \sum_{s_{-i} \in A_{-i}} \sigma_{A_{-i}}(s_{-i} \mid s_i)\, u_i(a_i, s_{-i}),$   (2)

for all $a_i \in A_i$ [1].

The set of all correlated equilibria in $G$ is $C(G)$. If all of agent $i$'s opponents are following a correlated equilibrium $\sigma_A^*$, it is rational for agent $i$ to also follow $\sigma_A^*$.

Let $\sigma_A^t$ be the actual correlated strategy at time $t$, i.e., the one agents are actually using, which is not necessarily the one based on $M$'s suggestions. We say that $\sigma_A^t$ converges to a correlated equilibrium if for some $\sigma_A^* \in C(G)$, $\lim_{t\to\infty} \sigma_A^t = \sigma_A^*$. Thus, our algorithm is differentiated from algorithms that achieve convergence to the set of correlated equilibria, for example [4, 8].
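As a sketch of how the inequalities of Equation 2 can be checked for a finite two-agent game, consider the following Python fragment. The function itself is our own illustration; the payoff matrices and the two candidate correlated strategies in the usage lines are the ones used in the examples of Section 4.1 below.

```python
import numpy as np

def is_correlated_equilibrium(u, sigma_A, tol=1e-9):
    """Check Equation 2 for a two-agent game: for each agent i and each signal s_i
    with positive probability, following s_i must give at least as much expected
    utility, under the conditional distribution of the other agent's signal, as
    any deviation a_i."""
    n1, n2 = sigma_A.shape
    for s1 in range(n1):                          # agent 1, conditioning on its signal
        row = sigma_A[s1]
        if row.sum() > 0:
            cond = row / row.sum()                # sigma_{A_{-1}}(s_2 | s_1)
            follow = np.dot(cond, u[0][s1])
            if any(np.dot(cond, u[0][a1]) > follow + tol for a1 in range(n1)):
                return False
    for s2 in range(n2):                          # agent 2, conditioning on its signal
        col = sigma_A[:, s2]
        if col.sum() > 0:
            cond = col / col.sum()                # sigma_{A_{-2}}(s_1 | s_2)
            follow = np.dot(cond, u[1][:, s2])
            if any(np.dot(cond, u[1][:, a2]) > follow + tol for a2 in range(n2)):
                return False
    return True

# The game and the two announced strategies from the examples in Section 4.1:
u = [np.array([[0.0, 2.0], [5.0, 1.0]]),   # agent 1's payoffs
     np.array([[1.0, 5.0], [2.0, 0.0]])]   # agent 2's payoffs
print(is_correlated_equilibrium(u, np.array([[1, 5], [2, 10]]) / 18.0))   # True
print(is_correlated_equilibrium(u, np.array([[2, 10], [1, 5]]) / 18.0))   # False
```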
3 Setup

The setting for our paper is a repeated game $G_r$ with a mediator, $M$. As illustrated for the two-agent case in Figure 1, time $t$ begins with the mediator giving each agent $i$ a suggested action, $s_i^t$. Agents then simultaneously choose their actions, $a_i^t$, which may or may not be $s_i^t$. If agent $i$ chooses not to follow $M$'s signal, it can instead use a learning algorithm, $L_i$, which we assume is independent of $M$'s signals, to select an action. Based on the actual joint action, each agent then receives some utility and the process repeats. The mediator's signal to each agent is private information, known only to that agent and the mediator, as is the agent's utility function. However, the action set for each agent is public knowledge, as is the action taken by each agent during a turn.

The mediator's signals are based on a selected correlated strategy, $\sigma_A^M$, which is constant throughout the repeated game. Although ideally the mediator will suggest a correlated strategy that is also a correlated equilibrium, each agent still needs to verify that the mediator has actually done so. Our aim is to design an algorithm that achieves the following goals.

First goal: If $\sigma_A^M$ is a correlated equilibrium then $\sigma_A^t$, the actual correlated strategy (which is not necessarily $\sigma_A^M$), will converge to $\sigma_A^M$.

Second goal: If $\sigma_A^M$ is not a correlated equilibrium, agents should be no worse off, in the limit, for having used our algorithm.

In Section 4, we present an algorithm, $\Phi$, that achieves these goals with high probability. In Section 5, we generalize $\Phi$ so that, with probability one, in the limit, it will achieve both goals. Since each agent will be using $\Phi$ independently, we refer to $\Phi_i$ as the instance of the algorithm being run by agent $i$, and to $\Phi$ as the joint algorithm.

The algorithm is based on the concept of giving $M$ the benefit of the doubt; until there is reason to believe otherwise, agents assume that $\sigma_A^M$ is a correlated equilibrium and follow $M$'s signals. Specifically, agents will assume that the following conditions hold.

Condition 1: The correlated strategy $\sigma_A^M$ is a correlated equilibrium.

Condition 2: All other agents are following the signals based on $\sigma_A^M$.

Agents test whether these conditions hold during an initial period of play called a sampling test, which has a fixed length of $l_T$. If, at the beginning of the sampling test, agent $i$ decides that one of the conditions does not hold, it will not follow $M$'s signals and instead will use an individual "fall-back" strategy, $\sigma_i^f \in \Sigma_i$, chosen uniformly at random. At the end of the sampling test, all agents who still believe that both conditions hold will continue to follow $M$'s signals. All other agents will start using their original learning algorithm. The algorithm $\Phi_i$ is correct if and only if, at the end of the sampling test, it correctly determines whether both conditions hold. The joint algorithm, $\Phi$, is correct if and only if $\Phi_i$ is correct for all $i$.

4 The Initial Algorithm

In this section, we describe how our initial algorithm works. As a first step in $\Phi_i$, agent $i$ checks whether Equation 2 holds for all $s_i \in A_i$. If Equation 2 does not hold, agent $i$ knows that Condition 1 cannot be true. In this case, agent $i$ will use a fall-back strategy, $\sigma_i^f \in \Sigma_i$, picked uniformly at random, for the rest of the sampling test. If Equation 2 does hold, agent $i$ must check whether Condition 2 is true, and it continues to follow $M$'s signals throughout the sampling test.

Since the utilities for each agent, as well as the signals they receive each turn, are private, there may be no way to prove or disprove Condition 2 with absolute certainty at any finite point during the game. The best agent $i$ can do is reach a probabilistic conclusion. Since joint actions are public knowledge, agent $i$ can compare the empirical percentages of play for the duration of the sampling test against the percentages predicted by $\sigma_A^M$. If the difference between these two values is statistically significant, there is a high probability that at least one agent has stopped following the mediator's signals.

To test whether there is a difference, agent $i$ assumes there is some fixed but unknown correlated strategy $\tilde{\sigma}_A$ that all agents were actually using for the sampling test, where $\tilde{\sigma}_A$ may or may not be $\sigma_A^M$. We are now able to use hypothesis testing, where our null hypothesis is that $\tilde{\sigma}_A$ is equal to $\sigma_A^M$, i.e.,

$H_0: \tilde{\sigma}_A = \sigma_A^M,$   (3)

and our alternative hypothesis is that $\tilde{\sigma}_A$ is not equal to $\sigma_A^M$, i.e.,

$H_1: \tilde{\sigma}_A \ne \sigma_A^M.$   (4)

The test statistic used is Pearson's $\chi^2$ statistic,

$T = \sum_{a \in A'} \frac{(X(a) - E(a))^2}{E(a)},$   (5)

where $A'$ is any subset of $A$ such that $|A'| = |A| - 1$, $X(a) = l_T \sigma_A^{hist(l_T)}(a)$ is the actual frequency of play of $a \in A$ during the sampling test, $E(a) = l_T \sigma_A^M(a)$ is the expected frequency of play according to $\sigma_A^M$, and $l_T$ is the length of the sampling period [12]. Note that $\sigma_A^{hist(l_T)}$ is based on a sample from $\tilde{\sigma}_A$ of size $l_T$. For now we assume that $\sigma_A^M(a) > 0$ for all $a \in A$; we relax this assumption later.

Pearson's $\chi^2$ statistic has (in the limit) the probability distribution

$\chi^2_{df} + \chi^2_{1, NCP},$   (6)

where the first distribution has $df = |A| - 2$ degrees of freedom, and the second distribution has 1 degree of freedom and a non-centrality parameter of $NCP$ [9]. If $H_0$ is true, $NCP = 0$. Assuming that $H_0$ is true, we choose a significance level for rejection of the null hypothesis of $\alpha < 1$ and a corresponding critical value of $c(\alpha)$, i.e., we reject the null hypothesis when $T \ge c(\alpha)$. In this case, the probability of incorrectly rejecting $H_0$ (known as a Type 1 error) is $p_1 = \alpha$. If $H_1$ is actually true, we err when $T < c(\alpha)$ and we do not reject $H_0$ (a Type 2 error).

When $H_1$ is true, $NCP > 0$. Since the non-centrality parameter determines how much the probability distribution in Equation 6 gets adjusted, determining $NCP$ helps determine the probability of a Type 2 error. The equation for $NCP$ is $NCP = l_T \lambda$, where $\lambda$, the sensitivity parameter, is a measure of the difference between $\sigma_A^M$ and $\tilde{\sigma}_A$ given by

$\lambda(\sigma_A^M, \tilde{\sigma}_A) = \sum_{a \in A} \frac{(\tilde{\sigma}_A(a) - \sigma_A^M(a))^2}{\sigma_A^M(a)}.$   (7)

For a given value of $\lambda$, say $\hat{\lambda}$, if

$\lambda(\sigma_A^M, \tilde{\sigma}_A) \ge \hat{\lambda},$   (8)

then the probability of a Type 2 error is bounded by some value $\beta(\hat{\lambda}) < 1$, whose value is normally found via numerical computation [9]. Since $\beta$ is also a function of $l_T$ and $\alpha$, we refer to it as $\beta(l_T, \alpha, \hat{\lambda})$.
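A minimal sketch of the resulting test decision is given below. The code and naming are our own; for simplicity the statistic sums over all joint actions with positive probability under $\sigma_A^M$ rather than over a subset $A'$ of size $|A| - 1$, and the degrees of freedom are passed in as a parameter (the examples in Section 4.1 below use 3 degrees of freedom for a game with four joint actions).

```python
import numpy as np
from scipy.stats import chi2

def pearson_statistic(counts, sigma_M, l_T):
    """Equation 5: T = sum_a (X(a) - E(a))^2 / E(a), where X(a) is the observed count
    of joint action a during the sampling test and E(a) = l_T * sigma_M(a). Joint
    actions with sigma_M(a) = 0 are excluded, as discussed later in this section."""
    counts = np.asarray(counts, dtype=float)
    expected = l_T * np.asarray(sigma_M, dtype=float)
    keep = expected > 0
    return float(np.sum((counts[keep] - expected[keep]) ** 2 / expected[keep]))

def reject_null(counts, sigma_M, l_T, alpha, df):
    """Reject H0 (all agents were playing sigma_M) when T >= c(alpha)."""
    T = pearson_statistic(counts, sigma_M, l_T)
    c_alpha = chi2.ppf(1.0 - alpha, df)   # critical value c(alpha)
    return T >= c_alpha, T, c_alpha
```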
Since agents do not know whether their opponents are following the mediator's suggestions, agents do not know the exact value of $\tilde{\sigma}_A$, and therefore it is impossible to choose an appropriate value for $\hat{\lambda}$ so that Equation 8 is guaranteed to hold. Instead, agents can consider a different question: what is the worst-case situation under which Equation 8 does not hold?

To answer this question, consider the set of all agents for whom Equation 2 does not hold, $N_B \subseteq N$. Let $(\sigma_{A_{-N_B}}^M, \sigma_{N_B}^f)$ be the actual correlated strategy for the duration of the sampling test, i.e., a combination of those agents who will follow $M$'s signals and those who will rely on their fall-back strategies. Let $\Sigma_{N_B}$ be the set of all possible joint strategies for agents in $N_B$, and let

$\Sigma_{N_B}(\sigma_A^M, \hat{\lambda}) = \{\sigma_{N_B} \in \Sigma_{N_B} \mid \lambda(\sigma_A^M, (\sigma_{A_{-N_B}}^M, \sigma_{N_B})) < \hat{\lambda}\}$   (9)

be the set of all possible joint strategies for agents in $N_B$ which would result in Equation 8 not holding. Let $\mu(\Sigma_{N_B})$ and $\mu(\Sigma_{N_B}(\sigma_A^M, \hat{\lambda}))$ be the Lebesgue measures of $\Sigma_{N_B}$ and $\Sigma_{N_B}(\sigma_A^M, \hat{\lambda})$, respectively. Then, since $\sigma_i^f$ is chosen uniformly at random, the probability of $\sigma_{N_B}^f$ being in $\Sigma_{N_B}(\sigma_A^M, \hat{\lambda})$ is

$\epsilon(N_B) = \frac{\mu(\Sigma_{N_B}(\sigma_A^M, \hat{\lambda}))}{\mu(\Sigma_{N_B})}.$   (10)

Since agents do not know $N_B$, they consider the worst-case scenario,

$\epsilon = \max_{N' \subseteq N} \epsilon(N').$   (11)

If we assume that whenever Equation 8 does not hold and $\tilde{\sigma}_A \ne \sigma_A^M$ a Type 2 error is always made, then the probability of a Type 2 error is at most

$p_2 \le (1 - \epsilon) \cdot \beta(\hat{\lambda}) + \epsilon.$   (12)

That is, Equation 8 holds with a probability of at least $1 - \epsilon$, and when it does, the probability of a Type 2 error is at most $\beta(\hat{\lambda})$; with a probability of at most $\epsilon$, Equation 8 does not hold.

If we do not assume that $\sigma_A^M(a) > 0$ for all $a \in A$, then Equations 5 and 7 may contain division by zero. To deal with this, we ignore all $a \in A$ such that $\sigma_A^M(a) = 0$. If $\Omega = \{a \in A \mid \sigma_A^M(a) = 0\}$, then the summations in Equations 5 and 7 need to exclude all $a \in \Omega$, and $df$ in Equation 6 now equals $|A| - 2 - |\Omega|$. If the null hypothesis is correct, then $\sigma_A^M(a) = 0$ implies that $\sigma_A^{hist(l_T)}(a) = 0$ for all $a \in \Omega$. Alternatively, if there exists $a' \in \Omega$ such that $\sigma_A^{hist(l_T)}(a') > 0$ while $\sigma_A^M(a') = 0$, the alternative hypothesis must be correct. Hence, both of these cases do not present problems. The only other case is if, for all $a \in A$ such that $\sigma_A^M(a) = 0$, we observe $\sigma_A^{hist(l_T)}(a) = 0$ but, unknown to the agents, the alternative hypothesis is correct. In this case, a Type 2 error may occur. To find the probability of this case happening, we first determine the probability of $a^t \in \Omega$. Since any agent who rejects $M$'s suggested strategy chooses its new strategy uniformly at random, the probability, $P_\Delta$, that $a^t \in \Omega$ for a given $t \le l_T$ satisfies

$P_\Delta \ge \min_{N' \subseteq N} \sum_{a \in \Omega} \sigma_{A_{-N'}}^M(a_{-N'}) \frac{1}{|A_{N'}|},$   (13)

where the minimum over $N' \subseteq N$ is taken since agents do not know $N_B$. Therefore, the probability that $a^t \notin \Omega$ for all $t \le l_T$ is at most $(1 - P_\Delta)^{l_T}$, and the overall probability of a Type 2 error is at most

$p_2 \le (1 - P_\Delta)^{l_T} \left[ (1 - \epsilon) \cdot \beta + \epsilon \right].$   (14)

To accommodate the worst case, we assume equality holds in Equation 14. Note that $p_1$ has not changed. For simplicity, we assume that $p_1 = p_2 = p$, and refer to $p$ as the overall probability of error. It is possible to rearrange $\beta(l_T, \alpha, \hat{\lambda})$ to express $l_T$ as a function of $\alpha$, $\beta$ and $\hat{\lambda}$, i.e., $l_T(\alpha, \beta, \hat{\lambda})$. As a result, $l_T$ is the sample size needed to perform the test with at most a probability of error (of either Type 1 or Type 2) of $p$. If all agents are to use the same value for $l_T$, they must also have the same value for $\hat{\lambda}$. This in turn requires them to have the same value for $\epsilon$. To achieve this, in Equations 11 and 13, agent $i$ considers all possible $N' \subseteq N$, including those containing agent $i$ itself.
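The computation behind $l_T(\alpha, \beta, \hat{\lambda})$ can be sketched as follows. This is our own illustration under simplifying assumptions: the sum in Equation 6 amounts to a noncentral $\chi^2$ distribution whose total degrees of freedom are $(|A| - 2) + 1$ and whose noncentrality is $NCP = l_T \hat{\lambda}$, and we ignore the $(1 - P_\Delta)^{l_T}$ factor of Equation 14, which only shrinks the bound.

```python
from scipy.stats import chi2, ncx2

def beta_type2(l_T, alpha, lam_hat, df):
    """Approximate beta(l_T, alpha, lam_hat): the Type 2 error probability when
    lambda(sigma_M, sigma_tilde) >= lam_hat, using a noncentral chi-squared with
    noncentrality NCP = l_T * lam_hat (the limiting distribution of Equation 6)."""
    c_alpha = chi2.ppf(1.0 - alpha, df)
    return float(ncx2.cdf(c_alpha, df, l_T * lam_hat))

def sampling_length(alpha, lam_hat, beta_target, df, l_max=10**7):
    """Smallest l_T with beta(l_T, alpha, lam_hat) <= beta_target, found by binary
    search (beta is decreasing in l_T)."""
    lo, hi = 1, 2
    while beta_type2(hi, alpha, lam_hat, df) > beta_target:
        hi *= 2
        if hi > l_max:
            raise ValueError("no feasible l_T below l_max")
    while lo < hi:
        mid = (lo + hi) // 2
        if beta_type2(mid, alpha, lam_hat, df) <= beta_target:
            hi = mid
        else:
            lo = mid + 1
    return lo

# With the parameters of Example 1 below (alpha = 0.1, lam_hat = 0.01, beta = 0.0063,
# 3 degrees of freedom) this returns a sampling length in the same ballpark as the
# paper's l_T = 2100, which was read from Cohen's tables [2].
print(sampling_length(alpha=0.1, lam_hat=0.01, beta_target=0.0063, df=3))
```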
4.1 Examples

In this section we provide two examples to illustrate how our test would work.

Example 1: Consider the game in Figure 2.

                        Agent 2
                        a_{2,1}    a_{2,2}
Agent 1   a_{1,1}        0, 1       2, 5
          a_{1,2}        5, 2       1, 0

Figure 2: A simple game.

Let $A = \{(a_{1,1}, a_{2,1}), (a_{1,1}, a_{2,2}), (a_{1,2}, a_{2,1}), (a_{1,2}, a_{2,2})\}$. Suppose that $M$ announces the correlated strategy $\sigma_A^M = \{1/18, 5/18, 2/18, 10/18\}$. Note that $\sigma_A^M$ is a correlated equilibrium. Suppose the agents choose $p = 0.1$ and $\hat{\lambda} = 0.01$. Agents must now determine the critical value for rejection, $c(\alpha)$, and the length of the sampling test, $l_T$. Since $p_1 = \alpha$, we have $\alpha = 0.1$. For 3 degrees of freedom, $c(\alpha) = 6.25$. Since $\sigma_A^M(a) > 0$ for all $a$, we can calculate $\beta$ by Equation 12. We calculate Equation 11 by numerical computation to find $\epsilon \approx 0.09429$; therefore, $\beta = 0.0063$. In practice, $l_T(\alpha, \beta, \hat{\lambda})$ would now be solved by some method of numerical computation [9]. For simplicity, we used the tables in Cohen to obtain a value of $l_T = 2100$ [2]. Suppose that after 2100 iterations we have observed an empirical frequency of play $\sigma_A^{hist(2101)} = \{96, 601, 224, 1179\}$. Using Equation 5, we obtain a test statistic value of 4.678. Since this is lower than the critical value, both agents do not reject the null hypothesis and continue to use $M$'s signals.

Example 2: Consider a different example based on the same game, where $M$ announces a correlated strategy of $\sigma_A^M = \{2/18, 10/18, 1/18, 5/18\}$. In this case, $\sigma_A^M$ is not a correlated equilibrium. Specifically, while Equation 2 is satisfied for Agent 1, it is not satisfied for Agent 2. Hence, Agent 2 will use a random fall-back strategy. Suppose $\sigma_2^f = (3/4, 1/4)$. For this example, the length of the test has not changed. Suppose we find an empirical frequency of $\sigma_A^{hist(2101)} = \{1050, 350, 525, 175\}$ after 2100 turns. Since Agent 2 already knows that $\sigma_A^M$ is not a correlated equilibrium, it will not perform the test. Agent 1 will obtain a test statistic value of 5953.3. This is well above the critical value, and so Agent 1 will reject the null hypothesis, i.e., it will stop following the signals of the mediator.

Note that, as we have stated our algorithm, Agent 1 will only know that there is a probability of at most 0.1 of incorrectly rejecting the null hypothesis. We have not accounted for the fact that the test statistic value is much higher than the critical value. An additional test that could be run after the null hypothesis is rejected is the calculation of the p-value. The p-value is the smallest significance level $\alpha$ at which we would still reject the null hypothesis [12]. In the case of the above example, the p-value would be very small, and Agent 1 could be very certain that $\sigma_A^M$ is not a correlated equilibrium.
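These two examples can also be mimicked with a short simulation. This is our own illustration: the sampled counts will differ from the particular counts quoted above, and the statistic sums over all four joint actions for simplicity, but with high probability the conclusions are reproduced (no rejection in Example 1, rejection in Example 2).

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
l_T, alpha, df = 2100, 0.1, 3
c_alpha = chi2.ppf(1 - alpha, df)              # about 6.25

def statistic(counts, sigma_M):
    expected = l_T * sigma_M
    return float(np.sum((counts - expected) ** 2 / expected))

# Example 1: both agents follow the mediator, so joint actions are drawn from sigma_M.
# Order of joint actions: (a11,a21), (a11,a22), (a12,a21), (a12,a22).
sigma_M1 = np.array([1, 5, 2, 10]) / 18.0
counts1 = rng.multinomial(l_T, sigma_M1)
T1 = statistic(counts1, sigma_M1)
print(T1, "reject:", T1 >= c_alpha)            # usually below 6.25 -> do not reject

# Example 2: agent 1 follows its signals (marginal 12/18 on a_{1,1}), agent 2 uses the
# fall-back (3/4, 1/4), so the actual play is the product of these two marginals.
sigma_M2 = np.array([2, 10, 1, 5]) / 18.0
actual = np.outer([12 / 18, 6 / 18], [3 / 4, 1 / 4]).ravel()
counts2 = rng.multinomial(l_T, actual)
T2 = statistic(counts2, sigma_M2)
print(T2, "reject:", T2 >= c_alpha)            # in the thousands -> reject
```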
5 Repeated Testing

The limitation of our basic test is that there is always some positive probability of error. This is due to the need to pick values for $1 - p$ and $\hat{\lambda}$ that are both greater than 0. Since we can pick any such values for $1 - p$ and $\hat{\lambda}$, this is not much of a practical limitation; however, we may wish to achieve a stronger theoretical result. Our goal is to have agents converge to playing $\sigma_A^M$ if it is a correlated equilibrium. If $\sigma_A^M$ is not a correlated equilibrium, then the agents should be no worse off for having used our algorithm. This leads to the idea of repeated testing, where throughout the repeated game, agents will use multiple iterations of $\Phi_i$. The set of repeated sampling tests is $R = \{R_1, R_2, \dots\}$, where $R_j = \{b_{R_j}, l_{R_j}\}$, $b_{R_j}$ is the first time period in $R_j$, and $l_{R_j}$ is the length of $R_j$. The instance of $\Phi_i$ during test $R_j$ is denoted by $\Phi_i^{R_j}$. The repeated tests are not contiguous. A simple example is shown in Figure 3, where the timeline represents a repeated game up to 7 iterations and the grey areas represent sampling test iterations. For example, $R_2 = \{b_{R_2}, l_{R_2}\} = \{4, 2\}$, meaning that the second test iteration begins at time period 4 and lasts for 2 iterations of the repeated game.

Figure 3: An example of repeated testing.

The parameters $\hat{\lambda}$ and $p$ can be set to depend on the test iteration, i.e., $\hat{\lambda}(R_j)$ and $p(R_j)$. Each test period must be identical for each agent, i.e., $R_j$ must be the same for all agents. This means that $\hat{\lambda}(R_j)$ and $p(R_j)$ must be the same for all agents. The parameters are chosen such that

$\lim_{j\to\infty} \hat{\lambda}(R_j) = 0,$   (15)

$\sum_{j=1}^{\infty} p(R_j) < \infty.$   (16)

For example, we can let $\hat{\lambda}(R_j) = 1/j$ and $p(R_j) = 1/2^j$. Finally, we assume that each agent's fall-back strategy is fixed; that is, $\sigma_i^{f, R_j} = \sigma_i^{f, R_{j'}}$ for all $j, j'$.

Our first result is that an agent will not draw the wrong conclusion about the mediator too often.

Theorem 1. In the limit, with probability one, there will only be a finite number of tests where $\Phi^{R_j}$ is incorrect.

Proof. Let $\sigma_A^M$ be the correlated strategy suggested by $M$. Consider the following two cases.

$\sigma_A^M$ is a correlated equilibrium: For test $R_j$, the probability of $\Phi_i^{R_j}$ making a Type 1 error, $p_1(R_j)$, is equal to $p(R_j)$. By the Borel-Cantelli lemma (if $\{E^t\}$ is a sequence of independent events with $\sum_{t} P(E^t) < \infty$, then with probability one only a finite number of the events occur), with probability one, there will only be a finite number of times $\Phi_i^{R_j}$ is incorrect, i.e., makes a Type 1 error. This reasoning can be applied to all agents, and therefore with probability one there will only be a finite number of times $\Phi^{R_j}$ is incorrect.

$\sigma_A^M$ is not a correlated equilibrium: If $\sigma_A^M$ is not a correlated equilibrium, then some subset of agents, $N' \subseteq N$, will use their fall-back strategies instead of following the mediator's signals. The resulting correlated strategy for every test iteration will be $(\sigma_{A_{-N'}}^M, \sigma_{N'}^f)$. Since $\sigma_{N'}^f$ is fixed, by Equation 15 there exists a finite $j'$ such that for all $j \ge j'$,

$\lambda(\sigma_A^M, (\sigma_{A_{-N'}}^M, \sigma_{N'}^f)) \ge \hat{\lambda}(R_j).$   (17)
Let $\epsilon(R_j)$ be the value of $\epsilon$, according to Equation 11, during the sampling test $R_j$. Starting at $R_{j'}$, we know that, with probability one, Equation 8 holds, and therefore, since $\epsilon(R_j)$ is the probability of Equation 8 not holding, $\epsilon(R_j) = 0$ for all $j \ge j'$. Therefore, the probability of a Type 2 error starting at $R_{j'}$ is

$\sum_{j=j'}^{\infty} p_2(R_j) = \sum_{j=j'}^{\infty} (1 - P_\Delta)^{l_T} \beta.$   (18)

Note that $P_\Delta$, $l_T$ and $\beta$ are all functions of $R_j$; however, we omit the notation $(R_j)$ for clarity. Since $\beta$ is less than 1,

$\sum_{j=j'}^{\infty} p_2(R_j) \le \sum_{j=j'}^{\infty} (1 - P_\Delta)^{l_T} \left[ (1 - \epsilon) \cdot \beta + \epsilon \right]$   (19)

$= \sum_{j=j'}^{\infty} p(R_j),$   (20)

where $\epsilon$, as calculated by Equation 11, is also a function of $R_j$. Therefore, by Equation 16 and the Borel-Cantelli lemma, with probability one, there will only be a finite number of times $\Phi_i^{R_j}$ is incorrect, i.e., makes a Type 2 error. Again, this reasoning can be generalized to all agents, and therefore there will only be a finite number of times $\Phi^{R_j}$ is incorrect.

We now examine the behaviour of agents between sampling tests. The periods between test iterations are called free periods. The set of free periods is $F = \{F_1, \dots\}$, where $F_j = \{b_{F_j}, l_{F_j}\}$. Thus $G_r = \{R_1, F_1, R_2, F_2, \dots\}$. For example, in Figure 3, the first free period, $F_1$, would be $\{b_{F_1}, l_{F_1}\} = \{2, 2\}$. If $\Phi_i^{R_j}$ did not reject the null hypothesis, agent $i$ continues to follow $M$'s signals for all of $F_j$. If $\Phi_i^{R_j}$ did reject the null hypothesis, agent $i$ relies on its learning algorithm $L_i$ for $F_j$. We assume that $L_i$ is flexible at the beginning of each free period [3].

Definition 3. The learning algorithm $L_i$ is flexible if at the beginning of every free period $F_j$,

$L_i(hist(b_{F_j})) = L_i(hist(1)).$   (21)

Therefore, during each free period, $L_i$ does not base its actions on what has happened before time $b_{F_j}$. For example, $L_i$ may be a trigger strategy, but that trigger may not be based on anything that has happened in a previous sampling test or free period.

We require that

$\lim_{j\to\infty} \frac{\sum_{j'=1}^{j} l_{R_{j'}}}{\sum_{j'=1}^{j} l_{F_{j'}}} = 0,$   (22)

for example $l_{F_j} = l_{R_j}^2$. This means that, in the limit, the length of the sampling periods is negligible compared to the length of the free periods. We also require that

$\lim_{j\to\infty} \frac{l_{R_j}}{j} = \infty.$   (23)

This means that the length of the sampling tests grows at a faster than linear rate. The specific values for $l_{R_j}$ and $l_{F_j}$ would have to be agreed upon by all agents.
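One concrete schedule satisfying Equations 15, 16, 22 and 23 can be sketched as follows. The choices $\hat{\lambda}(R_j) = 1/j$, $p(R_j) = 1/2^j$ and $l_{F_j} = l_{R_j}^2$ are the examples given above; the choice $l_{R_j} = 100 j^2$ is our own illustrative assumption, and the code simply checks the required limits numerically over a finite prefix.

```python
# Schedule: lam_hat(R_j) = 1/j and p(R_j) = 1/2^j (Equations 15 and 16),
# l_R(j) = 100 * j^2 (our choice, so that l_R(j)/j -> infinity, Equation 23),
# l_F(j) = l_R(j)^2 (so the free periods dominate, Equation 22).
def lam_hat(j):
    return 1.0 / j

def p(j):
    return 0.5 ** j

def l_R(j):
    return 100 * j * j

def l_F(j):
    return l_R(j) ** 2

J = 200
sum_p = sum(p(j) for j in range(1, J + 1))   # partial sum of Equation 16
sum_R = sum(l_R(j) for j in range(1, J + 1))
sum_F = sum(l_F(j) for j in range(1, J + 1))
print(lam_hat(J))        # -> 0 as j grows (Equation 15)
print(sum_p)             # bounded by 1 (Equation 16)
print(sum_R / sum_F)     # -> 0: sampling time is negligible (Equation 22)
print(l_R(J) / J)        # -> infinity: tests grow super-linearly (Equation 23)
```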
Definition 4. Let $\sigma_A^{exp(t_1, t_2)}$ be the expected frequency of play from time $t_1$ to $t_2$, i.e., the expected number of times each joint action $a \in A$ gets played between times $t_1$ and $t_2$ inclusive. If $t_1$ is not given, we assume $t_1 = 1$. Similarly, let $\sigma_A^{exp(F_j, \dots, F_{j'})}$ be the expected frequency of play during the free periods $F_j$ through $F_{j'}$, inclusive. Since the frequency of play depends on the algorithms the agents are using, let $\sigma_A^{exp(t)}(L)$ be the expected frequency of play from time 1 to $t$ assuming that agents use the joint learning algorithm $L$ for the whole period.

For simplicity, in all of the following proofs we assume that $t$ always corresponds to the beginning of a sampling period. Let $j(t)$ be the index of the last free period before $t$.

The first step is to show that if $M$ suggests a correlated equilibrium, agents will converge to it.

Theorem 2. If the correlated strategy suggested by $M$, $\sigma_A^M$, is a correlated equilibrium, then with probability one,

$\lim_{t\to\infty} \sigma_A^t = \sigma_A^M.$   (24)

Proof. If $\sigma_A^M$ is a correlated equilibrium, then by Theorem 1, with probability one, after some finite point $\Phi$ will always correctly determine that $\sigma_A^M$ is a correlated equilibrium. As a result, with probability one, after some finite point, all agents will choose to follow the mediator's signals during the free periods. Since, by Equation 22, the sampling periods are negligible in the limit compared to the free periods, $\sigma_A^t$ converges to $\sigma_A^M$.

Our next result is a technical lemma which shows that, in the limit, agents are not harmed by taking time out to do the sampling tests.

Lemma 1. In the limit, there is no difference between the average utility from agents using $L$ for the whole repeated game and just for the free periods, i.e.,

$\lim_{t\to\infty} \left[ \frac{u_i(\sigma_A^{exp(t)}(L))}{t} - \frac{u_i(\sigma_A^{exp(F_1, \dots, F_{j(t)})}(L))}{t} \right] = 0.$   (25)

Furthermore, this is true even when excluding the first $j - 1$ free periods, for some $j > 1$, i.e.,

$\lim_{t\to\infty} \left[ \frac{u_i(\sigma_A^{exp(t)}(L))}{t} - \frac{u_i(\sigma_A^{exp(F_j, \dots, F_{j(t)})}(L))}{t} \right] = 0.$   (26)

The proof is given in the Appendix.

Finally, we need to show that if $\sigma_A^M$ is not a correlated equilibrium, agents are no worse off, on average, for having used $\Phi$.

Theorem 3. If the correlated strategy suggested by $M$, $\sigma_A^M$, is not a correlated equilibrium, then with probability one,

$\lim_{t\to\infty} \left[ \frac{u_i(\sigma_A^{exp(t)}(\Phi))}{t} - \frac{u_i(\sigma_A^{exp(t)}(L))}{t} \right] \ge 0.$   (27)

Therefore, in the limit, agent $i$ will be no worse off for using $\Phi$ instead of $L_i$.

Proof. If $\sigma_A^M$ is not a correlated equilibrium, by Theorem 1, with probability one, starting at some sampling test, say $R_{j'}$, $\Phi$ will always correctly determine that $\sigma_A^M$ is not a correlated equilibrium. Consider $\sigma_A$ with respect to some arbitrary $a \in A$, denoted by $\sigma_a$. We start by breaking the game down into the sequence of sampling tests and free periods; that is, $\sigma_a^{exp(t)}(\Phi) = \sigma_a^{exp(R_1, F_1, \dots, F_{j(t)})}(\Phi)$. For $t \ge t(j')$, the utility can be split up into the utility for the sampling tests and free periods before $R_{j'}$ and for those starting at $R_{j'}$, i.e.,

$\lim_{t\to\infty} \left[ \frac{u_i(\sigma_a^{exp(R_1, F_1, \dots, R_{j'-1}, F_{j'-1})}(\Phi))}{t} + \frac{u_i(\sigma_a^{exp(R_{j'}, F_{j'}, \dots, F_{j(t)})}(\Phi))}{t} \right].$

Since $\sigma_a^{exp(R_1, F_1, \dots, R_{j'-1}, F_{j'-1})}(\Phi)$ is constant, in the limit the first term is 0, and so we are interested in

$\lim_{t\to\infty} \frac{u_i(\sigma_a^{exp(R_{j'}, F_{j'}, \dots, F_{j(t)})}(\Phi))}{t}.$

The expected frequency can be split up into the expected frequency for the sampling periods and for the free periods.
Since $\Phi$ always determines that $\sigma_A^M$ is not a correlated equilibrium, during all the free periods agents will always use $L$, and so we are interested in

$\lim_{t\to\infty} \left[ \frac{u_i(\sigma_a^{exp(R_{j'}, \dots, R_{j(t)})}(\Phi))}{t} + \frac{u_i(\sigma_a^{exp(F_{j'}, \dots, F_{j(t)})}(L))}{t} \right].$

Since we assumed that all utilities are nonnegative, we may discard the first term, and thus have

$\lim_{t\to\infty} \frac{u_i(\sigma_a^{exp(F_{j'}, \dots, F_{j(t)})}(L))}{t}.$

Therefore, by Lemma 1, the theorem follows.

Together, Theorems 2 and 3 show that, with probability one, if $\sigma_A^M$ is a correlated equilibrium, agents will converge to it, and if $\sigma_A^M$ is not a correlated equilibrium, agents will be no worse off in the long run for using $\Phi$.

6 Conclusion

The setting for this paper was a repeated game with a mediator. The mediator makes suggestions to the agents as to what actions to take. We presented a test that agents could use so that, with high probability, they could determine if the mediator's suggestion was a correlated equilibrium. We then generalized our algorithm to incorporate repeated testing so that, in the limit, with probability one, the test will always correctly determine whether the mediator's suggested strategy is a correlated equilibrium. As a result, if the mediator suggests a correlated equilibrium, then agents will converge to it, and otherwise, they will be no worse off in the long run for having used our algorithm.

We envision several directions for future research. First, it might be possible to extend our algorithm to work in radically uncoupled environments, where agents are not aware of the existence of others. This would significantly decrease the knowledge requirements of our test. Second, we would like to extend our approach so that the mediator receives feedback from the agents themselves, which can be used to help select appropriate correlated strategies. We believe that the incentive issues in such an approach will be challenging. It may also be interesting to apply our approach to other solution concepts such as mediated equilibria [11]. In a more applied direction, it might be possible to generalize our approach so it can be used in a stochastic game setting. Thus, our approach could be combined with methods such as Q-learning [7]. Correlated equilibria have also been used in graphical games, which can be used to model many different settings [10]. Hence, applying our technique to graphical games may yield some interesting results. For example, network games use graphical games to help represent a variety of problems, from public good provision and trade to information collection [6]. These models can be hindered by a "fundamental theoretical problem: even the simplest games played on networks have multiple equilibrium[sic] which display a bewildering range of possible outcomes" [6]. Our model may help integrate correlated equilibria as a possible solution to this problem.

7 Acknowledgements

Our thanks to Gord Hines for his statistical advice.

References

[1] R. Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1:67-96, 1974.
[2] J. Cohen. Statistical Power Analysis for the Behavioral Sciences. 2nd edition, 1988.
[3] D. P. de Farias and N. Megiddo. Combining expert advice in reactive environments. Journal of the ACM, 53(5):762-799, 2006.
[4] D. P. Foster and R. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21:40-55, 1997.
[5] D. P. Foster and H. P. Young. Learning, hypothesis testing, and Nash equilibrium. Games and Economic Behavior, 45:73-96, 2003.
[6] A. Galeotti, S. Goyal, M. O. Jackson, F. Vega-Redondo, and L. Yariv. Network games. Unpublished, Jan 2006.
[7] A. Greenwald and K. Hall. Correlated Q-learning. In Proceedings of ICML-2003, pages 242-249, Washington, DC, USA, 2003.
[8] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127-1150, 2000.
[9] N. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions, volume 2. 1995.
[10] S. Kakade, M. Kearns, J. Langford, and L. Ortiz. Correlated equilibria in graphical games. In EC '03: Proceedings of the 4th ACM Conference on Electronic Commerce, pages 42-47, New York, NY, USA, 2003.
[11] D. Monderer and M. Tennenholtz. Strong mediated equilibrium. In Proceedings of the 21st American Association of Artificial Intelligence Conference, Boston, MA, USA, 2006.
[12] L. Wasserman. All of Statistics. Springer, 2004.
A Proof of Lemma 1

Proof. Consider $\sigma_A$ with respect to $a \in A$, denoted by $\sigma_a$. Since $j$ is fixed, $\sigma_a^{exp(F_1, \dots, F_{j-1})}(L)$ is constant, and therefore

$\lim_{t\to\infty} \frac{\sigma_a^{exp(F_1, \dots, F_{j-1})}(L)}{t} = 0,$   (28)

so Equations 25 and 26 are equivalent. Since the utility functions are linear transformations, proving the following is sufficient, although not necessary, to prove that Equation 25 holds:

$\lim_{t\to\infty} \left[ \frac{\sigma_a^{exp(t)}(L)}{t} - \frac{\sigma_a^{exp(F_1, \dots, F_{j(t)})}(L)}{t} \right] = 0.$   (29)

Since $L$ is flexible, it will, in expectation, always behave the same way during each free period. Specifically,

$\sigma_a^{exp(b_{F_j}, b_{F_j} + l_{F_j})}(L) = \sigma_a^{exp(b_{F_{j'}}, b_{F_{j'}} + l_{F_j})}(L),$   (30)

for all $j'$ such that $l_{F_{j'}} \ge l_{F_j}$. This relationship can be represented graphically, as shown in Figure 4, where for simplicity we let $\sigma_a^{w(j)} = \sigma_a^{exp(b_{F_j} + l_{F_{j-1}}, b_{F_j} + l_{F_j})}$, with $l_{F_0} = 0$.

Figure 4: A graphical representation of how the expected frequency of play is repeated in each free period.

Therefore,

$\sigma_a^{exp(F_1, \dots, F_{j(t)})}(L) = \sum_{j=1}^{j(t)} \left( j(t) - j + 1 \right) \sigma_a^{w(j)}(L).$

Note that $\sigma_a^{w(j)}$ will be "represented" more than $\sigma_a^{w(j')}$ for $j < j'$ and any finite $t$. In order for Equation 29 to hold, in the limit, all $\sigma_a^{w(j)}$ must be represented equally, i.e.,

$\lim_{t\to\infty} \frac{j(t) - j + 1}{t} = \lim_{t\to\infty} \frac{j(t) - j' + 1}{t},$   (31)

for all $j, j'$. Consider $t(j)$, the first time index after the $j$th free period has ended:

$t(j) = \sum_{j'=1}^{j} \left( l_{R_{j'}} + l_{F_{j'}} \right) \ge \sum_{j'=1}^{j} l_{R_{j'}}.$   (32)

By Equation 23, $\lim_{j\to\infty} t(j)/j = \infty$, and therefore

$\lim_{t\to\infty} \frac{j(t)}{t} = 0,$   (33)

so that, for any fixed $j$, $\lim_{t\to\infty} (j(t) - j + 1)/t = 0$ and Equation 31 holds. Therefore, in the limit, all $\sigma_a^{w(j)}$ will be represented equally. However, since $\sum_{j=1}^{j(t)} l_{F_j} < t$, each $\sigma_a^{w(j)}$ will be "underrepresented" compared to $\sigma_a^{exp(t)}(L)$ for any finite $t$. In the limit, this is not the case, since

$\lim_{t\to\infty} \frac{\sum_{j=1}^{j(t)} l_{F_j}}{t} = \lim_{t\to\infty} \frac{\sum_{j=1}^{j(t)} l_{F_j}}{\sum_{j=1}^{j(t)} \left( l_{R_j} + l_{F_j} \right)} = \lim_{t\to\infty} \frac{1}{\frac{\sum_{j=1}^{j(t)} l_{R_j}}{\sum_{j=1}^{j(t)} l_{F_j}} + 1} = 1$   (34)

by Equation 22. Therefore, in the limit, each $\sigma_a^{w(j)}$ is represented equally compared to $\sigma_a^{exp(t)}(L)$, Equation 29 holds, and the lemma follows.