Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria

Elad Hazan
IBM Almaden Research Center
650 Harry Road, San Jose, CA 95120
hazan@us.ibm.com

Satyen Kale
Computer Science Department, Princeton University
35 Olden St., Princeton, NJ 08540
satyen@cs.princeton.edu

Abstract

We study the relation between notions of game-theoretic equilibria which are based on stability under a set of deviations, and empirical equilibria which are reached by rational players. Rational players are modeled by players using no regret algorithms, which guarantee that their payoff in the long run is close to the maximum they could hope to achieve by consistently deviating from the algorithm's suggested action. We show that for a given set of deviations over the strategy set of a player, it is possible to efficiently approximate fixed points of a given deviation if and only if there exist efficient no regret algorithms resistant to the deviations. Further, we show that if all players use a no regret algorithm, then the empirical distribution of their plays converges to an equilibrium.

1 Introduction

We consider a setting where a number of agents need to repeatedly make decisions in the face of uncertainty. In each round, the agent obtains a payoff based on the decision she chose. Each agent would like to be able to maximize her payoff. While this might seem like a natural objective, it may be impossible to achieve without placing restrictions on the kind of payoffs that can arise. For instance, if the payoffs were adversarially chosen, then the agent's task would become essentially hopeless. In such a situation, one way for the agent to cope with the uncertainty is to aim for a relative benchmark rather than an absolute one. The notion of regret minimization captures this intuition. We imagine that the agent has a choice of several well-defined ways to change her decision, and the agent then aims to maximize her payoff relative to what she could have obtained had she changed her decisions in a consistent manner. As an example of what we mean by consistent changes, a possible objective could be to maximize her payoff relative to the most she could have achieved by choosing some fixed decision in all the rounds. The difference between these payoffs is known as external regret in the game theory literature. Another notion is that of internal regret, which arises when the possible ways to change are those that switch from some decision i to another decision j whenever the agent chose decision i, leaving all other decisions unchanged.

A learning algorithm for an agent is said to have no regret with respect to an associated set of decision modifiers (also called deviations) if the average payoff of an agent using the algorithm converges to the largest average payoff she would have achieved had she changed her decisions using a fixed decision modifier in all the rounds. Depending on which set of decision modifiers is under consideration, various no regret algorithms are known (e.g., Hannan [9] gave algorithms to minimize external regret, and Hart and Mas-Colell [10] gave algorithms to minimize internal regret).

The reason no regret algorithms are so appealing, apart from the fact that they model rational behavior of agents in the face of uncertainty, is that in various cases it can be shown that using no regret algorithms guides the overall play towards a game-theoretic equilibrium.
For example, Freund and Schapire [7] show that in a zero-sum game, if all agents use a no external regret algorithm, then the empirical distribution of the play converges to the set of minimax equilibria. Similarly, Hart and Mas-Colell [10] show that if all agents use a no internal regret algorithm, then the empirical distribution of the play converges to the set of correlated equilibria.

In general, given a set of decision modifiers Φ, we can define a notion of game theoretic equilibrium based on the property of being stable under the deviations specified by Φ. This is a joint distribution on the agents' decisions which ensures that the expected payoff to any agent is no less than the most she could achieve if she decided to unilaterally (and consistently) deviate from her suggested action using any decision modifier in Φ. One can then show that if all agents use a Φ-no-regret algorithm, then the empirical distribution of the play converges to the set of Φ-equilibria.

This brings us to the question of whether it is possible to design no regret algorithms for various sets of decision modifiers Φ. In this paper, we design algorithms which achieve no regret with respect to Φ in a very general setting of arbitrary convex compact decision spaces, arbitrary concave payoff functions, and arbitrary continuous decision modifiers. Our method works as long as it is possible to compute approximate fixed points of (convex combinations of) decision modifiers in Φ. Our algorithms are based on a connection to the framework of Online Convex Optimization (see, e.g., [17]), and we show how to apply known learning algorithms to obtain Φ-no-regret algorithms. The generality of our connection allows us to use various sophisticated Online Convex Optimization algorithms which can exploit structural properties of the utility functions and guarantee a faster rate of convergence to the equilibrium.

Previous work by Greenwald and Jafari [8] gave algorithms for the case where the decision space is the simplex of probability distributions over the agents' decisions, the payoff functions are linear, and the decision modifiers are also linear. Their algorithm, based on the work of Hart and Mas-Colell [10], uses a version of Blackwell's Approachability Theorem, and also needs to compute fixed points of the decision modifiers. Since these modifiers are linear, it is possible to compute fixed points for them by computing the stationary distribution of an appropriate stochastic matrix (say, by computing its top eigenvector); a small sketch of this easy case appears after the next paragraph. Computing Brouwer fixed points of continuous functions is, in general, a very hard problem (it is PPAD-complete, as shown by Papadimitriou [14]).

Fixed points are ubiquitous in game theory. Most common notions of equilibria in game theory are defined as the set of fixed points of a certain mapping. For example, Nash equilibria (NE) are the set of fixed points of the best response mapping (appropriately defined to avoid ambiguity). The fact that Brouwer fixed points are hard to compute in general is no reason why computing specific fixed points should be hard (for instance, as mentioned earlier, computing fixed points of linear functions is easy via eigenvector computations). More specifically, could it be the case that the NE, being a fixed point of some well-specified mapping, is easy to compute?
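To make the easy linear case mentioned above concrete, here is a minimal sketch (in Python/NumPy; the example matrix, tolerance, and function name are ours, purely for illustration and not part of the paper) of computing a fixed point of a linear decision modifier by power iteration, i.e. by finding a stationary distribution (top eigenvector) of the associated column-stochastic matrix.

```python
import numpy as np

def stationary_fixed_point(P, tol=1e-10, max_iter=10_000):
    """Fixed point of the linear modifier x -> P @ x on the simplex,
    i.e. a stationary distribution of the column-stochastic matrix P,
    found by power iteration (equivalently, its top eigenvector)."""
    n = P.shape[0]
    x = np.full(n, 1.0 / n)          # start from the uniform distribution
    for _ in range(max_iter):
        x_new = P @ x
        x_new /= x_new.sum()         # stay on the probability simplex
        if np.linalg.norm(x_new - x, 1) <= tol:
            break
        x = x_new
    return x_new

# Example: a 3-action column-stochastic modifier (illustrative numbers).
P = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.1],
              [0.2, 0.2, 0.6]])
x = stationary_fixed_point(P)
print(x, np.linalg.norm(P @ x - x))  # x is (approximately) a fixed point
```

For nonlinear modifiers no such eigenvector shortcut is available, which is exactly where the hardness results discussed next come into play.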
Any such hopes were dashed by the work of [6, 3], who showed that computing a NE is as computationally difficult as finding fixed points of a general mapping: they show that computing a NE of a two-player game is PPAD-complete. Further work showed that even computing an approximate NE is PPAD-complete [4].

Since our algorithms (and all previous ones as well) depend on computing (approximate) fixed points of various decision modifiers, the above discussion leads us to question whether this is necessary. We show in this paper that indeed it is: a Φ-no-regret algorithm can be efficiently used to compute approximate fixed points of any convex combination of decision modifiers. This establishes an equivalence theorem, which is the main contribution of this paper: there exist efficient Φ-no-regret algorithms if and only if it is possible to efficiently compute fixed points of convex combinations of decision modifiers in Φ.

This equivalence theorem allows us to translate complexity theoretic lower bounds on computing fixed points into lower bounds on designing no regret algorithms. For instance, a Nash equilibrium can be obtained by applying Brouwer's fixed point theorem to an appropriately defined continuous mapping from the compact convex set of pairs of the players' mixed strategies to itself. Thus, if Φ contains this mapping, then it is PPAD-hard to design Φ-no-regret algorithms.

It was recently brought to our attention that Stoltz and Lugosi [16], building on the work of Hart and Schmeidler [11], have also considered Φ-no-regret algorithms. They also show how to design them from fixed-point oracles, and they prove convergence to equilibria under even more general conditions than we consider. The focus of our results is on the computational aspect of such reductions, and on the equivalence of fixed-point computation and no-regret algorithms.

2 Preliminaries

2.1 Games and Equilibria

We consider the following kinds of games. First, the set of strategies for each player of the game is a convex compact set. Second, the utility functions of the players are concave over their strategy sets. To avoid cumbersome notation, we restrict ourselves to two player games, although all of our results naturally extend to multi-player games. Formally, for i = 1, 2, player i plays points from a convex compact set K_i ⊆ R^{n_i}. Her payoff is given by a function u_i : K_1 × K_2 → R, i.e. if (x_1, x_2) is the pair of strategies played by the two players, then the payoff to player i is u_i(x_1, x_2). We assume that u_1 is a concave function of x_1 for any fixed x_2, and similarly u_2 is a concave function of x_2 for any fixed x_1.

We now define a notion of game theoretic equilibrium based on the property of being stable with respect to consistent deviations. By this, we mean an online game-playing strategy for the players that guarantees that neither stands to gain if she decided to unilaterally, and consistently, deviate from her suggested moves. To model this, assume that each player i has a set of possible deviations Φ_i, which is a finite¹ set of continuous mappings φ_i : K_i → K_i. Let Φ = (Φ_1, Φ_2). Let Ψ be a joint distribution on K_1 × K_2. If it is the case that for any deviation φ_1 ∈ Φ_1, player 1's expected payoff obtained by sampling x_1 using Ψ is at least as large as her expected payoff obtained by deviating to φ_1(x_1), then we call Ψ stable under deviations in Φ_1. The distribution Ψ is said to be a Φ-equilibrium if Ψ is stable under deviations in Φ_1 and Φ_2. A similar definition appears in [11] and [16].

Definition 1 (Φ-equilibrium).
A joint distribution Ψ over K_1 × K_2 is called a Φ-equilibrium if the following holds for any φ_1 ∈ Φ_1 and for any φ_2 ∈ Φ_2:

∫ u_1(x_1, x_2) dΨ(x_1, x_2) ≥ ∫ u_1(φ_1(x_1), x_2) dΨ(x_1, x_2)
∫ u_2(x_1, x_2) dΨ(x_1, x_2) ≥ ∫ u_2(x_1, φ_2(x_2)) dΨ(x_1, x_2)

We say that Ψ is an ε-approximate Φ-equilibrium if the inequalities above are satisfied up to an additive error of ε.

Intuitively, we imagine a repeated game between the two players, where at equilibrium, the players' moves are correlated by a signal, which could be the past history of the play and various external factors. This signal samples a pair of moves from an equilibrium joint distribution over all pairs of moves, and suggests to each player individually only the move she is supposed to play. If no player stands to gain if she unilaterally, but consistently, used a deviation from her suggested move, then the distribution of the correlating signal is stable under the set of deviations, and is hence an equilibrium.

Example 1: Correlated Equilibria. A standard 2-player game is obtained when the K_i are the simplices of distributions over some base sets of actions A_i and the utility functions u_i are bilinear in (x_1, x_2). If the sets Φ_i consist of the maps φ_{a,b} : K_i → K_i for every pair a, b ∈ A_i, defined as

φ_{a,b}(x)[c] = 0 if c = a,  x_a + x_b if c = b,  x_c otherwise,   (1)

then it can be shown that any Φ-equilibrium can be equivalently viewed as a correlated equilibrium of the game, and vice versa.

¹It is highly plausible that the results in this paper extend to the case where Φ is infinite (indeed, our results hold for any set of mappings which is obtained by taking all convex combinations of finitely many mappings), but we restrict to finite Φ in this paper for simplicity.

Example 2: The Stock Market game. Consider the following setting: there are two investors (the generalization to many investors is straightforward), who invest their wealth in n stocks. In each period, they choose portfolios x_1 and x_2 over the n stocks, and observe the stock returns. We model the stock returns as a function r of the portfolios x_1, x_2 chosen by the investors, which maps the portfolios to the vector of stock returns. We make the assumption that each player has a small influence on the market, and thus the function r is insensitive to small perturbations in the input. The wealth gain for each investor i is r(x_1, x_2) · x_i. The standard way to measure performance of an investment strategy is the logarithmic growth rate, viz. log(r(x_1, x_2) · x_i). We can therefore define the utility functions as u_i(x_1, x_2) = log(r(x_1, x_2) · x_i). Intuitively, this game models the setting in which the market prices are affected by the investments of the players. A natural goal for a good investment strategy would be to compare the wealth gain to that of the best fixed portfolio, i.e. Φ_i is the set of all constant maps. This was considered by Cover in his Universal Portfolio framework [5]. Another possible goal would be to compare the wealth gained to that achievable by modifying the portfolios using the φ_{a,b} maps above, as considered by [15]. In Section 3, we show that the stock market game admits algorithms that converge to an ε-approximate Φ-equilibrium in O((1/ε) log(1/ε)) rounds, whereas all previous algorithms need O(1/ε²) rounds.

2.2 No regret algorithms

The online learning framework we consider is called Online Convex Optimization [17], in which there is a fixed convex compact feasible set K ⊆ R^n and an arbitrary, unknown sequence of concave payoff functions f^(1), f^(2), ... : K → R.
The decision maker must make a sequence of decisions, where the t-th decision is a selection of a point x^(t) ∈ K, and she obtains a payoff of f^(t)(x^(t)) in period t. The decision maker can only use the previous points x^(1), ..., x^(t-1) and the previous payoff functions f^(1), ..., f^(t-1) to choose the point x^(t). The performance measure we use to evaluate online algorithms is regret, defined as follows. The decision maker has a finite set Φ of N decision modifiers which, as before, is a set of continuous mappings from K to K. The regret for not using some deviation φ ∈ Φ is the excess payoff the decision maker could have obtained if she had changed her points in each round by applying φ.

Definition 2 (Φ-regret). Let Φ be a set of continuous functions from K to K. Given a sequence of T concave utility functions f^(1), ..., f^(T), define the Φ-regret as

Regret_Φ(T) = max_{φ∈Φ} ∑_{t=1}^T f^(t)(φ(x^(t))) − ∑_{t=1}^T f^(t)(x^(t)).

Two specific examples of Φ-regret deserve mention. The first one is "external regret", which is defined when Φ is the set of all constant mappings from K to itself. The second one is "internal regret", which is defined when K is the simplex of distributions over some base set of actions A, and Φ is the set of the φ_{a,b} functions (defined in (1)) for all pairs a, b ∈ A.

A desirable property of an algorithm for Online Convex Optimization is Hannan consistency: the regret, as a function of the number of rounds T, is sublinear. This implies that the average per-iteration payoff of the algorithm converges to the average payoff of a clairvoyant algorithm that uses the best deviation in hindsight to change the point in every round. For the purpose of this paper, we require a slightly stronger property of an algorithm, viz. that the regret is polynomially sublinear as a function of T.

Definition 3 (No Φ-regret algorithm). A no Φ-regret algorithm is one which, given any sequence of concave payoff functions f^(1), f^(2), ..., generates a sequence of points x^(1), x^(2), ... ∈ K such that for all T = 1, 2, ...,

Regret_Φ(T) = O(T^{1−c}) for some constant c > 0.

Such an algorithm is called efficient if it computes x^(t) in poly(n, N, t, L) time.

In the above definition, L is a description length parameter for K, defined appropriately depending on how the set K is represented. For instance, if K is the n-dimensional probability simplex, then L = n. If K is specified by means of a separation oracle and inner and outer radii r and R, then L = log(R/r), and we allow poly(n, N, t, L) calls to the separation oracle in each iteration.

The relatively new framework of Online Convex Optimization (OCO) has received much attention recently in the machine learning community. Our no Φ-regret algorithms can use any of a wide variety of algorithms for OCO. In this paper, we will use the Exponentiated Gradient (EG) algorithm ([13], [1]), which has the following (external) regret bound:

Theorem 1. Let the domain K be the simplex of distributions over a base set of size n. Let G_∞ be an upper bound on the L_∞ norm of the gradients of the payoff functions, i.e. G_∞ ≥ sup_{x∈K} ‖∇f^(t)(x)‖_∞. Then the EG algorithm generates points x^(1), ..., x^(T) such that

max_{x∈K} ∑_{t=1}^T f^(t)(x) − ∑_{t=1}^T f^(t)(x^(t)) ≤ O(G_∞ √(log(n) T)).

If the utility functions are strictly concave rather than linear, even stronger regret bounds, which depend on log(T) rather than √T, are known [12].
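For concreteness, the following is a minimal sketch of the Exponentiated Gradient update for maximizing concave payoffs over the simplex. The interface, step size, and example payoffs are our own illustrative choices (the precise schedule and constants in [13] and [1] differ), but the multiplicative update shown is the one whose external regret is bounded in Theorem 1.

```python
import numpy as np

def eg_maximize(grad_fns, n, T, G_inf=1.0):
    """Exponentiated Gradient (ascent) over the n-dimensional probability simplex.

    grad_fns[t](x) returns the gradient of the concave payoff f^(t) at x.
    The step size eta ~ sqrt(log n / T) / G_inf is the standard choice that
    yields O(G_inf * sqrt(log(n) * T)) external regret, as in Theorem 1."""
    eta = np.sqrt(np.log(n) / T) / G_inf
    x = np.full(n, 1.0 / n)          # start at the uniform distribution
    plays = []
    for t in range(T):
        plays.append(x.copy())       # play x^(t), then observe the payoff
        g = grad_fns[t](x)           # gradient of f^(t) at x^(t)
        w = x * np.exp(eta * g)      # multiplicative (exponentiated) update
        x = w / w.sum()              # renormalize back onto the simplex
    return plays

# Tiny usage example with linear payoffs f^(t)(x) = c_t . x:
rng = np.random.default_rng(0)
cs = [rng.uniform(0.0, 1.0, 5) for _ in range(1000)]
plays = eg_maximize([lambda x, c=c: c for c in cs], n=5, T=1000)
```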
While most of the literature on online convex optimization focuses on external regret, it has been observed that any Online Convex Optimization algorithm for external regret can be converted into an internal regret algorithm (see, for example, [2], [15]).

2.3 Fixed Points

As mentioned in the introduction, our no regret algorithms depend on computing fixed points of the relevant mappings. For a given set of deviations Φ = {φ_1, ..., φ_N}, denote by CH(Φ) the set of all convex combinations of deviations in Φ, i.e.

CH(Φ) = { ∑_{i=1}^N α_i φ_i : α_i ≥ 0 and ∑_{i=1}^N α_i = 1 }.

Since each map φ ∈ CH(Φ) is a continuous function from K to K, and K is a convex compact domain, by Brouwer's fixed point theorem, φ has a fixed point in K, i.e. there exists a point x ∈ K such that φ(x) = x. We consider algorithms which approximate fixed points of a given map in the following sense.

Definition 4 (FPTAS for fixed points of deviations). Let Φ be a set of N continuous functions from K to K. A fully polynomial time approximation scheme (FPTAS) for fixed points of Φ is an algorithm which, given any function φ ∈ CH(Φ) and an error parameter ε > 0, computes a point x ∈ K such that ‖φ(x) − x‖ ≤ ε in poly(n, N, L, 1/ε) time.

3 Convergence of no Φ-regret algorithms to Φ-equilibria

In this section we prove that if the players use no Φ-regret algorithms, then the empirical distribution of the moves converges to a Φ-equilibrium. [10] shows that if players use no internal regret algorithms, then the empirical distribution of the moves converges to a correlated equilibrium. This was generalized by [8] to any set of linear transformations Φ. The more general setting of this paper also follows easily from the definitions. A similar theorem was also proved in [16]. The advantage of this general setting is that the connection to online convex optimization allows for faster rates of convergence using recent online learning techniques. We give an example of a natural game theoretic setting with a faster convergence rate below.

Theorem 2. If each player i chooses moves using a no Φ_i-regret algorithm, then the empirical distribution of the players' moves converges to a Φ-equilibrium. Further, an ε-approximate Φ-equilibrium is reached after T iterations, for the first T which satisfies (1/T) Regret_{Φ_i}(T) ≤ ε for each player i.

Proof. Consider the first player. In each game iteration t, let (x_1^(t), x_2^(t)) be the pair of moves played by the two players. From player 1's point of view, the payoff function she obtains, f^(t), is the following:

∀x ∈ K_1: f^(t)(x) := u_1(x, x_2^(t)).

Note that this function is concave by assumption. Then we have, by the definition of Φ-regret (Definition 2),

Regret_{Φ_1}(T) = max_{φ∈Φ_1} ∑_{t=1}^T f^(t)(φ(x_1^(t))) − ∑_{t=1}^T f^(t)(x_1^(t)).

Rewriting this in terms of the original utility function, and scaling by the number of iterations, we get that for any φ ∈ Φ_1,

(1/T) ∑_{t=1}^T u_1(x_1^(t), x_2^(t)) ≥ (1/T) ∑_{t=1}^T u_1(φ(x_1^(t)), x_2^(t)) − (1/T) Regret_{Φ_1}(T).

Denote by Ψ^(T) the empirical distribution of the played strategies up to iteration T, i.e. the distribution which puts a probability mass of 1/T on each pair (x_1^(t), x_2^(t)) for t = 1, 2, ..., T. Then the above inequality can be rewritten as

∫ u_1(x_1, x_2) dΨ^(T)(x_1, x_2) ≥ ∫ u_1(φ(x_1), x_2) dΨ^(T)(x_1, x_2) − (1/T) Regret_{Φ_1}(T).

A similar inequality holds for player 2 as well. Now assume that both players use no regret algorithms, which ensure that Regret_{Φ_i}(T) ≤ O(T^{1−c}) for some constant c > 0. Hence as T → ∞, we have (1/T) Regret_{Φ_i}(T) → 0. Thus Ψ^(T) converges to a Φ-equilibrium. Also, Ψ^(T) is an ε-approximate Φ-equilibrium as soon as T is large enough so that (1/T) Regret_{Φ_1}(T) and (1/T) Regret_{Φ_2}(T) are both less than ε, i.e. T ≥ Ω(1/ε^{1/c}).
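The quantity controlled in the proof of Theorem 2 is easy to monitor directly. The sketch below (illustrative Python; the function and argument names are ours, not the paper's) computes, for the empirical distribution of the joint play, the largest average gain either player could obtain by applying a single deviation to all of her past moves. By the proof above, this gap is at most max_i (1/T) Regret_{Φ_i}(T), so it vanishes when both players use no Φ_i-regret algorithms.

```python
import numpy as np

def phi_equilibrium_gap(plays, u1, u2, Phi1, Phi2):
    """Gap of the empirical play distribution from being a Phi-equilibrium.

    plays: list of joint moves (x1, x2); u1, u2: payoff functions;
    Phi1, Phi2: finite lists of deviation maps K_i -> K_i.
    Returns eps such that the empirical distribution is an
    eps-approximate Phi-equilibrium (cf. Definition 1 and Theorem 2)."""
    gap = 0.0
    # Player 1: compare average payoff with payoff under each deviation phi.
    base1 = np.mean([u1(x1, x2) for x1, x2 in plays])
    for phi in Phi1:
        dev = np.mean([u1(phi(x1), x2) for x1, x2 in plays])
        gap = max(gap, dev - base1)
    # Player 2: the same check on the second coordinate.
    base2 = np.mean([u2(x1, x2) for x1, x2 in plays])
    for phi in Phi2:
        dev = np.mean([u2(x1, phi(x2)) for x1, x2 in plays])
        gap = max(gap, dev - base2)
    return gap
```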
A corollary of Theorem 2 is that we can obtain faster rates of convergence using recent online learning techniques when the payoff functions are non-linear. This is natural in many situations, since risk aversion is associated with the concavity of utility functions.

Corollary 3. For the stock market game as defined in Section 2.1, there exist no regret algorithms which guarantee convergence to an ε-approximate Φ-equilibrium in O((1/ε) log(1/ε)) iterations.

Proof sketch. The utility functions observed by investor i in the stock market game are of the form u_i(x_1, x_2) = log(r(x_1, x_2) · x_i). This logarithmic utility function is exp-concave, by the assumption on the insensitivity of the function r to small perturbations in the input. Thus the online algorithm of [5], or the more efficient algorithms of [12], can be applied. In the full version of this paper, we show that Lemma 6 can be modified to obtain algorithms with Regret_{Φ_i}(T) = O(log T). By Theorem 2 above, the investors reach an ε-approximate Φ-equilibrium in O((1/ε) log(1/ε)) iterations.

4 Computational Equivalence of Fixed Points and No Regret Algorithms

In this section we prove our main result on the computational equivalence of computing fixed points and designing no regret algorithms. By the result of the previous section, players using no regret algorithms converge to equilibria. We assume that the payoff functions f^(t) are scaled so that the L_2 norm of their gradients is bounded by 1, i.e. ‖∇f^(t)‖ ≤ 1, and that the diameter of K is bounded by 1. Our main theorem is the following:

Theorem 4. Let Φ be a given finite set of deviations. Then there is an FPTAS for fixed points of Φ if and only if there exists an efficient no Φ-regret algorithm.

The first direction of the theorem is proved by designing utility functions for which the no regret property implies convergence to an approximate fixed point of the corresponding transformations. The proof crucially depends on the fact that no regret algorithms have the stringent requirement that their worst case regret, against arbitrary adversarially chosen payoff functions, is sublinear as a function of the number of rounds.

Lemma 5. If there exists a no Φ-regret algorithm, then there exists an FPTAS for fixed points of Φ.

Proof. Let A be the given no Φ-regret algorithm, let φ_0 ∈ CH(Φ) be a given mapping whose fixed point we wish to compute, and let ε be a given error parameter. At iteration t, let x^(t) be the point chosen by A. If ‖φ_0(x^(t)) − x^(t)‖ ≤ ε, we can stop, because we have found an approximate fixed point. Else, supply A with the following payoff function:

f^(t)(x) := (φ_0(x^(t)) − x^(t))ᵀ (x − x^(t)) / ‖φ_0(x^(t)) − x^(t)‖.

This is a linear function, with ‖∇f^(t)(x)‖ = 1. Also, f^(t)(x^(t)) = 0, and f^(t)(φ_0(x^(t))) = ‖φ_0(x^(t)) − x^(t)‖ > ε. After T iterations, since φ_0 is a convex combination of functions in Φ, and since all the f^(t) are linear functions, we have

max_{φ∈Φ} ∑_{t=1}^T f^(t)(φ(x^(t))) ≥ ∑_{t=1}^T f^(t)(φ_0(x^(t))) ≥ εT.

Thus,

Regret_Φ(T) = max_{φ∈Φ} ∑_t f^(t)(φ(x^(t))) − ∑_t f^(t)(x^(t)) ≥ εT.   (2)

Since A is a no Φ-regret algorithm, assume that A ensures that Regret_Φ(T) = O(T^{1−c}) for some constant c > 0. Thus, when T = Ω(1/ε^{1/c}), the lower bound (2) on the regret cannot hold unless we have already found an ε-approximate fixed point of φ_0.

The second direction is along the lines of the algorithms of [2] and [15], which use fixed point computations to obtain no internal regret algorithms.

Lemma 6. If there is an FPTAS for fixed points of Φ, then there is an efficient no Φ-regret algorithm. In fact, the algorithm guarantees that Regret_Φ(T) = O(√T).²

Proof. We reduce the given OCO problem to an "inner" OCO problem.
The "outer" OCO problem is the original one. We use a no external regret algorithm for the inner OCO problem to generate points in K for the outer one, and use the payoff functions obtained in the outer OCO problem to generate appropriate payoff functions for the inner one.

Let Φ = {φ_1, φ_2, ..., φ_N}. The domain for the inner OCO problem is the simplex of all distributions on Φ, denoted Δ_N. For a distribution α ∈ Δ_N, let α_i be the probability mass assigned to φ_i in the distribution α. There is a natural mapping from Δ_N to CH(Φ): for any α ∈ Δ_N, denote by φ_α the function ∑_{i=1}^N α_i φ_i ∈ CH(Φ). Let x^(t) ∈ K be the point used in the outer OCO problem in the t-th round, and let f^(t) be the obtained payoff function. Then the payoff function for the inner OCO problem is the function g^(t) : Δ_N → R defined as follows:

∀α ∈ Δ_N: g^(t)(α) := f^(t)(φ_α(x^(t))).

We now apply the Exponentiated Gradient (EG) algorithm (see Section 2.2) to the inner OCO problem. To analyze the algorithm, we bound ‖∇g^(t)‖_∞ as follows. Let x_0 be an arbitrary point in K. We can rewrite g^(t) as g^(t)(α) = f^(t)(x_0 + ∑_i α_i (φ_i(x^(t)) − x_0)), because ∑_i α_i = 1. Then ∇g^(t)(α) = X^(t) ∇f^(t)(φ_α(x^(t))), where X^(t) is an N × n matrix whose i-th row is (φ_i(x^(t)) − x_0)ᵀ. Thus,

‖∇g^(t)‖_∞ = max_i |(φ_i(x^(t)) − x_0)ᵀ ∇f^(t)(φ_α(x^(t)))| ≤ max_i ‖φ_i(x^(t)) − x_0‖ ‖∇f^(t)(φ_α(x^(t)))‖ ≤ 1.

The last inequality follows because we assumed that the diameter of K is bounded by 1, and the norm of the gradient of f^(t) is also bounded by 1.

Let α^(t) be the distribution on Φ produced by the EG algorithm at time t. The point x^(t) is computed by running the FPTAS to compute a (1/√t)-approximate fixed point of the function φ_{α^(t)}, i.e. we have

‖φ_{α^(t)}(x^(t)) − x^(t)‖ ≤ 1/√t.

Now, using the definition of the g^(t) functions, and by the regret bound for the EG algorithm, we have that for any fixed distribution α ∈ Δ_N,

∑_{t=1}^T f^(t)(φ_α(x^(t))) − ∑_{t=1}^T f^(t)(φ_{α^(t)}(x^(t))) = ∑_{t=1}^T g^(t)(α) − ∑_{t=1}^T g^(t)(α^(t)) ≤ O(√(log(N) T)).   (3)

²In the full version of the paper, we improve the regret bound to O(log T) under some stronger concavity assumptions on the payoff functions.

Since ‖∇f^(t)‖ ≤ 1,

f^(t)(φ_{α^(t)}(x^(t))) − f^(t)(x^(t)) ≤ ‖φ_{α^(t)}(x^(t)) − x^(t)‖ ≤ 1/√t.   (4)

Summing (4) from t = 1 to T, and adding to (3), we get that for any distribution α over Φ,

∑_{t=1}^T f^(t)(φ_α(x^(t))) − ∑_{t=1}^T f^(t)(x^(t)) ≤ O(√(log(N) T)) + ∑_{t=1}^T 1/√t = O(√(log(N) T)).

In particular, by concentrating α on any given φ_i ∈ Φ, the above inequality implies that ∑_{t=1}^T f^(t)(φ_i(x^(t))) − ∑_{t=1}^T f^(t)(x^(t)) ≤ O(√(log(N) T)), and thus we have a no Φ-regret algorithm.

References

[1] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta algorithm and applications. Manuscript, 2005.
[2] A. Blum and Y. Mansour. From external to internal regret. In COLT, pages 621–636, 2005.
[3] X. Chen and X. Deng. Settling the complexity of two-player Nash equilibrium. In 47th FOCS, pages 261–272, 2006.
[4] X. Chen, X. Deng, and S.-H. Teng. Computing Nash equilibria: Approximation and smoothed complexity. In 47th FOCS, pages 603–612, 2006.
[5] T. Cover. Universal portfolios. Math. Finance, 1:1–19, 1991.
[6] C. Daskalakis, P. W. Goldberg, and C. H. Papadimitriou. The complexity of computing a Nash equilibrium. In 38th STOC, pages 71–78, 2006.
[7] Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.
[8] A. Greenwald and A. Jafari. A general class of no-regret learning algorithms and game-theoretic equilibria, 2003.
[9] J. Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. W. Tucker, and P.
Wolfe, editors, Contributions to the Theory of Games, volume III, pages 97–139, 1957.
[10] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
[11] S. Hart and D. Schmeidler. Existence of correlated equilibria. Mathematics of Operations Research, 14(1):18–25, 1989.
[12] E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In 19th COLT, 2006.
[13] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput., 132(1):1–63, 1997.
[14] C. H. Papadimitriou. On the complexity of the parity argument and other inefficient proofs of existence. J. Comput. Syst. Sci., 48(3):498–532, 1994.
[15] G. Stoltz and G. Lugosi. Internal regret in on-line portfolio selection. Machine Learning, 59:125–159, 2005.
[16] G. Stoltz and G. Lugosi. Learning correlated equilibria in games with compact sets of strategies. Games and Economic Behavior, 59:187–208, 2007.
[17] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In 20th ICML, pages 928–936, 2003.