More Efficient Internal-Regret-Minimizing Algorithms

Amy Greenwald, Zheng Li, and Warren Schudy
Brown University, Providence, RI 02912
{amy,ws}@cs.brown.edu and zheng@dam.brown.edu

Abstract

Standard no-internal-regret (NIR) algorithms compute a fixed point of a matrix, and hence typically require O(n^3) run time per round of learning, where n is the dimensionality of the matrix. The main contribution of this paper is a novel NIR algorithm, which is a simple and straightforward variant of a standard NIR algorithm. However, rather than compute a fixed point every round, our algorithm relies on power iteration to estimate a fixed point, and hence runs in O(n^2) time per round. Nonetheless, it is not enough to look only at the per-round run time of an online learning algorithm. One must also consider the algorithm's convergence rate. It turns out that the convergence rate of the aforementioned algorithm is slower than desired. This observation motivates our second contribution, which is an analysis of a multithreaded NIR algorithm that trades off between its run time per round of learning and its convergence rate.

1 Introduction

An online decision problem (ODP) consists of a series of rounds, during each of which an agent chooses one of n pure actions and receives a reward corresponding to its choice. The agent's objective is to maximize its cumulative rewards. It can work towards this goal by abiding by an online learning algorithm, which prescribes a possibly mixed action (i.e., a probability distribution over the set of pure actions) to play each round, based on past actions and their corresponding rewards. The success of such an algorithm is typically measured in a worst-case fashion: specifically, an adversary chooses the sequence of rewards that the agent faces. Hence, the agent--the protagonist--must randomize its play; otherwise, it can easily be exploited by the adversary.

The observation that an ODP of this nature can be used to model a single player's perspective in a repeated game has spawned a growing literature connecting computational learning theory--specifically, the subarea of regret minimization--and game theory--specifically, the subarea of learning in repeated games. Both groups of researchers are interested in designing algorithms by which an agent can learn from its past actions, and the rewards associated with those actions, to play actions now and in the future that yield high rewards. More specifically, the entire sequence of actions should yield low regret for not having played otherwise, or, equivalently, near-equilibrium behavior.

In a seminal paper by Foster and Vohra [FV97], it was established that the empirical distribution of the joint play of a particular class of online learning algorithms, called no-internal-regret (NIR) learners, converges to the set of correlated equilibria in repeated matrix games. However, standard NIR learning algorithms (see Cesa-Bianchi and Lugosi [CBL06] and Blum and Mansour [BM05])[1]--including the algorithm proposed by Foster and Vohra (hereafter, FV)--involve a fixed point calculation during each round of learning, an operation that is cubic[2] in the number of pure actions available to the player. Knowing that fixed point calculations are expensive, Hart and Mas-Colell [HMC00] describe "a simple adaptive procedure" (hereafter, HM) that also achieves the aforementioned convergence result. HM's per-round run time is linear in the number of pure actions.
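To make the cubic cost concrete, here is a minimal sketch (our own illustration, not code from any of the papers cited above) of the per-round fixed-point step: given a row-stochastic regret matrix P, find a distribution q with q = q P by solving a linear system, which costs O(n^3) with a dense solve. The matrix P below is an arbitrary stand-in for the regret matrices defined later in the paper.

```python
# A minimal sketch (ours) of the fixed-point step performed each round by
# standard NIR algorithms: solve q = q P, sum(q) = 1, for a row-stochastic
# matrix P.  A dense linear solve like this is O(n^3), the per-round cost
# quoted in the text; the example P is a stand-in for a real regret matrix.
import numpy as np

def fixed_point(P: np.ndarray) -> np.ndarray:
    """Return a row vector q with q P = q and sum(q) = 1."""
    n = P.shape[0]
    # Stack the n equations q (P - I) = 0 with the normalization sum(q) = 1
    # and solve in the least-squares sense, which keeps this sketch robust.
    A = np.vstack([(P - np.eye(n)).T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    q, *_ = np.linalg.lstsq(A, b, rcond=None)
    return q

if __name__ == "__main__":
    P = np.array([[0.5, 0.5, 0.0],
                  [0.2, 0.6, 0.2],
                  [0.0, 0.4, 0.6]])
    q = fixed_point(P)
    print(q)        # stationary distribution of P
    print(q @ P)    # equals q up to numerical error
```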
It is well known [HMC00] that HM does not exhibit no internal regret in the usual sense, meaning against an adaptive adversary--one that can adapt in response to the protagonist's "realized" pure actions (i.e., those that result from sampling his mixed actions). Still, in a recent paper, Cahn [Cah04] has shown that HM's algorithm does exhibit no internal regret against an adversary that is "not too sophisticated." In this paper, we use the terminology nearly oblivious to refer to this type of adversary, because the "not-too-sophisticated" condition is a weakening of the usual notion of an oblivious adversary--one who chooses the sequence of rewards after the protagonist chooses its online learning algorithm, but before the protagonist realizes any of its pure actions. Since an oblivious adversary is also nearly oblivious, Cahn's result implies that HM exhibits no internal regret against an oblivious adversary.

[Footnote 1: The former reference is to a book that surveys the field; the latter reference is to a paper that includes a black-box method for constructing NIR learners from another class of learners called no-external-regret learners.]

[Footnote 2: Strassen [Str69] devised an O(n^2.81)-time algorithm for matrix-matrix multiplication, based on which a fixed point can be computed with the same run time [CLRS01]. Coppersmith and Winograd [CW87] devised an O(n^2.36)-time algorithm for matrix-matrix multiplication, but unlike Strassen's result their result is impractical. For better pedagogy, we quote the "natural" O(n^3) run time in most of our discussions rather than these better bounds.]

As alluded to above, both FV and HM (and all the algorithms studied in this paper) learn a mixed action each round, and then play a pure action: i.e., a sample from that mixed action. One important difference between them, however, which can be viewed at least as a partial explanation of their varying strengths, is that FV maintains as its state the mixed action it learns, whereas HM maintains as its state the pure action it plays. Intuitively, the latter cannot exhibit no internal regret against an adaptive adversary because an adaptive adversary can exploit any dependencies between the consecutively sampled pure actions. Young [You04] proposes, but does not analyze rigorously, a variant of HM he calls Incremental Conditional Regret Matching (ICRM), which keeps track of a mixed action instead of a pure action, and hence exhibits no internal regret against an adaptive adversary.[3] ICRM has quadratic run time each round.

To motivate ICRM, recall that standard NIR algorithms involve a fixed-point calculation. Specifically, they rely on solutions to equations of the form q = q P_t, where P_t is a stochastic matrix that encodes the learner's regrets for its actions through time t. Rather than solve this equation exactly, ICRM takes q_{t+1} = q_t P_t as an iterative approximation of the desired fixed point.

The regret matrix P_t used in ICRM (and HM) depends on a parameter µ that is strictly larger than the maximum regret per round. This makes ICRM less intuitive than it could be. We show that the same idea also works when the normalizing factor µt is replaced by the actual total regret experienced by the learner. This simplifies the algorithm and eliminates the need for the learner to know or estimate a bound on the rewards. We call our algorithm Power Iteration (PI),[4] because another, more intuitive way to view it is as a modification of a standard NIR algorithm (e.g., Greenwald, et al.
[GJMar]) with its fixed-point calculation replaced by power iteration. Once again, the first (and primary) contribution of this paper is a proof that using power iteration to estimate a fixed point, which costs only O(n^2) per round, suffices to achieve no internal regret against an adaptive adversary. Although our PI algorithm is intuitive, the proof that the idea pans out--that PI exhibits NIR against an adaptive adversary--is non-trivial (which may be why Young did not propose this algorithm in the first place).

[Footnote 3: Our analytical tools can be used to establish Young's claim rigorously.]

[Footnote 4: Both PI and ICRM can be construed both as incremental conditional regret matching algorithms and as power iteration methods. The difference between these algorithms is merely the definition of the matrix P_t, and who named them, not what they are named for per se.]

The proof in Hart and Mas-Colell [HMC00] relies on a technical lemma, which states that ||q_t P_t^z - q_t P_t^{z-1}||_1, for some z > 0, is small whenever all the entries on the main diagonal of P_t are at least some uniform constant. With our new definition of P_t, this condition does not hold. Instead, our result relies on a generalization of this lemma in which we pose weaker conditions that guarantee the same conclusion. Specifically, we require only that the trace of P_t be at least n - 1. Our lemma may be of independent interest.

Hence, we have succeeded at defining a simple and intuitive, O(n^2) per-round online learning algorithm that achieves no internal regret against an adaptive adversary. However, it is not enough to look only at the per-round run time of an online learning algorithm. One must also consider the algorithm's convergence rate. It turns out that the convergence rates of PI, ICRM, and HM are all slower than desired (their regret bounds are O(√n t^{-1/10})), whereas FV's regret bound is O(√(n/t)) (see, for example, Greenwald, et al. [GLM06]). This observation motivates our second algorithm.

As our second contribution, we analyze an alternative algorithm, one which is multithreaded. Again, the basic idea is straightforward: one thread plays the game, taking as its mixed action the most-recently computed fixed point, while the other thread computes a new fixed point. Whenever a new fixed point becomes available, the first thread updates its mixed action accordingly. This second algorithm, which we call MT, for multithreaded, exhibits a trade-off between its run time per round and its convergence rate. If p is an upper bound on the number of rounds it takes to compute a fixed point, MT's regret is bounded by O(√(np/t)). Observe that this regret bound is a function of t/p, the number of fixed points computed so far. If p is small, so that many fixed points have been computed so far, then the run time per round is high, but the regret is low; on the other hand, if p is large, so that only very few fixed points have been computed so far, then the run time per round is low, but the regret is high.

This paper is organized as follows. In Section 2, we define online decision problems and no-regret learning precisely. In Section 3, we define the HM, ICRM, and PI algorithms, and report their regret bounds. In Section 4, we introduce our second algorithm, MT, and report its regret bound. In Section 5, we prove a straightforward lemma that we use in the analysis of all algorithms. In Section 6, we analyze MT. In Section 7, we analyze PI. In Section 8, we present some preliminary simulation experiments involving PI, HM, and MT.
In Section 9, we describe some interesting future directions.

2 Formalism

An online decision problem (ODP) is parameterized by a reward system (A, R), where A is a set of pure actions and R is a set of rewards. Given a reward system (A, R), we let R^A denote the set of possible reward vectors.

Definition 1 Given a reward system (A, R), an online decision problem can be described by a sequence of reward functions ⟨π_t⟩_{t=1}^∞, where π_t : A^{t-1} → R^A.

Given an ODP ⟨π_t⟩_{t=1}^∞, the particular history H_t = (⟨a_τ⟩_{τ=1}^t, ⟨π_τ⟩_{τ=1}^t) corresponds to the agent playing a_τ and observing reward vector π_τ(a_1, . . . , a_{τ-1}) at all times τ = 1, . . . , t.

In this paper, we restrict our attention to bounded, real-valued reward systems; as such, we assume WLOG that R = [0, 1]. We also assume the agent's pure action set is finite; specifically, we let |A| = n. Still, we allow agents to play mixed actions. That is, an agent can play a probability distribution over its pure actions. We denote by Δ(A) the set of mixed actions: i.e., the set of all probability distributions over A.

An online learning algorithm is a sequence of functions ⟨q_t⟩_{t=1}^∞, where q_t : H_{t-1} → Δ(A), so that q_t(h) ∈ Δ(A) represents the agent's mixed action at time t ≥ 1, after having observed history h ∈ H_{t-1}. When the history h is clear from context, we abbreviate q_t(h) by q_t. For a given history of length t, let q̂_t be the degenerate probability distribution corresponding to the action actually played at time t: i.e., for all 1 ≤ i ≤ n, (q̂_t)_i = 1(a_t = i).[5] Clearly, q̂_t is a random variable.

[Footnote 5: For predicate p, 1(p) = 1 if p holds, and 0 otherwise.]

We are interested in measuring an agent's regret in an ODP for playing as prescribed by some online learning algorithm rather than playing otherwise. We parameterize this notion of "otherwise" by considering a variety of other ways that the agent could have played. For example, it could have played any single action a' all the time; or, it could have played a' every time it actually played a. In either case, we arrive at an alternative sequence of play by applying some transformation to each action in the agent's actual sequence of play, and then we measure the difference in rewards obtained by the two sequences, in the worst case. That is the agent's regret.

A transformation of the sort used in the first example above--a constant transformation that maps every action a in the actual sequence of play to a fixed, alternative action a'--is called an external transformation. We denote by Φ_EXT the set of all external transformations, one per action in A. Many efficient algorithms, with both fast run time per round and fast convergence rates, are known to minimize regret with respect to Φ_EXT (e.g., [LW94, FS97, HMC01]). Here, we are interested in transformations of the second type, which are called internal transformations. These transformations can be described by the following set of n-dimensional matrices:

Φ_INT = { φ^{(a,b)} : a ≠ b, 1 ≤ a, b ≤ n }

where

(φ^{(a,b)})_{ij} = 1 if i ≠ a and i = j; 1 if i = a and j = b; 0 otherwise.

Φ_EXT and Φ_INT are the two best-known examples of transformation sets. More generally, a transformation can be any linear function from Δ(A) to Δ(A). In the definitions that follow, we express reward vectors as column vectors, mixed actions q as row vectors, and transformations φ as n-dimensional matrices.
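The following short sketch (ours, not from the paper) builds the matrices in Φ_INT explicitly and shows the effect of one of them on a mixed action; note that each φ^{(a,b)} has trace n - 1, a fact used in Section 7.

```python
# A small illustration (ours) of the internal transformations defined above.
# phi^(a,b) is the identity matrix with row a replaced by the indicator of b,
# so applying it to a mixed action (a row vector) moves all of action a's
# probability onto action b.  Each such matrix has trace n - 1.
import numpy as np

def internal_transformation(n: int, a: int, b: int) -> np.ndarray:
    """Return phi^(a,b): (phi)_{ij} = 1 if i != a and i == j,
    1 if i == a and j == b, and 0 otherwise (0-indexed here)."""
    phi = np.eye(n)
    phi[a, a] = 0.0
    phi[a, b] = 1.0
    return phi

if __name__ == "__main__":
    n = 4
    phi_23 = internal_transformation(n, a=1, b=2)  # 0-indexed version of phi^(2,3)
    q = np.array([0.1, 0.4, 0.3, 0.2])             # a mixed action (row vector)
    print(q @ phi_23)                              # -> [0.1, 0.0, 0.7, 0.2]
    print(np.trace(phi_23))                        # -> 3.0, i.e., n - 1
```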
If, at time τ, an agent plays mixed action q_τ in an ODP with reward vector π_τ, the agent's instantaneous regret (r_τ)_φ with respect to a transformation φ is the difference between the rewards it could have obtained by playing the transformed action q_τ φ and the rewards it actually obtained by playing q_τ: i.e.,

(r_τ)_φ = q_τ φ π_τ - q_τ π_τ    (1)

For example, if |A| = 4, then applying the following transformation to a pure action a yields the third action if a is the second action, and a otherwise:

φ^{(2,3)} =
1 0 0 0
0 0 1 0
0 0 1 0
0 0 0 1

The agent's cumulative regret vector R_t through time t is then computed in the obvious way: for φ ∈ Φ,

(R_t)_φ = Σ_{τ=1}^t (r_τ)_φ    (2)

One can also define pure action variants of the instantaneous and cumulative regret vectors, as follows:

(r̂_τ)_φ = q̂_τ φ π_τ - q̂_τ π_τ    (3)

(R̂_t)_φ = Σ_{τ=1}^t (r̂_τ)_φ    (4)

One can bound either the expected pure action regret or the (mixed action) regret. To avoid unilluminating complications, we focus on the latter in this work.

Our objective in this work is to establish sublinear bounds on the average internal-regret vector of various online learning algorithms. Equipped with such bounds, we can then go on to claim that our algorithms exhibit no internal regret by applying standard techniques such as the Hoeffding-Azuma lemma (see, for example, Cesa-Bianchi and Lugosi [CBL06]). Note that we cannot establish our results for general Φ. We defer further discussion of this point until Section 9, where we provide a simple counterexample. For completeness, here is the formal definition of no-regret learning:

Definition 2 Given a finite set of transformations Φ, an online learning algorithm ⟨q_t⟩_{t=1}^∞ is said to exhibit no-Φ-regret if for all ε > 0 there exists t_0 ≥ 0 such that for any ODP ⟨π_t⟩_{t=1}^∞,

Pr[ ∃ t > t_0 s.t. max_{φ∈Φ} (1/t) (R̂_t)_φ > ε ] < ε    (5)

The relevant probability space in the above definition is the natural one that arises when considering a particular ODP ⟨π_t⟩_{t=1}^∞ together with an online learning algorithm ⟨q_t⟩_{t=1}^∞. The universe consists of infinite sequences of pure actions ⟨a_τ⟩_{τ=1}^∞ and the measure is defined by the learning algorithm.

We close this section with some notation that appears in later sections:

· We let a · b = a^T b denote the dot product of column vectors a and b.

· For vector v ∈ R^n, we let v^+ denote the componentwise max of v and the zero vector: i.e., (v^+)_i = max(v_i, 0).

3 Algorithms

We begin this section by describing HM, the simple adaptive procedure due to Hart and Mas-Colell [HMC00] that exhibits no internal regret against a nearly oblivious adversary, as well as ICRM, a variant of HM due to Young [You04] that exhibits no internal regret against an adaptive adversary. We then go on to present a simple variant of these algorithms, which we call PI, for power iteration, for which we establish the stronger of these two guarantees.

Definition 3 Define the n-dimensional matrix

N_t = Σ_{φ∈Φ_INT} (R_t)_φ^+ φ

and the scalar

D_t = Σ_{φ∈Φ_INT} (R_t)_φ^+

Algorithm 1 HM [HMC00]
Initialize a_1 to be an arbitrary pure action. During each round t = 1, 2, 3, . . .:
1. Play the pure action a_t.
2. For all j, let (q̂_t)_j = 1(a_t = j).
3. Observe rewards π_t.
4. Update the regret vector R̂_t.
5. Let the regret matrix P_t = (N̂_t + (µt - D̂_t) I) / (µt).
6. Sample a pure action a_{t+1} from q̂_t P_t.

Algorithm 2 Power Iteration
Initialize q_1 to be an arbitrary mixed action. During each round t = 1, 2, 3, . . .:
1. Sample a pure action a_t from q_t.
2. Play the pure action a_t.
3. Observe rewards π_t.
4. Update the regret vector R_t.
5. Define the regret matrix P_t = N_t / D_t.
6. Set the mixed action q_{t+1} ← q_t P_t.
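For concreteness, here is a minimal Python sketch (ours) of Algorithm 2, written with the mixed-action regrets of Equations (1) and (2) and with an oblivious, pre-generated reward sequence standing in for the adversary; the sampling of the pure action a_t is omitted because it does not affect the mixed-action regret tracked here.

```python
# A minimal sketch (ours) of Algorithm 2 (PI).  The reward matrix `rewards`
# is a hypothetical stand-in for the adversary's reward vectors pi_t.
import numpy as np

def phi(n, a, b):
    m = np.eye(n); m[a, a] = 0.0; m[a, b] = 1.0
    return m

def run_pi(rewards):
    T, n = rewards.shape
    pairs = [(a, b) for a in range(n) for b in range(n) if a != b]
    phis = [phi(n, a, b) for a, b in pairs]
    R = np.zeros(len(pairs))            # cumulative regret, one entry per phi in Phi_INT
    q = np.ones(n) / n                  # arbitrary initial mixed action
    for t in range(T):
        pi_t = rewards[t]
        # instantaneous regrets (Equation 1): (r_t)_phi = q phi pi_t - q pi_t
        r = np.array([q @ m @ pi_t - q @ pi_t for m in phis])
        R += r                          # Equation (2)
        Rplus = np.maximum(R, 0.0)
        D = Rplus.sum()
        if D > 0:                       # no regret yet means no update to q
            N = sum(w * m for w, m in zip(Rplus, phis))
            q = q @ (N / D)             # one power-iteration step, not a fixed point
    return q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(run_pi(rng.random((100, 3))))
```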
At a high level, HM (Algorithm 1) and ICRM (not shown) operate in much the same way: at each time step t, an action is played and a reward is earned; then, the regret matrix P_t is computed in terms of N_t and D_t, based on which a new action is derived. But the algorithms differ in an important way: specifically, they differ in their "state" (i.e., what they store from one round to the next). In HM, the state is a pure action, so that during each round, the next pure action is computed based on the current pure action. In ICRM, the state is a mixed action.

Like Young's algorithm, the state in our algorithm, PI (Algorithm 2), is a mixed action. But our algorithm differs from both of the others in its choice of the regret matrix P_t. In PI, P_t = N_t/D_t, which is the same matrix as in Greenwald et al. [GJMar], for example. Intuitively, N_t/D_t is a convex combination of the transformations in Φ_INT, with each φ ∈ Φ_INT weighted by the amount of regret the learner experienced for not having transformed its play as prescribed. In HM and ICRM, P_t is a convex combination of N_t/D_t and the identity matrix. This convex combination depends on a parameter µ, which is an upper bound on the regret per round; typically, µ = 2n.

HM has a per-round run time linear in the number of pure actions because it updates one row of the regret matrix during each round, namely that row corresponding to the action played. ICRM and PI both have per-round run times dominated by the matrix-vector multiplication in Step 6, and are hence quadratic in the number of pure actions.

We analyze PI (Algorithm 2) in this paper, and obtain the following result:

Theorem 4 PI (Algorithm 2) exhibits no internal regret against an adaptive adversary. Specifically, the bound on its average regret is as follows: for all times t,

||R_t^+|| / t ≤ O(√n t^{-1/10})

A slight variant of our analysis shows that ICRM has the same bound as PI. Algorithm 1 was previously analyzed by Cahn [Cah04], who showed that if the adversary is nearly oblivious, then HM exhibits NIR. One can combine ideas from Hart and Mas-Colell [HMC00] and our analysis of PI to show that against an oblivious adversary, the bound on HM's average regret is as follows: for all times t,

E[ ||R̂_t^+|| ] / t ≤ O(√n t^{-1/10})

Theorem 4 implies, by the Proposition at the bottom of page 1133 in Hart and Mas-Colell [HMC00], that, like HM and ICRM, PI also converges to the set of correlated equilibria in self-play.

Note that it is also possible to define a variant of PI with P_t = N̂_t/D̂_t, which, like HM, uses the agent's pure action as its state. We conjecture that this algorithm would exhibit no internal regret against an oblivious (or nearly oblivious) adversary, but do not analyze it because it has no obvious advantages over HM or PI.

Algorithm 3 Multithreaded no-internal-regret learning algorithm
Initialize R, N, D to zero.
First thread: During each round t = 1, 2, 3, . . .:
· Wait until it is time to take an action.
· Get the most up-to-date fixed point computed by the other thread. Call it q_t. (If no fixed point has been computed yet, initialize q_t arbitrarily.)
· Sample a pure action a_t from q_t.
· Play the pure action a_t.
· Observe rewards π_t.
· Update the regret vector R_t.
Second thread: Repeat forever:
· Wait until the other thread updates the regret vector R_τ for some τ > 0.
· Get a copy of R_τ from the other thread.
· Compute N_τ and D_τ.
· Compute a fixed point of N_τ/D_τ.
· Pass this fixed point to the other thread.
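Algorithm 3 is stated with two genuine threads; the sketch below (ours) simulates the same schedule sequentially, modeling the second thread as a fixed-point computation that completes every p rounds and works from the snapshot of the regret vector taken when it started. The reward sequence is again a hypothetical stand-in.

```python
# A sequential simulation (ours, not the paper's code) of the two-thread
# schedule in Algorithm 3: every p rounds a fixed point of the snapshot
# matrix N/D is completed and becomes the mixed action played; in between,
# the previously computed fixed point is reused.
import numpy as np

def phi(n, a, b):
    m = np.eye(n); m[a, a] = 0.0; m[a, b] = 1.0
    return m

def fixed_point(P):
    """Solve q = q P, sum(q) = 1 (least squares keeps the sketch robust)."""
    n = P.shape[0]
    A = np.vstack([(P - np.eye(n)).T, np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def run_mt(rewards, p):
    T, n = rewards.shape
    pairs = [(a, b) for a in range(n) for b in range(n) if a != b]
    phis = [phi(n, a, b) for a, b in pairs]
    R = np.zeros(len(pairs))
    q = np.ones(n) / n            # played until the first fixed point arrives
    snapshot = None               # regret vector the "second thread" works from
    for t in range(T):
        if t % p == 0:            # a fixed-point computation finishes; a new one starts
            if snapshot is not None:
                Rplus = np.maximum(snapshot, 0.0)
                D = Rplus.sum()
                if D > 0:
                    N = sum(w * m for w, m in zip(Rplus, phis))
                    q = fixed_point(N / D)
            snapshot = R.copy()
        pi_t = rewards[t]
        R += np.array([q @ m @ pi_t - q @ pi_t for m in phis])
    return q

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    print(run_mt(rng.random((200, 3)), p=10))
```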
4 Multithreaded algorithm

The ICRM and PI algorithms have better per-round run times than standard NIR learning algorithms, but their convergence rates are far worse. Moreover, these algorithms are inflexible: they cannot expend additional effort during a round to improve their convergence rates. In this section, we present a parameterized, multithreaded algorithm (MT) that smoothly trades off between per-round run time and regret.

The idea underlying MT is simply to spread the computation of a fixed point over many time steps, and in the meantime to play the most recent fixed point computed so far. This idea is formalized in Algorithm 3, in which there are two threads. One thread plays the game, taking as its mixed action the most-recently computed fixed point; the other thread works towards computing a new fixed point.

Theorem 5 Let p ≥ 1 be an upper bound on how many time steps it takes to compute a fixed point. MT (Algorithm 3) has per-round run time O(LS(n)/p + log n), plus the usually negligible run time required to maintain the regret vector (see Section 6), where LS(n) is the time required to solve a linear system of equations expressed as an n-dimensional matrix, and regret bound O(√(np/t)). More precisely, for all times t,

||R_t^+|| / t ≤ √( (n - 1)(4p - 3) / t )

Suppose you are playing a game every minute and you have just barely enough computational resources to find a fixed point in the time allotted with a standard NIR learning algorithm (p = 1). Further, suppose that it takes 1 day of playing for your regret to fall below a desired threshold. Now, suppose the game changes and you now have to make a move every second. If you set p = 60, and continue to compute 1 fixed point per minute, this will require 4 · 60 - 3 ≈ 4 · 60 times more rounds to achieve the same level of regret. But each round is 60 times faster, so the wall-clock time for the same level of regret has increased by a factor of about 4, to a bit under 4 days.

With one extreme parameter setting, namely p = 1, MT is just like a standard NIR learning algorithm, and hence has run time O(n^3) per round and regret bound O(√(n/t)). With another extreme parameter setting, namely p = n^3, MT has run time O(log n) per round (as long as regret can be calculated quickly; see the end of Section 6) and regret bound O(n^2/√t). The intermediate parameter setting p = n yields an O(n^2) run time per round and an O(n/√t) regret bound. This algorithm, therefore, dominates both PI and ICRM, achieving the same run time per round, but a better regret bound, for all values of t ≥ n^{5/4}.

5 General Analysis

In this section, we derive a key lemma that is used in both our analyses. Specifically, we bound the L2-norm of the regret vector at time t in terms of two summations from time τ = 1 to t. Each term in the first bounds how close the mixed action played at time τ is to being a fixed point of the regret matrix at some previous time τ - w(τ). Each term in the second bounds the regret that could ensue because the mixed action played at time τ is out of date.

Lemma 6 For any online learning algorithm and any function w(·) > 0, we have the following inequality: for all times t > 0,

||R_t^+||_2^2 ≤ 2 Σ_{τ=1}^t q_τ (N_{τ-w(τ)} - D_{τ-w(τ)} I) π_τ + (n - 1) Σ_{τ=1}^t (2w(τ) - 1)

where q_τ is the mixed action at time τ, and I is the identity matrix.

We prove this lemma using two preliminary lemmas. The first involves simple algebra.
Lemma 7 For any two vectors a, b ∈ R^d, with d ≥ 1, we have the following inequality:

||(a + b)^+||_2^2 ≤ ||a^+||_2^2 + 2 a^+ · b + ||b||_2^2    (6)

Proof: Both ||·||_2^2 and dot products are additive componentwise, so it suffices to assume a, b are real numbers. If a + b ≤ 0 then |a^+|^2 + 2a^+ b + b^2 = (a^+ + b)^2 ≥ 0 = |(a + b)^+|^2. If a + b > 0 then a^+ + b ≥ a + b = (a + b)^+ > 0. Thus (a^+ + b)^2 ≥ |(a + b)^+|^2.

Lemma 8 For any learning algorithm and any times t, τ > 0, we have the following equality:

r_t · R_τ^+ = q_t (N_τ - D_τ I) π_t = D_τ q_t (N_τ/D_τ - I) π_t

Proof: Standard no-regret arguments about the fixed point (e.g., Theorem 5 in [GLM06]).

Note that if q_t is a fixed point of N_τ/D_τ, as is the case in FV and MT for appropriate choices of τ, then r_t · R_τ^+ = 0. For example, in the traditional algorithm FV, r_t · R_{t-1}^+ = 0.

Proof (of Lemma 6): Fix a τ ∈ {1, . . . , t}. By definition, R_τ = R_{τ-1} + r_τ. Hence, by applying Lemma 7, we obtain a linear approximation of ||R_τ^+||_2^2 - ||R_{τ-1}^+||_2^2 with an error term:

||R_τ^+||_2^2 ≤ ||R_{τ-1}^+||_2^2 + 2 r_τ · R_{τ-1}^+ + ||r_τ||_2^2
            = ||R_{τ-1}^+||_2^2 + 2 r_τ · R_{τ-w(τ)}^+ + 2 r_τ · (R_{τ-1}^+ - R_{τ-w(τ)}^+) + ||r_τ||_2^2
            ≤ ||R_{τ-1}^+||_2^2 + 2 r_τ · R_{τ-w(τ)}^+ + 2(w(τ) - 1)(n - 1) + (n - 1)
            = ||R_{τ-1}^+||_2^2 + 2 r_τ · R_{τ-w(τ)}^+ + (2w(τ) - 1)(n - 1)    (7)

The second inequality follows from the fact that ||r_τ||_2^2 ≤ (n - 1).

Now if we apply Lemma 8 and sum over time, this yields:

Σ_{τ=1}^t ( ||R_τ^+||_2^2 - ||R_{τ-1}^+||_2^2 ) ≤ 2 Σ_{τ=1}^t q_τ (N_{τ-w(τ)} - D_{τ-w(τ)} I) π_τ + (n - 1) Σ_{τ=1}^t (2w(τ) - 1)    (8)

The summation on the left hand side of this equation collapses to ||R_t^+||_2^2 - ||R_0^+||_2^2, and the lemma is proved.

6 Analysis of MT

Equipped with Lemma 6, the proof of Theorem 5 is quite simple.

Proof (of Theorem 5): For general p, the fixed points may be based on out-of-date regret vectors, but they are never very out of date. Once a fixed point is computed, it is based on data that is p rounds out of date. That fixed point is then used for another p rounds while a replacement is computed. Overall, the fixed point played at time t can be based on a regret vector no more than 2p rounds old. More precisely, the τ such that R_τ is used to compute q_t satisfies t - (2p - 1) ≤ τ ≤ t - 1.

Now apply Lemma 6, letting w(τ) be the age of the regret vector used by the second thread in calculating q_τ. Since q_τ is a fixed point of N_{τ-w(τ)}/D_{τ-w(τ)}, it follows that q_τ (N_{τ-w(τ)} - D_{τ-w(τ)} I) = 0. Thus,

||R_t^+||_2^2 ≤ Σ_{τ=1}^t 0 + (n - 1) Σ_{τ=1}^t (2(2p - 1) - 1) = (n - 1) t (4p - 3)

Therefore,

||R_t^+|| / t ≤ √( (n - 1)(4p - 3) / t )

and the theorem is proved.

A naïve computation of the regret vector would limit the per-round run time of MT to Ω(n^2). For applications where p is O(n) (or less), this is not a bottleneck, because in that case the O(n^3/p) bound on the run time of the fixed point computation is larger than the O(n^2) run time of the regret vector updates.

If the ODP is a repeated game where the opponents have O(n) joint actions, an agent can simply record the opponents' actions each round in constant time, and then update the regret vector right before solving for a fixed point; this update takes time O(n^3). In this case, if p = n^3, then MT's per-round run time is O(log n). For general ODPs, where the reward structure may change arbitrarily from one round to the next, keeping track of regret in time o(n^2) per round seems to require random sampling (i.e., bandit techniques; see, for example, Auer et al. [ACBFS02]). We leave further investigation of this issue to future work. Choosing a random action from a probability distribution using a binary search requires Θ(log n) time, so ODPs that require extremely quick decisions cannot be handled without further innovation.
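As an aside on the Θ(log n) sampling step just mentioned, the sketch below (ours) draws a pure action by binary search over prefix sums of the mixed action; building the prefix-sum array is itself O(n), so achieving O(log n) per round would additionally require maintaining those sums incrementally (e.g., with a Fenwick tree), which we do not show.

```python
# A minimal sketch (ours) of sampling a pure action in O(log n) time by
# binary search over precomputed prefix sums of the mixed action q.
import bisect
import random
from itertools import accumulate

def make_sampler(q):
    cdf = list(accumulate(q))          # prefix sums of the mixed action

    def sample():
        u = random.random() * cdf[-1]  # scaling by cdf[-1] guards against rounding drift
        return bisect.bisect_right(cdf, u)

    return sample

if __name__ == "__main__":
    sample = make_sampler([0.1, 0.4, 0.3, 0.2])
    counts = [0, 0, 0, 0]
    for _ in range(10000):
        counts[sample()] += 1
    print(counts)   # roughly proportional to q
```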
7 Analysis of PI

In this section, we analyze PI. By construction, q_τ is not a fixed point but only an approximate fixed point, so q_τ (N_{τ-w(τ)} - D_{τ-w(τ)} I) ≠ 0. Instead, we will show the following:

Lemma 9 For all times τ > 0 and 0 < w(τ) < τ,

|| q_τ (N_{τ-w(τ)} - D_{τ-w(τ)} I) ||_1 = O( nτ/√(w(τ)) + n (w(τ))^2 )

Deferring the proof of Lemma 9, we first show how to use Lemmas 6 (choosing w(τ) = τ^{2/5}) and 9 to analyze PI (here we also use the fact that q_τ (N_{τ-w(τ)} - D_{τ-w(τ)} I) π_τ ≤ || q_τ (N_{τ-w(τ)} - D_{τ-w(τ)} I) ||_1, since rewards lie in [0, 1]):

||R_t^+||_2^2 ≤ 2 Σ_{τ=1}^t || q_τ (N_{τ-w(τ)} - D_{τ-w(τ)} I) ||_1 + (n - 1) Σ_{τ=1}^t (2τ^{2/5} - 1)
            = Σ_{τ=1}^t O( nτ/√(τ^{2/5}) + n (τ^{2/5})^2 ) + (n - 1) O(t^{7/5})
            ≤ O(n t^{9/5}) + (n - 1) O(t^{7/5})
            ≤ O(n t^{9/5})

Taking square roots and dividing by t proves Theorem 4:

||R_t^+||_2 / t ≤ O(√n t^{-1/10})

It remains to prove Lemma 9. For the remainder of this section, we use the shorthands W ≡ w(τ) and t ≡ τ - w(τ). We begin to analyze q_τ (N_t - D_t I) by rewriting this expression as the sum of two terms. The first, which would be zero if power iteration converged in W steps, is provably small. The second measures how the matrices P_T change over time; if all the P_T's were equal, this term would be zero. Noting that q_τ = q_t ∏_{T=t}^{τ-1} P_T, where each P_T = N_T/D_T, we derive the two terms as follows:

q_τ (N_t - D_t I) = D_t q_τ (P_t - I)
                  = D_t q_t ( ∏_{T=t}^{τ-1} P_T ) (P_t - I)
                  = D_t q_t P_t^W (P_t - I) + D_t q_t ( ∏_{T=t}^{τ-1} P_T - P_t^W ) (P_t - I)
                  = D_t q_t ( P_t^{W+1} - P_t^W ) + D_t q_t ( ∏_{T=t}^{τ-1} P_T - P_t^W ) (P_t - I)    (9)

We will bound the two terms in Equation 9 in turn. Beginning with the first, the quantity q_t P_t^W can be interpreted as the distribution of a Markov chain with transition matrix P_t and initial distribution q_t after W time steps. Most Markov chains converge to a stationary distribution, so it is intuitively plausible that the related quantity q_t (P_t^{W+1} - P_t^W) is small. The following lemma, which verifies this intuition, is a strengthening of statement M7 in Hart and Mas-Colell [HMC00]. Our lemma is stronger because our premises are weaker. Whereas their lemma requires that all the entries on the main diagonal of P_t be at least some uniform constant, ours requires only that the sum of P_t's diagonal entries (i.e., its trace) be at least n - 1. The latter of these two conditions (only) is satisfied by PI's choice of P_t, because each P_t is a convex combination of internal regret transformations/matrices, each of which has trace n - 1.

Lemma 10 For all z > 0, if P is an n-dimensional stochastic matrix that is close to the identity matrix in the sense that Σ_{i=1}^n P_{ii} ≥ n - 1, then ||q (P^z - P^{z-1})||_1 = O(1/√z) for all n-dimensional vectors q with ||q||_1 = 1.

Proof: See Appendix.
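As a quick numerical illustration of Lemma 10 (ours, and no substitute for the proof in the Appendix), one can take P to be a random convex combination of internal transformations, which has trace exactly n - 1, and watch the successive differences ||q P^z - q P^{z-1}||_1 shrink; the lemma guarantees they fall at least as fast as O(1/√z).

```python
# A numerical illustration (ours) of Lemma 10.  P is a random convex
# combination of internal transformations phi^(a,b), so it is row-stochastic
# with trace exactly n - 1.  The printed differences shrink with z, consistent
# with the O(1/sqrt(z)) upper bound (a generic random P decays even faster).
import numpy as np

def phi(n, a, b):
    m = np.eye(n); m[a, a] = 0.0; m[a, b] = 1.0
    return m

rng = np.random.default_rng(0)
n = 5
pairs = [(a, b) for a in range(n) for b in range(n) if a != b]
weights = rng.random(len(pairs))
weights /= weights.sum()
P = sum(w * phi(n, a, b) for w, (a, b) in zip(weights, pairs))

q = np.zeros(n)
q[0] = 1.0                      # a point-mass initial distribution
prev = q.copy()
for z in range(1, 257):
    cur = prev @ P              # cur = q P^z
    if z in (4, 16, 64, 256):
        print(z, np.abs(cur - prev).sum())   # ||q P^z - q P^(z-1)||_1
    prev = cur
```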
Now, we can easily bound D_t = D_{τ-W} by (n - 1)(τ - W) ≤ nτ, so the first term in Equation 9 is bounded above by O(nτ/√W). The following lemma bounds the second term in Equation 9:

Lemma 11 For all times τ > 0 and 0 < w(τ) < τ,

|| q_t ( ∏_{T=t}^{τ-1} P_T - P_t^W ) (P_t - I) ||_1 = O(nW^2 / D_t)

The proof of this lemma makes use of the following definition and related fact: the induced L1-norm of a matrix M is given by

||M||_1 = max_{v ≠ 0} ||v M||_1 / ||v||_1    (10)

and for any n-dimensional vector v and matrix M, ||v M||_1 ≤ ||v||_1 ||M||_1.

Proof: Since ||P_t - I||_1 ≤ ||P_t||_1 + ||I||_1 = 2, it follows that

|| q_t ( ∏_{s=0}^{W-1} P_{t+s} - P_t^W ) (P_t - I) ||_1 ≤ 2 || q_t ∏_{s=0}^{W-1} P_{t+s} - q_t P_t^W ||_1

Hence, it suffices to bound || q_t ( ∏_{s=0}^{W-1} P_{t+s} - P_t^W ) ||_1. To do so, we first note that

∏_{s=0}^{W-1} P_{t+s} - P_t^W = Σ_{s=0}^{W-1} ( ∏_{u=0}^{s-1} P_{t+u} ) (P_{t+s} - P_t) P_t^{W-s-1}    (11)

Next, we multiply both sides of Equation 11 by q_t and take the L1-norm. Then, we apply Equation 10 and the facts that ||q_t||_1 = 1 and ||P_{t+s}||_1 = 1, for all s = 0, · · · , W - 1, to obtain the following:

|| q_t ( ∏_{s=0}^{W-1} P_{t+s} - P_t^W ) ||_1
  ≤ Σ_{s=0}^{W-1} || q_t ( ∏_{u=0}^{s-1} P_{t+u} ) (P_{t+s} - P_t) P_t^{W-s-1} ||_1
  ≤ Σ_{s=0}^{W-1} ||q_t||_1 ( ∏_{u=0}^{s-1} ||P_{t+u}||_1 ) ||P_{t+s} - P_t||_1 ||P_t||_1^{W-s-1}
  = Σ_{s=0}^{W-1} ||P_{t+s} - P_t||_1    (12)

The first inequality in the above derivation follows from the triangle inequality. The second follows from the fact that the norm of a product is bounded above by the product of the norms. To understand the final quantity (Equation 12) intuitively, consider two coupled Markov chains, one of which uses P_t as its transition matrix, and the other of which uses P_{t+s}. These Markov chains lead to different distributions to the extent that they have different transition matrices.

Since P_{t+s} = N_{t+s}/D_{t+s}, it follows that:

||P_{t+s} - P_t||_1 = || N_{t+s}/D_{t+s} - N_t/D_t ||_1
                    ≤ || N_{t+s}/D_{t+s} - N_{t+s}/D_t ||_1 + || N_{t+s} - N_t ||_1 / D_t
                    = ||N_{t+s}||_1 |D_{t+s} - D_t| / (D_{t+s} D_t) + || N_{t+s} - N_t ||_1 / D_t

The inequality in this derivation follows from the triangle inequality. Only n - 1 of the n(n - 1) internal transformations affect any particular action, and rewards are between 0 and 1, so |D_{t+1} - D_t| is bounded by n - 1. The induced L1-norm of a matrix is the maximum row sum, after taking the absolute value of each entry; hence, ||N_{t+1} - N_t||_1 is bounded by n - 1. Further, ||N_{t+s}||_1 ≤ D_{t+s}, |D_{t+s} - D_t| ≤ s(n - 1), and ||N_{t+s} - N_t||_1 ≤ s(n - 1), so we conclude that ||P_{t+s} - P_t||_1 ≤ 2s(n - 1)/D_t. Summing over s from 0 to W - 1 yields the desired O(nW^2/D_t).

8 Experiments

We ran some simple experiments on the repeated Shapley game to see whether the theoretical bounds we derived match what is observed in practice. An instance of the internal regret-matching (IRM) algorithm[6] of Greenwald et al. [GLM06] was played against PI, HM with µ = 5, and MT with p = 10. Our results are plotted in Figures 1, 2, 3 and 4 (the fourth figure summarizes all our results). Each experiment was repeated 50 times, and each ensuing data series is plotted with two lines, delimiting the 95% confidence interval. The "true" line corresponding to infinitely many runs probably lies somewhere between the two plotted lines. Note the logarithmic scales of the axes, so powers such as 1/√t appear as straight lines.

[Footnote 6: This algorithm is a close cousin of FV, and has the same regret bound.]

What we observe is twofold: (i) PI does much better in practice than it does in theory, achieving better performance than HM and MT (see Figure 4); and (ii) MT does substantially worse than IRM, with the ratio similar to the √(4(10) - 3) ≈ 6 predicted by theory.

9 Discussion

Standard no-internal-regret (NIR) algorithms rely on a fixed point computation, and hence typically require O(n^3) run time per round of learning. The main contribution of this paper is a novel NIR algorithm, which is a simple and straightforward variant of a standard NIR algorithm, namely that in Greenwald et al. [GJMar]. Rather than compute a fixed point every round, our algorithm relies on power iteration to estimate a fixed point, and hence runs in O(n^2) time per round.

One obvious question that comes to mind is: can power iteration be used in algorithms that minimize Φ-regret, for arbitrary Φ? The answer to this question is no, in general. For example, consider an ODP with two actions, and only one action transformation φ, which swaps the two actions: i.e.,

φ =
0 1
1 0

A standard Φ-regret-minimizing algorithm would play the fixed point of this matrix, which is uniform randomization.
However, PI would learn a predictable sequence of mixed actions, namely q, 1 - q, q, 1 - q, . . .. Since an adversary could easily exploit this alternating sequence of plays, the idea does not immediately apply to arbitrary Φ. The part of the proof that is specific to Φ_INT is P_t having trace n - 1, allowing us to use Lemma 10.

Another related question is: can power iteration be used in other NIR algorithms? For example, Cesa-Bianchi and Lugosi [CBL03] and Greenwald et al. [GLM06] present a class of NIR algorithms, each one of which is based on a potential function. Similarly, Blum and Mansour [BM05] present a method of constructing NIR learners from no-external-regret (NER) learners. We conjecture that the power iteration idea could be applied to any of these NIR algorithms, but we have not yet thoroughly explored this question. Our admittedly limited experimental investigations reveal that perhaps PI's convergence rate in practice is not as bad as the theory predicts, but further study is certainly warranted. Another interesting question along the same lines is: would another iterative linear solving method, specifically one that is more sophisticated than power iteration, such as biconjugate gradient, yield better results, either in theory or in practice?

[Figure 1: IRM and PI playing Shapley. 95% confidence interval of the average of 50 runs shown.]

[Figure 2: IRM and HM with µ = 5 playing Shapley. 95% confidence interval of the average of 50 runs shown.]

[Figure 3: IRM and MT with p = 10 playing Shapley. 95% confidence interval of the average of 50 runs shown.]

Acknowledgments

We are grateful to Dean Foster for originally suggesting that we try out power iteration in our experiments with regret-minimizing algorithms. We are also grateful to Casey Marks for providing much of the code for our experiments and to Yuval Peres for assistance simplifying the proof of Lemma 10. This research was supported in part by the Sloan Foundation.

References

[ACBFS02] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.

[BM05] A. Blum and Y. Mansour. From external to internal regret. In Proceedings of the 2005 Computational Learning Theory Conference, pages 621-636, June 2005.

[Cah04] A. Cahn. General procedures leading to correlated equilibria. International Journal of Game Theory, 33(1):21-40, December 2004.

[CBL03] N. Cesa-Bianchi and G. Lugosi. Potential-based algorithms in on-line prediction and game theory. Machine Learning, 51(3):239-261, 2003.

[CBL06] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[CLRS01] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms, chapter 28, pages 757-758. MIT Press, 2nd edition, 2001.

[CW87] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. In STOC '87: Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, pages 1-6. ACM Press, New York, NY, USA, 1987.
[FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119-139, 1997.

[FV97] D. Foster and R. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21:40-55, 1997.

[GJMar] A. Greenwald, A. Jafari, and C. Marks. A general class of no-regret algorithms and game-theoretic equilibria. In Amitabha Gupta, Johan van Benthem, and Eric Pacuit, editors, Logic at the Crossroads: An Interdisciplinary View, volume 2. Allied Publishers, to appear.

[GLM06] A. Greenwald, Z. Li, and C. Marks. Bounds for regret-matching algorithms. In Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics, 2006.

[HMC00] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127-1150, 2000.

[HMC01] S. Hart and A. Mas-Colell. A general class of adaptive strategies. Journal of Economic Theory, 98(1):26-54, 2001.

[Lin92] T. Lindvall. Lectures on the Coupling Method, chapter II.12, pages 41-47. Wiley, 1992.

[LW94] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212-261, 1994.

[Str69] V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354-356, 1969.

[You04] P. Young. Strategic Learning and its Limits. Oxford University Press, Oxford, 2004.

[Figure 4: Summary of Figures 1, 2 and 3.]

A Proof of technical Lemma 10

Note: in this proof, we use N, M and t for meanings unrelated to those in the main body of the paper. Don't be confused.

Let M be an n by n matrix which is row-stochastic (vectors should be multiplied on the left, as in q M) and has trace at least n - 1. We want to show:

max_{||q||_1 = 1} || q (M^W - M^{W-1}) ||_1 = O(1/√W)

For any two probability measures µ and ν on a probability space Ω, their total variation distance is defined to be

||µ - ν||_TV = (1/2) Σ_a |µ(a) - ν(a)|    (13)

We denote by µ(X) the distribution of random variable X. Let Q(t) denote the state of a Markov chain with transition matrix M and initial distribution q after t time steps. Our desired conclusion can be recast in Markov chain language as ||µ(Q(W)) - µ(Q(W - 1))||_TV = O(1/√W) for all initial distributions q.

We know that Σ_i m_{i,i} ≥ n - 1, where m_{i,j} is the element in the ith row and jth column of the matrix M. All m_{i,i} are at most 1, so there can be at most one state, call it p, satisfying m_{p,p} < 1/2. If no such state exists the Lemma was already shown as step M7 of [HMC00], so assume it exists.[7] We define a new matrix N as the unique solution to m_{i,j} = (1/2)(n_{i,j} + δ_{i,j}) if i ≠ p, and m_{p,j} = n_{p,j} otherwise, where δ_{i,j} = 1 if i = j and 0 otherwise.[8] It is easy to check that N is row stochastic and is therefore the transition matrix of some Markov chain. For simplicity we denote by N the Markov chain with transition matrix N and initial distribution q.

[Footnote 7: One can see that the current proof also holds if no such state exists.]

[Footnote 8: I.e., n_{i,j} = 2m_{i,j} - δ_{i,j} if i ≠ p, and n_{p,j} = m_{p,j} otherwise.]

Let B(0), B(1), . . . be the random walk associated with the Markov chain N. One can easily create a random walk of the Markov chain M from the walk of N as follows. If the current state is the special state p, do what the N-chain did. Otherwise, flip a coin, following N with probability 50% and remaining at the same state otherwise. Formally, define the index I_t inductively by I_0 = 0 and I_t = I_{t-1} + (a symmetric Bernoulli random variable) if B(I_{t-1}) ≠ p, and I_t = I_{t-1} + 1 otherwise. Then it can be shown that Q(t) = B(I_t) has distribution q M^t.
We need to show that the distributions of B(I_W) and B(I_{W-1}) are close to each other. For i ≥ 0, define X_i = #{ t ≥ 0 : I_t = i }. Intuitively, X_i is the number of steps the M chain takes while the N chain is in state B(i). Our proof hinges on the easily seen fact that the X_i's are independent random variables. If B(i) = p, then X_i = 1, and if B(i) ≠ p, then X_i is geometrically distributed with mean 2.[9] Let T_i = Σ_{j=0}^{i-1} X_j. One can see that I_W = max{ i : T_i ≤ W }.

[Footnote 9: Note that our "geometrically distributed" variables have support 1, 2, . . ., so Pr(X_i = k) = (1/2)^k for k ≥ 1.]

In order to prove the conclusion we need the following two lemmas.

Lemma 12 Let f(y, z) be a function, and let Y, Y′ and Z be random variables with Z pairwise independent of Y and Y′. Let I = f(Y, Z) and I′ = f(Y′, Z) be random variables. Then ||µ(I) - µ(I′)||_TV ≤ ||µ(Y) - µ(Y′)||_TV.

Proof: The total variation distance between µ(X) and µ(X′) is equal to the minimum possible probability that X ≠ X′ over couplings of X and X′. A coupling of Y and Y′ can be extended into a coupling of I and I′ trivially.

Lemma 13 Let S_k = Σ_{i=1}^k X_i, where the X_i's are independent identically distributed random variables with geometric distributions with mean 2. Then ||µ(S_k) - µ(1 + S_k)||_TV = O(k^{-1/2}).

Proof: Equation II.12.4 of [Lin92].

The probability that I_W < W/3 is bounded by the probability that the sum of W Bernoulli random variables with mean 1/2 is less than W/3. A standard Chernoff bound therefore shows that the event E that I_W < W/3 is exponentially unlikely. Condition on the path B(0), B(1), . . . and on event E not happening. We can therefore write I_W = max{ i : Σ_{j=W/3+1}^i X_j ≤ W - T_{W/3} } and I_{W-1} = max{ i : Σ_{j=W/3+1}^i X_j ≤ W - 1 - T_{W/3} }. Now by Lemma 12, associating Z with the vector-valued random variable {X_i}_{i=W/3+1}^∞, Y with W - T_{W/3}, and Y′ with W - 1 - T_{W/3}, we see that it suffices to bound the total variation distance between T_{W/3} and T_{W/3} + 1.

Define k = #{ 0 ≤ i ≤ W/3 : B(i) ≠ p }. Every visit to p is followed by a visit to another state with probability at least 1/2, so with exponentially high probability over the choices of the B(i), k ≥ W/12. Condition on the event k ≥ W/12. By definition T_{W/3} = (W/3 - k) + Σ_{i : B(i)≠p} X_i, so it suffices to analyze the variation distance between the sum of k ≥ W/12 geometric random variables and the same shifted by 1. By Lemma 13, this is O(1/√W). Therefore we obtain

||µ(I_W) - µ(I_{W-1})||_TV = O(W^{-1/2})    (15)

Returning to the M chain, we have

Pr(B(I_W) = i) = Σ_a Pr(B(I_W) = i | I_W = a) Pr(I_W = a) = Σ_a Pr(B(a) = i) Pr(I_W = a)

and similarly

Pr(B(I_{W-1}) = i) = Σ_a Pr(B(a) = i) Pr(I_{W-1} = a)    (14)

Thus

||µ(B(I_W)) - µ(B(I_{W-1}))||_TV
  = (1/2) Σ_i | Pr(B(I_W) = i) - Pr(B(I_{W-1}) = i) |
  ≤ (1/2) Σ_i Σ_a Pr(B(a) = i) | Pr(I_W = a) - Pr(I_{W-1} = a) |
  = (1/2) Σ_a | Pr(I_W = a) - Pr(I_{W-1} = a) |
  = ||µ(I_W) - µ(I_{W-1})||_TV
  = O(W^{-1/2})

This concludes the proof.

We remark that this lemma can also be proved in a more self-contained manner via a Markov chain coupling. The motivating story follows. Charlie and Eve are walking drunkenly between the n neighborhood bars. If Charlie is in a good bar, each time step he first flips a coin to decide whether or not he should leave that bar. If he decides to leave, he then makes a probabilistic transition to some bar (perhaps the same one). If Charlie is in a bad bar, he always leaves. Eve starts one time step later than Charlie at the same initial bar.
Eve makes her decision to leave or not independently of Charlie, but reuses Charlie's choices of where to go next. However, if Eve ever catches up with Charlie, she switches to just following him around. A natural question to ask is how likely Eve and Charlie are to be at the same bar after t time steps. Note that if you look at Eve's motions and ignore Charlie's, she behaves exactly like Charlie does. The connection to the present lemma is that Charlie's distribution corresponds to M^t and Eve's to M^{t-1}. Standard arguments relating total variation distance to couplings show that if Eve and Charlie usually finish at the same bar, their probability distributions must be quite similar.