Optimal Strategies from Random Walks

Jacob Abernethy, Division of Computer Science, UC Berkeley, jake@cs.berkeley.edu
Manfred K. Warmuth, Department of Computer Science, UC Santa Cruz, manfred@cse.ucsc.edu
Joel Yellin, Division of Physical and Biological Sciences, UC Santa Cruz, yellin@soe.ucsc.edu

(Supported by DARPA grant FA8750-05-2-0249 and NSF grant DMS-0707060. Supported by NSF grant CCR 9821087.)

Abstract

We analyze a sequential game between a Gambler and a Casino. The Gambler allocates bets from a limited budget over a fixed menu of gambling events that are offered at equal time intervals, and the Casino chooses a binary loss outcome for each of the events. We derive the optimal min-max strategies for both participants. We then prove that the minimum cumulative loss of the Gambler, assuming optimal play by the Casino, is exactly a well-known combinatorial quantity: the expected number of draws needed to complete a multiple set of "cards" in the generalized Coupon Collector's Problem. We show that this quantity and the optimal strategy of the Gambler can be efficiently estimated from a simple random walk.

1 Introduction

This paper analyzes the problem of sequential prediction and decision making from the perspective of a two-player game. The game is played by a learner, called here the Gambler, who makes a sequence of betting decisions. The Gambler's opponent is the Casino in which he plays.

Gambler vs. Casino:

1. On each day, the Gambler arrives at the Casino with $1. The Casino presents $n$ events, and each event is played once per day. The Gambler chooses a distribution vector $w \in [0,1]^n$, with $\sum_i w_i = 1$, and bets the portion $w_i$ of his $1 budget on event $i$.

2. On each day the Casino determines the outcome of each event with the objective of winning as much money from the Gambler as possible. In particular, after observing the distribution of the Gambler's bets, the Casino decides between a loss or a no-loss for each daily event. These choices are summarized by a loss vector $\ell \in \{0,1\}^n$, where $\ell_i = 1$ means that the Gambler lost on event $i$. (For simplicity, we assume the only relevant quantities are losses. By shifting our baseline we can model wins as non-losses.)

3. At the end of each day, the Gambler leaves the Casino having lost $w \cdot \ell = \sum_i w_i \ell_i$, and his cumulative loss is updated as $L \leftarrow L + w \cdot \ell$. The Gambler also monitors the cumulative performance of each event with a state vector $s \in \mathbb{N}^n$, where $s_i$ is the current total loss of event $i$. After the losses of the current day are incurred, the state vector is updated to $s \leftarrow s + \ell$.

4. The Gambler stops playing as soon as he observes that each event has suffered more than $k$ losses, where $k$ is a fixed positive integer known to both players. The Casino is aware of this decision and behaves accordingly.

Gambling against a casino may seem an unlikely starting point for a model of sequential decision making: we generally consider the typical environment for learning to be stochastic rather than adversarial. Yet are these two environments necessarily incompatible? Among the objectives of this paper is to address questions such as "What is the Gambler's worst-case cumulative loss?" and "What is the optimal betting strategy?" These questions, while clearly game-theoretic, are ultimately answered here by considering a randomized Casino rather than an adversarial one. From this perspective, randomness may indeed be the Gambler's worst adversary.

Early work on sequential decision making focused on the problem of predicting a binary outcome given advice from a set of $n$ experts. In that setting, the goal of the predictor is to combine the predictions of the experts to make his own prediction, with the objective of performing well, in hindsight, compared to the best expert. The performance of both the learner and the experts is measured by a loss function that compares predictions to outcomes. One of the early algorithms, the Weighted Majority algorithm [LW94], maintains a distribution corresponding to the degree of trust in each expert. It was observed by Freund and Schapire [FS97] that the analysis of the Weighted Majority algorithm can be applied to the so-called hedge setting. Rather than predict a binary outcome, the learner now plays some distribution over the experts on every round, a loss value is assigned to each expert independently, and the learner suffers the expected loss under his chosen distribution. In this case, the learner bears the exact burden of the Gambler, that of "hedging" his bets so as to minimize his cumulative loss. To emphasize that the Gambler/Casino game is useful for settings other than prediction, we use the term "event" rather than "expert".

A central theme of much of the sequential decision making literature is the use of so-called "exponential weights" to determine the learner's distribution on each round. Applied to the Casino game, the exponential weighting scheme yields the following strategy for the Gambler:

At a state $s$, bet $w_i = \frac{\beta^{s_i}}{\sum_j \beta^{s_j}}$ on event $i$,   (1)

where the factor $\beta$ lies in $[0,1)$. From the analysis of the Weighted Majority algorithm it follows that the cumulative loss of the Gambler using the above strategy is bounded by $\frac{\ln n + k \ln(1/\beta)}{1-\beta}$. Under the assumption that the loss of the best event is at most $k$, the factor $\beta$ can be tuned [FS97] so that the above bound becomes $k + \sqrt{2k \ln n} + \ln n$.

The exponential weights framework, as well as other online learning techniques, can be motivated using the method of relative entropy regularization [KW99]. While the resulting algorithms are elegant and in some cases can be shown to be asymptotically optimal [CBFHW96], they do not optimally solve the underlying game. Some improvements have been made using, for example, binomial weights, which lead to slightly better but still non-optimal solutions [CBFHW96] in a setting where the experts must produce a prediction. While it is formally easy to define the optimal algorithms using minimax expressions, it has generally been assumed that actually computing an efficient solution is quite challenging [CBFH+97]. More recently, however, a minimax result [ALW07] was obtained for the specific game of prediction with absolute loss. The resulting algorithm, Binning, is efficient and optimal in a slightly relaxed setting.

In this paper we show that the minimax solution to the Gambler/Casino game, which is identical to the underlying game of the hedge setting with binary losses, can be obtained efficiently. In addition, the game can be fully analyzed using a simple Markov process: a random walk on an $n$-dimensional lattice. The value of the game, that is, the cumulative loss of an optimal Gambler, can be interpreted as the expected length of such a random walk. The Gambler's optimal play, the portion of his budget he should bet on a given event, can similarly be interpreted as the probability of a specific random outcome of this walk.
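To make the exponential weights strategy (1) concrete, the following minimal Python sketch computes the Gambler's bets from the current state vector $s$. The function tuned_beta is an illustrative, hedge-style choice of $\beta$ in terms of $k$ and $n$ and is an assumption of this sketch, not the exact tuning of [FS97].

```python
import math

def exp_weights_bets(s, beta):
    """Exponential-weights strategy (1): bet w_i proportional to beta**s_i,
    where s_i is the number of losses event i has suffered so far."""
    unnorm = [beta ** si for si in s]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def tuned_beta(k, n):
    """A common hedge-style tuning of beta (illustrative assumption only;
    see [FS97] for the tuning behind the bound k + sqrt(2k ln n) + ln n)."""
    return 1.0 / (1.0 + math.sqrt(2.0 * math.log(n) / max(k, 1)))

# Example: n = 4 events with past losses s = (2, 0, 1, 3); events with fewer
# losses so far receive a larger share of the $1 budget.
print(exp_weights_bets([2, 0, 1, 3], beta=tuned_beta(k=3, n=4)))
```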
The game's stopping criterion, namely that the game ends once all events have lost at least $k+1$ times, may seem unusual at first, yet it fits quite naturally within the experts framework. Indeed, online learning bounds are often tuned with explicit a priori knowledge of the cumulative loss of the best expert, which here would be $k$.¹ While perhaps not realistic in practice, $k$ can be estimated, and various techniques such as successive doubling can be used to obtain near-optimal bounds [CBFH+97].

¹ Strictly speaking, in the expert setting it is assumed that at least one expert has not crossed the $k$-mistake threshold, while here we stop the Gambler/Casino game when the loss of the last expert/event goes beyond this threshold. It is easy to show that this slight modification, made for convenience, increases the worst-case loss of the Gambler by exactly 1.

The paper is structured as follows. In Section 2 we give a minimax definition of the optimal value of the game considered here. In Section 2.1 we modify the game by restricting the adversary's choices to unit loss vectors. In Section 3 we turn our attention to a specific Markov process with a number of relevant properties. We apply this randomized approach to the Casino game in Section 4, where we prove our main results. In Section 5 we give recurrences and exact formulas, based on sums over multinomials, for the value of the game and for the optimal probabilities, and we set out an efficient method to compute both the optimal strategy of the Gambler and the value of the game. In Section 6 we compare the optimal regret bound to previous results, and in Section 7 we draw a connection between our game and a well-studied version of the coupon collector problem. We also briefly summarize what is known about the asymptotics of this problem. We conclude with a discussion of our results and a list of open problems (Section 8).

2 The Value of the Game

Assume that on each event the Gambler has already suffered the losses specified by a state vector $s$. Define $V(s)$ to be the total money lost by an optimal Gambler playing against an optimal Casino starting from the state $s$. That is, $V(s)$ is the amount of money that the optimal Gambler will lose (against an optimal Casino) from now until the end of the game. Roughly speaking, the value of the game is computed as

$V(s) \;\overset{?}{=}\; \min_{\text{dist. } w} \; \max_{\ell \in \{0,1\}^n} \; \big( w \cdot \ell + V(s + \ell) \big).$

The Gambler chooses $w$ to minimize the loss while the Casino chooses $\ell$ to maximize the loss, where the loss is computed as the loss $w \cdot \ell$ on this round plus the worst-case loss $V(s + \ell)$ on future rounds. However, we have to be careful, as this recursive definition does not address the following issues:

- When is the game over? What is the base case of $V(\cdot)$?
- Is this recursion bounded?
- Do we need to record the losses $s_i$ that go above $k$?

We address these issues beginning with some simplifications and notational conventions. First, we assume that the state vector $s$ lies within the set $S = \{0, 1, \ldots, k+1\}^n$. Note that it is not necessary to record the losses of events that have already crossed the $k$ threshold. We call such events dead. Since the losses of dead events are not restricted, having loss $k+3$ is the same as having loss $k+100$. We therefore "round" all states $s$ into the state space $\{0, 1, \ldots, k+1\}^n$ using the rounded-addition notation $\oplus$, which we define below. We use the notation $A(s)$ for the set of live events; the statement $i \notin A(s)$ is exactly the statement $s_i = k+1$. Second, as the game is defined recursively, we must guarantee that this recursion terminates.
If the Casino repeatedly chose $\ell = \langle 0, \ldots, 0 \rangle$, for example, the game would make no progress. The same problem occurs if the Casino causes losses only on dead events. We must therefore place additional restrictions so that the dead state is eventually reached. The simplest way to ensure this is to forbid the Casino from inflicting loss on only dead events. Yet this is not sufficient: with this restriction alone the Gambler would have a guaranteed non-losing strategy by betting solely on dead events. We thus assume that the Casino cannot inflict losses on dead events and that the Gambler cannot bet money on them (keeping in mind that all such bets are in any case non-optimal). We must enforce this explicitly in order to have a well-defined game.

We use two notational conventions to describe the above restrictions. First, we write $w \in \Delta_{A(s)}$ for the set $\{ w \in \Delta_n \mid w_i = 0 \text{ for all } i \notin A(s) \}$, where $\Delta_n$ is the $n$-simplex. We also abuse notation slightly and write $\ell \subseteq A(s)$ to mean that $\ell \in \{0,1\}^n$ and $\ell_i = 0$ for all $i \notin A(s)$.

We now define the value of the game precisely.

Definition 1. Define the value $V(s)$ of the game as follows.
- At the dead state, $V(d) := 0$.
- For any other $s \in S$, we define $V(s)$ recursively as

$V(s) := \min_{w \in \Delta_{A(s)}} \; \max_{0 \ne \ell \subseteq A(s)} \; \big( w \cdot \ell + V(s \oplus \ell) \big).$   (2)

In our notation, we commonly make use of several special states. The state where the game begins is the initial state $s = \mathbf{0}$. Once all events have lost more than $k$ times, the game is over, and we refer to the resulting state as the dead state $d$. It will also be useful to consider the one-live states $o_i$, in which all events except $i$ are dead and the remaining event $i$ has exactly $k$ losses. By the game definition, it is easy to check that $V(o_i) = 1$: the Gambler must bet all of his money on this event, the Casino must inflict a corresponding loss, and this charges the Gambler $1 and ends the game. Below we include a list of notations for reference.

Notation:
- $S := \{0, \ldots, k+1\}^n$ (the state space)
- $\mathbf{0} := \langle 0, 0, \ldots, 0 \rangle$ (the initial state)
- $d := \langle k+1, \ldots, k+1 \rangle$ (the dead state)
- $o_i := d - e_i$ (the $i$th one-live state)
- $A(s) := \{ i \in [n] : s_i \le k \}$ (the set of live events)
- $(s \oplus \ell)_i := \min(s_i + \ell_i,\, k+1)$ ("rounded" addition)
- $|s| := \sum_i s_i$ (elementwise sum)
- $\Delta_n := \{ w \in \mathbb{R}^n_{+} : |w| = 1 \}$ (the $n$-simplex)

2.1 The Modified Game

We also consider a modified game that we make easier for the Gambler. In this new game, we restrict the Casino to inflict loss on exactly one event in each round, i.e. $\ell$ must be one of the basis vectors $e_1, \ldots, e_n$. For $\ell = e_i$ we have $w \cdot \ell = w_i$. We can then precisely define the value $V'(\cdot)$ of the modified game:

Definition 2. Define $V'(d) := V(d) = 0$. Otherwise,

$V'(s) := \min_{w \in \Delta_{A(s)}} \; \max_{i \in A(s)} \; \big( w_i + V'(s \oplus e_i) \big).$   (3)

One of the central results of this paper is that the above game, while seemingly more restricted, is ultimately just as difficult for the Gambler as the original game. It is easy to show that $V'(s) \le V(s)$, since the Casino has strictly more choices in the original game. We go further and prove as our main result, in Theorem 12, that $V'(s) = V(s)$. Thus both games have the same worst-case outcome. Both the analysis of the modified game and the proof of this result require a different formulation of the Casino's actions.

3 A Randomized Casino

In Section 2 we presented a game-theoretic analysis of a well-known sequential prediction problem characterized as a game between a Gambler and a Casino. In the present section, we consider a different framework, in which the Casino uses random events. We will show that introducing a randomized strategy for the Casino enables us to specify the optimal strategy of the Gambler.

3.1 A Random Walk on the State Graph

Let us now imagine that our Casino does not fix outcomes deterministically, but instead chooses the outcome of each event using the following random process. Assume we are at state $s$ and that, on each day, an event $i$ is chosen uniformly at random from $\{1, \ldots, n\}$ and a loss is assigned to event $i$. In other words, the loss vector $\ell$ is a uniformly sampled unit vector $e_i$, and after the loss the new state is $s \oplus e_i$. This process continues until we reach the dead state $d$.

We can model this behavior as a Markov process on the state space as follows. Consider any sequence of indices $I_1, I_2, \ldots \in [n]$, and let $S_t := \sum_{m=1}^{t} e_{I_m}$, where $S_0 := \mathbf{0}$. Assuming that we start at state $s$, this induces a sequence of states $s = s \oplus S_0 \to s \oplus S_1 \to s \oplus S_2 \to \cdots \to s \oplus S_t$. Notice that this process has "self-loops"; i.e., it is quite possible that $s \oplus S_t = s \oplus S_{t+1}$. This occurs when coordinate $I_{t+1}$ of $s \oplus S_t$ is already at $k+1$.

If we imagine the state space $S$ as an $n$-dimensional lattice, which we will call the state lattice, then the Markov process above can be interpreted as a random walk on this lattice. The walker starts at the initial state $\mathbf{0}$, and on every time interval a single positively directed step is taken along an axis drawn uniformly at random. If the walker has already reached the $k+1$ boundary in this dimension, he remains in place. The walk stops once the dead state $d$ is reached. We will show that the value of the game is $1/n$ times the expected total number of random draws needed to reach $d$, i.e., $1/n$ times the expected walk/path length from $s$ to $d$.

3.2 Survival Probabilities

We now define a survival probability at a state $s$. We will show in the next section that such probabilities are the basis for the Gambler's optimal strategy.

Definition 3. Assume we are at state $s$, and let the random state $s \oplus S_t$ be the result of the above random walk after $t$ steps. Define the $i$th survival probability $p_i(s)$ to be the probability that $\exists t : s \oplus S_t = o_i$. Equivalently, $p_i(s) = \Pr(A(s \oplus S_t) = \{i\} \text{ for some } t)$.

We call these survival probabilities since $p_i(s)$ is the probability that, if the losses were assigned randomly to the events in sequence, the $i$th event would be the last non-dead event.

Lemma 4. For any $s \ne d$, the vector $p(s) := \langle p_i(s) \rangle_{i=1}^{n}$ defines a distribution on $\{1, \ldots, n\}$.

Proof: The quantity $\sum_i p_i(s)$ is the probability that eventually there is exactly one live event. This probability is exactly 1, given that the current state is not the dead state $d$.

We list some examples of survival probabilities:
- When $s = \mathbf{0}$ (or any other symmetric state), we have $p_i(s) = \frac{1}{n}$ for all $i$, because every event has the same chance of surviving last.
- When $i$ is a dead event, i.e. $s_i = k+1$, then $p_i(s) = 0$, because no dead event can be the last remaining live event.
- If there is only one remaining live event, i.e. $A(s) = \{i\}$, then $p_i(s) = 1$.

Computing $p_i(s)$ for more general $s$ requires a recursion, and we leave this discussion for Section 5.

3.3 Expected Path Lengths

Another important quantity we consider is the length of a random path, i.e. the number of steps in the random walk on the state lattice required until the dead state $d$ is reached.

Definition 5. For a sequence $S_0, S_1, \ldots$, let $T(s) := \min\{ t \ge 0 : s \oplus S_t = d \}$. That is, $T(s)$ is the length of the random path that starts at $s$ and has just entered $d$. Furthermore, let $\tau(s) := \mathbb{E}[T(s)]$ be the expected path length.

We note that paths may be infinitely long due to self-loops, yet such paths occur with probability 0.
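The state-lattice walk of Section 3.1 is simple to simulate. The following Python sketch (with illustrative helper names of our own choosing) implements the rounded update $s \oplus e_i$, the live set $A(s)$, and a single walk from $s$ to the dead state, returning the path length $T(s)$ and the last surviving event. It is only an illustration of Definitions 3 and 5, not an optimized implementation.

```python
import random

def live(s, k):
    """A(s): indices of events that have at most k losses."""
    return [i for i, si in enumerate(s) if si <= k]

def step(s, i, k):
    """Rounded addition s (+) e_i: per-event losses are capped at k+1."""
    s = list(s)
    s[i] = min(s[i] + 1, k + 1)
    return tuple(s)

def walk(s, k, rng=random):
    """Run the uniform random walk from s until the dead state d.
    Returns (T, last): the path length T(s) and the last surviving event."""
    n = len(s)
    t, last = 0, None
    while live(s, k):
        if len(live(s, k)) == 1:
            last = live(s, k)[0]      # the walk is at a one-live state o_i
        i = rng.randrange(n)          # pick an event uniformly at random
        s = step(s, i, k)             # self-loop if event i is already dead
        t += 1
    return t, last

# Example: one walk from the initial state with n = 3 events and k = 1.
print(walk((0, 0, 0), k=1))
```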
A key fact is that the expected path length $\tau(s)$ can be rewritten using indicator variables:

$T(s) = \sum_{t=0}^{\infty} \mathbf{1}[\, s \oplus S_t \ne d \,],$   (4)

i.e. $T(s)$ is the number of initial segments (including the empty segment) of a random path starting at $s$ that have not yet reached the dead state $d$.

We now prove a relationship between the expected path length $\tau(s)$ and the survival probabilities $p_i(s)$.

Lemma 6. For any state $s$ and event $i$, $p_i(s) = \frac{1}{n}\big( \tau(s) - \tau(s \oplus e_i) \big)$.

Proof: When $i \notin A(s)$, then $s = s \oplus e_i$ and it is trivially true that $p_i(s) = 0 = \frac{1}{n}(\tau(s) - \tau(s \oplus e_i))$. The interesting case is when $i \in A(s)$. Using (4), we have

$\tau(s) - \tau(s \oplus e_i) = \mathbb{E}[T(s)] - \mathbb{E}[T(s \oplus e_i)] = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \mathbf{1}[\, s \oplus S_t \ne d \,] - \mathbf{1}[\, (s \oplus e_i) \oplus S_t \ne d \,] \Big].$

Since the dead state $d$ is an absorbing state, for any path $S$, if $s \oplus S = d$ then $(s \oplus e_i) \oplus S = d$ as well. Equivalently, if $(s \oplus e_i) \oplus S \ne d$ then $s \oplus S \ne d$. Thus, in the difference between the two expectations, we only need be concerned with sequences $S_t$ that are accounted for in the first expectation but not in the second. Therefore the above difference becomes

$\mathbb{E}\Big[ \sum_{t=0}^{\infty} \mathbf{1}\big[\, (s \oplus S_t \ne d) \wedge ((s \oplus e_i) \oplus S_t = d) \,\big] \Big].$

We claim that any $S_t$ satisfying the conjunction must have the property that $(S_t)_i = k - s_i$. This is true because $(s \oplus e_i) \oplus S_t = d$ forces $(S_t)_i \ge k - s_i$ and $(S_t)_j \ge k+1-s_j$ for $j \ne i$, while $s \oplus S_t \ne d$ then forces $(s \oplus S_t)_i \le k$, i.e. $(S_t)_i \le k - s_i$. This implies that $s \oplus S_t = o_i$, and the above difference becomes

$\mathbb{E}\Big[ \sum_{t=0}^{\infty} \mathbf{1}[\, s \oplus S_t = o_i \,] \Big].$

The last term is exactly $p_i(s)$, the probability that $s \oplus S_t$ eventually arrives at $o_i$, times the expected number of iterations spent in state $o_i$ before arriving at $d$. To leave $o_i$, the random walk must make a step in the $i$th direction, and thus the expected "waiting time" at $o_i$ is

$\sum_{q=1}^{\infty} q \underbrace{\Big(1 - \tfrac{1}{n}\Big)^{q-1}}_{\text{prob. of } q-1 \text{ loops}} \; \underbrace{\tfrac{1}{n}}_{\text{prob. of leaving}} \;=\; n.$

Hence $\tau(s) - \tau(s \oplus e_i) = n \, p_i(s)$, which proves the lemma.

The last lemma implies an important fact about the state lattice. Interpret the state lattice as a directed graph with a directed edge $(s, s \oplus e_i)$ for each $i \in A(s)$, and associate with the edge $(s, s \oplus e_i)$ the survival probability $p_i(s)$. Consider starting at state $s$ and walking through this directed graph: $s \to s \oplus e_{i_1} \to s \oplus e_{i_1} \oplus e_{i_2} \to \cdots$

Corollary 7. Consider any two states $s, s'$. For any path from $s$ to $s'$ through the directed state graph, the sum of all edge weights $p_i(\cdot)$ along this path is independent of the choice of path.

Proof: Assume the path $s = s^1, s^2, \ldots, s^T, s^{T+1} = s'$ is defined by a sequence of moves $i_1, i_2, \ldots, i_T$, where $s^{t+1} = s^t \oplus e_{i_t}$. By Lemma 6 the total weight is

$\sum_{t=1}^{T} p_{i_t}(s^t) = \sum_{t=1}^{T} \frac{1}{n}\big( \tau(s^t) - \tau(s^{t+1}) \big) = \frac{1}{n}\big( \tau(s) - \tau(s') \big),$

which is independent of the choice of path.

Note that in the definition of the directed state graph above and in the corollary we ignore loops, which occur when $s = s \oplus e_i$ (or equivalently $i \notin A(s)$). Such loops out of state $s$ are immaterial because they correspond to dead events, and $i \notin A(s)$ iff $p_i(s) = 0$.

4 The Optimal Strategy

We now have all the tools to express $V'(s)$ in terms of the expected path length $\tau(s)$, prove that $V(s) = V'(s)$, and show that the optimal betting strategy for the Gambler is $p(s)$. We prove two major theorems in this section. We provide the mathematically precise argument for each but, as formality often obscures the true intuition, we also provide an "English version" so that the reader sees a rough sketch.

Our mathematical proofs require induction on the state space $S$, so we need a "measure of progress" for state vectors $s$. For any $s \in S$, define $m(s) := n(k+1) - |s|$, the number of loss units that must still be assigned before the dead state is reached. Clearly $m(s) = 0$ if and only if $s = d$.

Theorem 8. For all states $s$, $V'(s) = \frac{1}{n}\tau(s)$.

Proof (English version): Assume that the Gambler always plays according to the distribution vector $p(s)$. Then we may think of the Casino's choices as a walk around the state graph and, as we discussed at the end of Section 3, a collection of the "weights" $p_i(\cdot)$ along the way, ending at $d$. But as we proved in Corollary 7, it does not matter what path is taken: the Casino will always receive $\frac{1}{n}(\tau(s) - \tau(d)) = \frac{1}{n}\tau(s)$ on any path from $s$ that ends at $d$. If the Gambler ever chooses a distribution $w$ different from $p(s)$ at some state $s$, then the Casino can simply play $\ell = e_j$ for any $j$ with $w_j > p_j(s)$, and on this round the Casino forces loss greater than $p_j(s)$. This means that on some path starting from $s$, the Casino accrues total weight/loss larger than $\frac{1}{n}\tau(s)$, and therefore the distribution $w$ at $s$ was non-optimal for the Gambler. We conclude that $p(\cdot)$ is the only optimal assignment of distributions to states.

Proof (Formal version): We induct on $m(s)$. First we check the base case $s = d$. In this case, the expected path length is exactly 0 since we have already reached the dead state. Thus $V'(d) = 0 = \frac{1}{n}\tau(d)$, as desired.

Now assume that $m(s) > 0$. Then

$V'(s) = \min_{w \in \Delta_{A(s)}} \max_{i \in A(s)} \big( w_i + V'(s \oplus e_i) \big)$
$\;\overset{\text{(induc.)}}{=}\; \min_{w \in \Delta_{A(s)}} \max_{i \in A(s)} \big( w_i + \tfrac{1}{n}\tau(s \oplus e_i) \big)$
$\;\le\; \max_{i \in A(s)} \big( p_i(s) + \tfrac{1}{n}\tau(s \oplus e_i) \big)$
$\;\overset{\text{(Lem. 6)}}{=}\; \max_{i \in A(s)} \Big( \tfrac{1}{n}\big(\tau(s) - \tau(s \oplus e_i)\big) + \tfrac{1}{n}\tau(s \oplus e_i) \Big) \;=\; \tfrac{1}{n}\tau(s).$

We prove $V'(s) \ge \frac{1}{n}\tau(s)$ by a similar induction. Assume that the Gambler chooses an optimal distribution $w$, which may indeed be different from $p(s)$. For any $i \notin A(s)$, $p_i(s)$ is defined as zero; for the optimal strategy $w_i = 0$ as well, because otherwise the Casino could incur unbounded loss by playing $e_i$ repeatedly. If $w$ and $p(s)$ are different distributions on the live events $A(s)$, there must exist some $j \in A(s)$ for which $w_j > p_j(s)$. We then have

$V'(s) = \max_{i \in A(s)} \big( w_i + V'(s \oplus e_i) \big) \;\overset{\text{(induc.)}}{=}\; \max_{i \in A(s)} \big( w_i + \tfrac{1}{n}\tau(s \oplus e_i) \big) \;\ge\; w_j + \tfrac{1}{n}\tau(s \oplus e_j) \;>\; p_j(s) + \tfrac{1}{n}\tau(s \oplus e_j) \;\overset{\text{(Lem. 6)}}{=}\; \tfrac{1}{n}\tau(s).$

Combining the two directions gives $V'(s) = \frac{1}{n}\tau(s)$; moreover, the strict inequality above shows that any $w \ne p(s)$ is suboptimal.

Corollary 9. For any $s \ne d$, $p(s)$ is the unique optimal probability vector for the learner in the game corresponding to $V'$.

Proof: See the end of the last proof.

Corollary 10. For all $s$ and all $i \in [n]$, $p_i(s) = V'(s) - V'(s \oplus e_i)$.

Proof: This follows from the previous theorem and Lemma 6.

We need one more lemma before we can prove our main result.

Lemma 11. For any state $s$ and distinct events $i, j \in A(s)$, we have $p_i(s) < p_i(s \oplus e_j)$.

This fact is intuitive: if losses are assigned randomly, then the probability that the $i$th event survives last strictly increases when another event suffers a loss. We prove this precisely below.

Proof: Showing $p_i(s) \le p_i(s \oplus e_j)$ is straightforward: any sequence $S_0, S_1, S_2, \ldots$ that brings $s$ to the one-live state $o_i$ also brings $s \oplus e_j$ to $o_i$. Indeed, if $s \oplus S_t = o_i$ for some $t$, then certainly $(s \oplus e_j) \oplus S_t = o_i$ as well. To show that the inequality is strict, we need only exhibit a positive-probability set of sequences for which $s \oplus e_j$ is brought to $o_i$ but $s$ is not. Take any sequence $S_0, S_1, \ldots$ such that $s \oplus S_t = d - e_i - e_j$ (where the only events remaining are $i$ and $j$) and where $S_{t+1} = S_t + e_i$. Then $(s \oplus e_j) \oplus S_t = o_i$, but $s \oplus S_{t+1} = s \oplus (S_t + e_i) = o_j$, so the walk from $s$ survives to $o_j$ rather than $o_i$.

Theorem 12. For all states $s$, $V(s) = V'(s) = \frac{1}{n}\tau(s)$.
Proof (English version): Imagine a Gambler who plays the distribution $p(s)$ at every state $s$. We already know that the Casino can use its modified-game strategy and simply play unit vectors $\ell = e_i$ on each round to force $\frac{1}{n}\tau(s)$ loss. Yet since $\ell$ is now unrestricted, can the Casino obtain more? The answer is no. Consider what happens if the Casino chooses an $\ell$ larger than a unit vector, say $\ell = e_i + e_j$ for simplicity. On this round it obtains $p_i(s) + p_j(s)$, but it can do better. We proved in Lemma 11 that survival probabilities strictly increase, so $p_i(s) < p_i(s \oplus e_j)$. Thus a more patient Casino could choose $\ell = e_j$ on this round, obtain $p_j(s)$, and then choose $\ell = e_i$ on the next round to obtain $p_i(s \oplus e_j)$. As $p_j(s) + p_i(s \oplus e_j) > p_j(s) + p_i(s)$, the Casino only does worse by playing non-unit vectors. This suggests that the Gambler has a strategy under which the Casino can inflict only as much loss as in the modified game, and thus the value $V(s)$ is no different from $V'(s)$.

Proof (Formal version): Certainly $V(s) \ge V'(s)$, since the Casino is given strictly fewer choices in the modified game. Thus we are left to show that $V(s) \le V'(s)$. We proceed via induction on $m(s)$. By definition, $V(s) = V'(s)$ for the case $s = d$. Now assume that $V(s') = V'(s')$ for all states $s'$ with $m(s') < m(s)$. We proceed by directly analyzing the recursive definition (2). Assume that the Gambler distributes his wealth on the live events $A(s)$ according to the (possibly non-optimal) distribution $w = p(s)$, and let $\ell \in \{0,1\}^n$ be an optimal choice of the Casino (which may depend on the Gambler's choice). By Definition 1, the chosen loss vector cannot be $\mathbf{0}$ and all events with loss one must be in $A(s)$. More precisely,

$V(s) = \min_{w \in \Delta_{A(s)}} \max_{0 \ne \ell' \subseteq A(s)} \big( w \cdot \ell' + V(s \oplus \ell') \big) \;\le\; \max_{0 \ne \ell' \subseteq A(s)} \big( p(s) \cdot \ell' + V(s \oplus \ell') \big) \;\overset{\text{(ind.)}}{=}\; \max_{0 \ne \ell' \subseteq A(s)} \big( p(s) \cdot \ell' + V'(s \oplus \ell') \big) \;=\; p(s) \cdot \ell + V'(s \oplus \ell).$

If $\ell$ is a unit vector $e_i$ with $i \in A(s)$, then

$V(s) \le p(s) \cdot e_i + V'(s \oplus e_i) = p_i(s) + V'(s \oplus e_i) \overset{\text{(Cor. 10)}}{=} V'(s),$

and in this case $V(s) = V'(s)$ and we are done. We now prove by contradiction that $\ell$ can have no more than one non-zero coordinate. Assume indeed that $|\ell| > 1$, i.e. it admits a decomposition $\ell = e_i + \bar{\ell}$ for some $i$ and bit vector $\bar{\ell} \ne 0$ with $\bar{\ell}_i = 0$. Applying Lemma 11 repeatedly, we have $p_i(s) < p_i(s \oplus \bar{\ell})$, and therefore

$p(s) \cdot \ell + V'(s \oplus \ell) = p_i(s) + p(s) \cdot \bar{\ell} + V'(s \oplus \ell)$
$\;\overset{\text{(Lem. 11)}}{<}\; p_i(s \oplus \bar{\ell}) + p(s) \cdot \bar{\ell} + V'(s \oplus \ell)$
$\;\overset{\text{(Cor. 10)}}{=}\; V'(s \oplus \bar{\ell}) - V'(s \oplus \ell) + p(s) \cdot \bar{\ell} + V'(s \oplus \ell)$
$\;=\; p(s) \cdot \bar{\ell} + V'(s \oplus \bar{\ell}).$

But the statement $p(s) \cdot \ell + V'(s \oplus \ell) < p(s) \cdot \bar{\ell} + V'(s \oplus \bar{\ell})$ implies that $\ell$ is a non-optimal choice for the Casino, contradicting our assumption that $\ell$ was optimal. Hence the optimal $\ell$ is a unit vector of a live event, and $V(s) \le V'(s)$ as shown above.

Corollary 13. For any $s \ne d$, if the learner plays the optimal probability vector $p(s)$, then the only optimal responses of the adversary in the recurrence (2) for $V$ are the unit vectors of live events.

Proof: Proved at the end of the last theorem.

5 Recurrences, Combinatorics and Randomized Algorithms

The quantities $V(s)$, $\tau(s)$ and $p_i(s)$ have a number of interesting properties that we lay out in this section.

5.1 Some Recurrences

The expected path length $\tau(s)$ satisfies a very natural recursion. When $s = d$, the path length is deterministically 0 and therefore $\tau(d) = 0$. Otherwise,

$\tau(s) = 1 + \frac{\sum_{i=1}^{n} \tau(s \oplus e_i)}{n}.$   (5)

That is, the expected path length is 1, for the current step of the path, plus the expected path length of the next random state. Since the next state is chosen uniformly at random from the set $\{ s \oplus e_i : i = 1, \ldots, n \}$, each is chosen with probability $\frac{1}{n}$, hence the normalization factor.

Of course, our original quantity of interest is $V(s)$, and as we showed in Theorem 12, $V(s) = \frac{1}{n}\tau(s)$. This immediately gives us a recursion for $V$:

$V(s) = \frac{1}{n}\Big( 1 + \frac{\sum_{i=1}^{n} \tau(s \oplus e_i)}{n} \Big) = \frac{1}{n} + \frac{\sum_{i=1}^{n} V(s \oplus e_i)}{n}.$

This recurrence, while true for the function $V(\cdot)$, is ambiguous because $V(s)$ can occur on both sides of the equation: whenever $i \notin A(s)$, $V(s \oplus e_i) = V(s)$. However, we can rearrange all $V(s)$ terms to obtain the following well-defined recursion:

$V(s) = \frac{1 + \sum_{i \in A(s)} V(s \oplus e_i)}{|A(s)|}.$   (6)
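A direct way to evaluate recursion (6) is memoized dynamic programming over the state space $S$. The following Python sketch is only an illustration of the recurrence (its running time grows with the size of $S$, which is exponential in $n$); the helper names are our own.

```python
from functools import lru_cache

def game_value(k):
    """Return a function V computing the game value via the recursion (6):
    V(d) = 0 and V(s) = (1 + sum_{i in A(s)} V(s (+) e_i)) / |A(s)|."""
    def live(s):
        return [i for i, si in enumerate(s) if si <= k]

    def bump(s, i):
        t = list(s)
        t[i] = min(t[i] + 1, k + 1)   # rounded addition s (+) e_i
        return tuple(t)

    @lru_cache(maxsize=None)
    def V(s):
        A = live(s)
        if not A:                      # s is the dead state d
            return 0.0
        return (1.0 + sum(V(bump(s, i)) for i in A)) / len(A)

    return V

# Example: n = 2 events and k = 0. The value from the initial state is 1.5,
# and from a one-live state it is 1, as argued in Section 2.
V = game_value(k=0)
print(V((0, 0)), V((0, 1)))
```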
We can find a similar recurrence for $p_i(\cdot)$. For the one-live states $o_i$ we have $p_j(o_i) = 1$ if $i = j$ and 0 otherwise. If $|A(s)| > 1$, then

$p_i(s) = \frac{\sum_{j=1}^{n} p_i(s \oplus e_j)}{n}.$

As $p_i(s)$ is the probability of ending at state $o_i$ after executing the Markov chain, this formula is obtained by conditioning on one step of the Markov process. That is, the probability of ending at state $o_i$ is $\sum_j \Pr(j \text{ chosen}) \cdot \Pr(\text{the random process takes } s \oplus e_j \text{ to } o_i)$. This recurrence suffers from the same problem as our initial recurrence for $V(\cdot)$: $p_i(s)$ can occur on both sides of the equality. We again solve this problem by rearranging terms and obtain

$p_i(s) = \frac{\sum_{j \in A(s)} p_i(s \oplus e_j)}{|A(s)|}.$

5.2 Combinatorial Sums

A further analysis gives us exact expressions for both $p_i(s)$ and $V(s)$ in terms of infinite sums of multinomials.

Proposition 5.1. For any state $s \in S$,

$p_i(s) = \sum_{r \,:\, s \oplus r = o_i} \binom{|r|}{r_1, r_2, \ldots, r_n} \Big(\frac{1}{n}\Big)^{|r|+1}.$

Proof: By definition, $p_i(s)$ is the probability that $s$ eventually reaches the one-live state $o_i$. To compute this probability, we consider at what point the Markov process exits the state $o_i$ and enters $d$. Recall the random variable $S_t$ defined in Section 3. Take any $r$ for which $s \oplus r = o_i$ and condition on $S_{|r|} = r$. Then

$p_i(s) = \sum_{r \,:\, s \oplus r = o_i} \Pr(S_{|r|} = r) \cdot \Pr(S_{|r|+1} = r + e_i \mid S_{|r|} = r).$

The first probability is exactly $\binom{|r|}{r_1, r_2, \ldots, r_n} n^{-|r|}$ and the second probability is exactly $1/n$.

Since $V(s)$ can be written as $1/n$ times an expected path length, we obtain a similar expression for $V(s)$ as a sum of multinomials:

Proposition 5.2.

$V(s) = \frac{1}{n} \sum_{i=1}^{n} \sum_{r \,:\, s \oplus r = o_i} (|r| + 1) \binom{|r|}{r_1, r_2, \ldots, r_n} \Big(\frac{1}{n}\Big)^{|r|+1}.$

5.3 Randomized Approximations

Computing the exact value $V(s)$ for large but non-asymptotic values of the state vector is difficult because we have no polynomial-time algorithm. On the other hand, finding a randomized approximation to $V(s)$ can be done very efficiently. Indeed, as we now have a representation of $V(s)$ in terms of the length of a random walk, we can simply run the random walk $S_1, S_2, \ldots$ several times, note the length $T(s)$, and return the mean. Such random approximations require that the distribution of $T(s)$ have low variance, yet this certainly holds in the case at hand. While the random walk requires at least $n(k+1)$ iterations to finish, a simple argument shows that with probability $1 - \delta$ the random walk completes in fewer than $nk \log(nk/\delta)$ rounds.

Algorithm 1: Random Approximation to $V(s)$
  Input: state $s$
  $t \leftarrow 0$
  for $i = 1, \ldots,$ NUMITER do
    $z \leftarrow s$
    repeat
      Sample $j \in \{1, \ldots, n\}$ u.a.r.
      $z \leftarrow z \oplus e_j$
      $t \leftarrow t + 1$
    until $z = d$
  end for
  Return $t / (n \cdot \text{NUMITER})$

If $R(s)$ is the random variable returned by the above algorithm, then clearly $\mathbb{E}[R(s)] = V(s)$. By increasing NUMITER, the variance of this estimate can be reduced quickly.

A randomized approximation for $p(s)$ can be obtained similarly.

Algorithm 2: Random Approximation to $p(s)$
  Input: state $s \ne d$
  $p \leftarrow \mathbf{0}$
  for $i = 1, \ldots,$ NUMITER do
    $z \leftarrow s$
    repeat
      Sample $j \in \{1, \ldots, n\}$ u.a.r.
      $z \leftarrow z \oplus e_j$
    until $z = o_j$ for some $j$
    $p \leftarrow p + e_j$
  end for
  Return $p / \text{NUMITER}$

The above algorithm approximately computes $p(s)$ in the following sense: if $R(s)$ is the random variable returned by the algorithm, then clearly $\mathbb{E}[R(s)] = p(s)$. Again, increasing NUMITER reduces the variance of the estimate.
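For concreteness, here is a minimal Python rendering of Algorithms 1 and 2. It mirrors the pseudocode above (NUMITER, the rounded update, and the stopping tests) and is meant as an illustrative sketch rather than a tuned implementation.

```python
import random

def estimate_V(s, k, num_iter=1000, rng=random):
    """Algorithm 1: estimate V(s) as (total walk length) / (n * NUMITER)."""
    n = len(s)
    t = 0
    for _ in range(num_iter):
        z = list(s)
        while any(zi <= k for zi in z):       # repeat until z = d
            j = rng.randrange(n)              # sample j in {1,...,n} u.a.r.
            z[j] = min(z[j] + 1, k + 1)       # z <- z (+) e_j
            t += 1
    return t / (n * num_iter)

def estimate_p(s, k, num_iter=1000, rng=random):
    """Algorithm 2: estimate the survival probabilities p(s) for s != d."""
    n = len(s)
    counts = [0] * n
    for _ in range(num_iter):
        z = list(s)
        alive = [i for i in range(n) if z[i] <= k]
        while len(alive) > 1:                 # stop once a single live event remains
            j = rng.randrange(n)
            z[j] = min(z[j] + 1, k + 1)
            alive = [i for i in range(n) if z[i] <= k]
        counts[alive[0]] += 1                 # the survivor determines o_j
    return [c / num_iter for c in counts]

# Example: with n = 3 symmetric events the survival probabilities are close to 1/3 each.
print(estimate_V((0, 0, 0), k=1), estimate_p((0, 0, 0), k=1))
```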
[Figure 1 omitted: "Randomly Computing V", a plot of the value $V(\mathbf{0})$ against the parameter $k$, comparing the exact value with 1-sample and 10-sample random-walk estimates.]

Figure 1: We illustrate the accuracy of the randomized approximation to $V(\mathbf{0})$ stated in Algorithm 1. The plot compares the exact value of $V(\mathbf{0})$ to that obtained by using either 1 or 10 samples of the random walk. Here $n = 100$.

[Figure 2 omitted: "The Optimal Bound vs. the Hedge Bound", a plot of the regret against the parameter $k$ for $n = 2, 10, 100$.]

Figure 2: We compare the optimal regret bound we obtain from $V(\mathbf{0})$ to the bound found in [FS97], which we refer to as the hedge bound. While asymptotically optimal, the hedge bound of $k + \sqrt{2k \log n} + \log n$ is not tight for small values of $n$ and $k$.

5.4 A Simple Strategy in a Randomized Setting

In the particular case of betting against the Casino, it may be necessary for the Gambler to compute $p_i(s)$ in order to place his bets optimally. In an alternative setting, however, a randomized algorithm may be sufficient. Let us consider the case in which the Gambler chooses to bet according to the outcome of several coin tosses. Further assume that the Casino can observe his strategy but cannot see the outcome of the coin tosses or his final bets. In this scenario, the Gambler can even bet all of his money on a single random event $I \in \{1, \ldots, n\}$ drawn according to some distribution, as long as $\mathbb{E}\,\mathbf{1}[I = i] = p_i(s)$ for all $i$; his expected loss is then $p(s) \cdot \ell$. For this scenario, randomly approximating $p$ is not necessary: only one sample is needed. To be precise, the Gambler can take the state $s$, run the random walk until the state reaches $o_i$ for some $i$, and then bet his full dollar on event $i$. This bet is correct in expectation, i.e. he picks event $i$ with probability $p_i(s)$, and thus his expected loss is exactly $p(s) \cdot \ell$. The key point is that sampling from the distribution $p(s)$ may be quite easy even when computing it exactly takes more time.

Note that the above method based on one sample is analogous to the way the Randomized Weighted Majority algorithm (WMR) approximates the Weighted Majority algorithm (more precisely, the WMC algorithm of [LW94]): running Algorithm 2 with NUMITER $= 1$ corresponds to WMR, while large NUMITER, i.e. the full distribution $p(s)$, corresponds to WMC.
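The one-sample betting strategy of Section 5.4 can be sketched in a few lines of Python: run a single walk from the current state and put the entire dollar on the last surviving event. The returned index is distributed according to $p(s)$. The helper below is an illustrative assumption of this sketch and reuses the walk logic of Algorithms 1 and 2.

```python
import random

def one_sample_bet(s, k, rng=random):
    """Return the index of the event on which the Gambler bets his full dollar.
    The returned index I satisfies Pr(I = i) = p_i(s), so the expected loss
    against any loss vector equals p(s) . l, as required in Section 5.4."""
    n = len(s)
    z = list(s)
    alive = [i for i in range(n) if z[i] <= k]
    assert alive, "the game is already over at the dead state d"
    while len(alive) > 1:
        j = rng.randrange(n)                  # one random draw of the walk
        z[j] = min(z[j] + 1, k + 1)
        alive = [i for i in range(n) if z[i] <= k]
    return alive[0]

# Example: a single randomized bet from the state s = (1, 0, 2) with k = 1.
print(one_sample_bet((1, 0, 2), k=1))
```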
6 Comparison to Previous Bounds

As mentioned in the introduction, the bound obtainable based on exponential weights [FS97] is

$k + \sqrt{2 k \log n} + \log n$   (7)

and can be shown to be asymptotically optimal [Vov98].² Having computed the minimax solution to the same game, we can compute the game-theoretically optimal bound $V(\mathbf{0})$ using Algorithm 1. For small values of $n$ and $k$, these bounds differ quite substantially. We present in Figure 2 a comparison of the regret for $n = 2, 10, 100$ and $k = 1, \ldots, 20$.

² A slightly better but more complicated bound than (7) was given in [Vov98]. In the full paper we compare the optimal bound to this one as well.

7 Connections to classic problems of probabilistic enumerative combinatorics

Theorem 12 shows that an optimal strategy for the Casino requires unit vector plays. This leads to alternative interpretations of the game in terms of well-studied random processes. For example, one can easily confirm that our game also describes the random process underlying a generalized form of the Coupon Collector's Problem [NS60], in which the collector buys cereal boxes one by one in order to obtain $K = k+1$ complete sets of $n$ baseball cards, assuming one card is randomly placed within each cereal box. The value of our game, $V(\mathbf{0})$, is in fact the expected number of cereal boxes, per baseball card, needed to obtain the desired $K$ complete sets.

Specifically, the probability generating function for the number of boxes purchased in the generalized Coupon Collector's Problem is [MW03]

$G_{n,K}(z) = \frac{n}{(K-1)!} \int_0^\infty e^{-nt/z} \Big( \sum_{j > k} \frac{t^j}{j!} \Big)^{n-1} t^{K-1} \, dt.$

Taking the derivative at $z = 1$ and dividing by $n$, we obtain the expected number of boxes, per card, needed to obtain $K$ complete sets, which is also the value of our game:

$V(\mathbf{0}) = \frac{n}{(K-1)!} \int_0^\infty t^{K} e^{-nt} \Big( \sum_{j > k} \frac{t^j}{j!} \Big)^{n-1} dt.$   (8)

Equation (8) gives an elegant closed form for the two-card case ($n = 2$):

$V(\langle 0, 0 \rangle) = K + \frac{K}{2^{2K}} \binom{2K}{K}.$

From (8) we also obtain the well-known asymptotic expression for the value, for large $n$ and fixed $K$:

$V(\mathbf{0}) = \big( \log n + (K-1) \log\log n \big)\,\big(1 + o(1)\big).$

The same asymptotic form appears in the analysis of an evolving random graph [ER60]. The random walk on the state lattice provides yet another interpretation of the same dynamics. For $K \gg n \gg 1$, the law of large numbers gives [NS60]

$V(\mathbf{0}) = K + O(K^{1/2}).$
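As a sanity check on the integral representation (8) and the two-card closed form as reconstructed above, the following Python sketch evaluates both numerically (scipy.integrate.quad is assumed to be available) so they can be compared with each other and with the exact recursion (6) for small $n$ and $k$.

```python
import math
from scipy.integrate import quad   # assumed available

def V_integral(n, k):
    """Value V(0^n) from the integral representation (8), with K = k + 1.
    The inner sum sum_{j>k} t^j/j! is rewritten as e^t - S_K(t), so the
    integrand reads t^K e^{-t} (1 - e^{-t} S_K(t))^{n-1} * n / (K-1)!,
    which avoids overflow for moderate t."""
    K = k + 1
    S = lambda t: sum(t**j / math.factorial(j) for j in range(K))
    f = lambda t: (n / math.factorial(K - 1)) * t**K * math.exp(-t) \
                  * (1.0 - math.exp(-t) * S(t)) ** (n - 1)
    val, _err = quad(f, 0, math.inf)
    return val

def V_two_cards(k):
    """Closed form for n = 2: V(<0,0>) = K + K * C(2K, K) / 2^(2K)."""
    K = k + 1
    return K + K * math.comb(2 * K, K) / 4**K

# Both should agree with the exact recursion (6), e.g. 2.75 for n = 2, k = 1.
print(V_integral(2, 1), V_two_cards(1))
```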
8 Conclusion

We showed in Corollary 13 that, against the optimal learning algorithm, the optimal strategy of the adversary is to choose one of the unit loss vectors as his response. Curiously enough, it can be shown that this is also true of the Weighted Majority algorithm (1). That is, any trial in which $q > 1$ experts incurred a unit of loss can be split into $q$ trials in which a single expert has a unit of loss, and doing this always increases the loss of the algorithm, for any update factor $\beta \in [0,1)$. This observation about the Weighted Majority algorithm might actually lead to improved loss bounds for this algorithm, perhaps in the way the parameter $\beta$ is tuned.

There remains also a deep question regarding the techniques introduced in this paper: how general is this method of computing the value of a game based on a random path? Can it handle slightly more involved problems? Examples we have considered include competing against $m$-sized sets of experts, discussed in [WK06], in which the loss of the algorithm is compared to the loss of the best $m$-subset. Another example is the problem of competing against permutations of $n$ objects [HW07], where the loss of a permutation is assigned linearly. Our preliminary investigation suggests that similar techniques can be adapted to handle such more complex problems. In the full paper we hope to delineate the scope of our new method of optimal algorithm design.

References

[ALW07] J. Abernethy, J. Langford, and M. K. Warmuth. Continuous experts and the Binning algorithm. In Proceedings of the 19th Annual Conference on Learning Theory (COLT06), pages 544-558. Springer, June 2007.

[CBFH+97] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. J. ACM, 44(3):427-485, 1997.

[CBFHW96] N. Cesa-Bianchi, Y. Freund, D. P. Helmbold, and M. K. Warmuth. On-line prediction and conversion strategies. Machine Learning, 25:71-110, 1996.

[ER60] P. Erdős and A. Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci., 5A:17-61, 1960.

[FS97] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to Boosting. J. Comput. Syst. Sci., 55(1):119-139, 1997. Special Issue for EuroCOLT '95.

[HW07] D. Helmbold and M. K. Warmuth. Learning permutations with exponential weights. In Proceedings of the 20th Annual Conference on Learning Theory (COLT07). Springer, 2007.

[KW99] Jyrki Kivinen and Manfred K. Warmuth. Averaging expert predictions. In Computational Learning Theory, 4th European Conference, EuroCOLT '99, Nordkirchen, Germany, March 29-31, 1999, Proceedings, volume 1572 of Lecture Notes in Artificial Intelligence, pages 153-167. Springer, 1999.

[LW94] N. Littlestone and M. K. Warmuth. The Weighted Majority algorithm. Inform. Comput., 108(2):212-261, 1994. Preliminary version in FOCS '89.

[MW03] A. Myers and H. S. Wilf. Some new aspects of the Coupon-Collector's problem. SIAM J. Disc. Math., 17:1-17, 2003.

[NS60] D. Newman and L. Shepp. The Double Dixie Cup problem. Amer. Math. Monthly, 67:541-574, 1960.

[Vov98] V. Vovk. A game of prediction with expert advice. J. Comput. Syst. Sci., 56(2):153-173, 1998. Special Issue: Eighth Annual Conference on Computational Learning Theory.

[WK06] M. K. Warmuth and D. Kuzmin. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In Advances in Neural Information Processing Systems 19 (NIPS 06). MIT Press, December 2006.