Adapting to a Changing Environment: the Brownian Restless Bandits

Aleksandrs Slivkins and Eli Upfal

Abstract

In the multi-armed bandit (MAB) problem there are k distributions associated with the rewards of playing each of k strategies (slot machine arms). The reward distributions are initially unknown to the player. The player iteratively plays one strategy per round, observes the associated reward, and decides on the strategy for the next iteration. The goal is to maximize the reward by balancing exploitation (the use of acquired information) with exploration (learning new information). We introduce and study a dynamic MAB problem in which the reward functions stochastically and gradually change in time. Specifically, the expected reward of each arm follows a Brownian motion, a discrete random walk, or a similar process. In this setting a player has to keep exploring continuously in order to adapt to the changing environment. Our formulation is (roughly) a special case of the notoriously intractable restless MAB problem. Our goal here is to characterize the cost of learning and adapting to the changing environment, in terms of the stochastic rate of the change. We consider an infinite time horizon, and strive to minimize the average cost per step, which we define with respect to a hypothetical algorithm that at every step plays the arm with the maximum expected reward at that step. A related line of work on the adversarial MAB problem used a significantly weaker benchmark, the best time-invariant policy. The dynamic MAB problem models a variety of practical online, game-against-nature type optimization settings. While building on prior work, algorithms and steady-state analysis for the dynamic setting require a novel approach based on different stochastic tools.

Microsoft Research, Mountain View CA. E-mail: slivkins at microsoft.com. Parts of this work were completed while A. Slivkins was a postdoc at Brown University.
Computer Science Department, Brown University, Providence RI. E-mail: eli at cs.brown.edu. Supported in part by NSF awards CCR-0121154 and DMI-0600384, and ONR Award N000140610607.

1 Introduction

The multi-armed bandit (MAB) problem [27, 5, 12] has been studied extensively for over 50 years in the Operations Research, Economics and Computer Science literature, modeling online decisions under uncertainty in a setting in which an agent simultaneously attempts to acquire new knowledge and to optimize its decisions based on the existing knowledge. In the basic MAB setting, which we term the static MAB problem, there are k time-invariant probability distributions associated with the rewards of playing each of the k strategies (slot machine arms). The distributions are initially unknown to the player. The player iteratively plays one strategy per round, observes the associated reward, and decides on the strategy for the next iteration. The goal of a MAB algorithm is to optimize the total reward by balancing exploitation (the use of acquired information) with exploration (learning new information). (In this paper the total reward is simply the sum of the rewards, following the line of work in [21, 2, 3] and many other papers; alternatively, many papers consider the time-discounted sum of rewards, e.g. see [5, 12, 29] and references therein.) For several algorithms in the literature (e.g. see [5, 2]), as the number of rounds goes to infinity the expected total reward asymptotically approaches that of playing a strategy with the highest expected reward. The quality of an algorithm for the static MAB problem is therefore measured by the expected cost, or regret, incurred during an initial finite time interval.
The regret in the first t steps is defined as the expected gap between the total reward collected by the algorithm and that collected by playing an optimal strategy in these t steps.

The MAB problem models a variety of practical online optimization problems. As an example, consider a packet routing network where a router learns about delays on routes by measuring the time to receive an acknowledgment for a packet sent on that route [4, 16]. The delay for one packet on a given route is a random value drawn from some distribution. The router must try various routes in order to learn about the delays. Trying a loaded route adds unnecessary delay to the routing of one packet, while discovering a route with low delay can improve the routing of future packets. Another application is in marketing and advertising. A store would like to display and advertise the products that sell best, but it needs to display and advertise various products to learn how well they sell. Similarly, a web search engine tries to optimize its revenue by displaying advertisements that would bring the largest number of clicks for a given web content. The company needs to experiment with various combinations of advertisements and page contents in order to find the best matches. The cost of these experiments is the loss of advertisement clicks when trying unsuccessful matches [25].

The above examples demonstrate the practical applications of the "explore and exploit" paradigm captured in the MAB model. These examples also point out the limitation of the static approach to the problem. The delay on a route changes gradually over time, and the router needs to continuously adapt its routing strategy to the changes in route delays. Taste and fashion change over time, so a store cannot completely rely on information collected in the previous season to optimize for the next one. Similarly, a web search engine continually updates its content matching strategies to account for the changing customers' response.

A number of models have been proposed for capturing the dynamic aspect of the MAB problem. Motivated by task scheduling, Gittins [13] considered the case where only the state of the active arm (the arm currently being played) can change in a given step, giving an optimal policy for the Bayesian formulation with time discounting. This seminal result gave rise to a rich line of work (e.g. [11, 12, 32, 31, 30, 6, 29]), a proper review of which is beyond the scope of this paper. In particular, Whittle [33] introduced an extension termed restless bandits [33, 7, 24], where the states of all arms can change in each step according to a known (but arbitrary) stochastic transition function. Restless bandits are notoriously intractable: e.g. even with deterministic transitions, the problem of computing an (approximately) optimal strategy is PSPACE-hard [26]. Guha et al. [14, 15] have recently made progress on some tractable special cases of the restless MAB problem. (These papers were published after the initial technical report version of this paper appeared.) Their motivations, the actual problems they considered, and the techniques they used are very different from ours. In [14] they gave a constant-factor approximation for the special case of the problem in which arms move stochastically between two possible states.
This result was improved to a 2-approximation in [15], and extended to arms with a larger number of possible states, but with a very strict set of transition probabilities that is not compatible with the stochastic processes discussed here.

Auer et al. [3] adopted an adversarial approach: they defined the adversarial MAB problem, where the reward distributions are allowed to change arbitrarily in time, and the goal is to approach the performance of the best time-invariant policy. This formulation has been further studied in [1, 20, 17, 22, 10, 9, 19, 8]. Auer et al. [3, 1] also considered a more general definition of regret, where the comparison is to the best policy that can change arms a limited number of times. Due to the overwhelming strength of the adversary, the guarantees obtained in this line of work are relatively weak when applied to the setting that we consider in this paper.

We propose and study here a somewhat different approach to addressing the dynamic nature of the MAB problem. We note that in a variety of practical applications the time evolution of the system, and in particular of the reward functions, is gradual. Obvious examples are price, supply and demand in economics, load and delay in networks, etc. A gradual stochastic evolution is traditionally modeled via a random walk or a Brownian motion; for instance, in Mathematical Finance the (geometric) Brownian motion (Wiener process) is the standard model for the continuous-time evolution of a stock price. In line with this approach, we describe the state of each arm, namely its expected reward at time t, via a Brownian motion. The actual reward at a given time is an independent random sample from the reward distribution parameterized by the current state of this arm, e.g. a 0-1 random variable with an expectation given by the state of the arm (in the web advertising setting this corresponds to a user clicking or not clicking on an ad). We are interested in systems that exhibit a stationary, steady-state behavior. For this reason, instead of the usual Brownian motion on the real line (which diverges to infinity) we consider a Brownian motion on an interval with reflecting boundaries. Following the bulk of the stochastic MAB literature, we assume that the evolution of each arm is independent (in fact, we conjecture that regret is maximized in the case of independently evolving arms).

Our goal here is to characterize the long-term average cost of adapting to such a changing environment in terms of the stochastic rate of change, namely the volatility of the Brownian motion. The paradigmatic setting for us is one in which each arm's state has the same stationary distribution and, therefore, all arms are essentially equivalent in the long term. In such a setting the standard benchmark, the best time-invariant policy, is uninformative. Instead, we optimize with respect to a more demanding (and also more natural) benchmark: a policy that at each step plays an arm with the currently maximal expected reward.

We consider two versions of the dynamic MAB problem described above. In the state-informed version an algorithm not only receives a reward of the chosen arm but also finds out the current state of this arm. This is the setting in the restless MAB problem as defined in Whittle [33] and the follow-up literature. In the second, state-oblivious, version an algorithm receives its reward and no other information.
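As a concrete illustration of the model just described, the following minimal Python sketch simulates expected rewards that follow reflected Gaussian random walks and exposes the two feedback variants. It is a sketch only: the Bernoulli reward distribution, the class and parameter names, and the seed are illustrative assumptions; the formal definitions appear in Section 1.1 below.

```python
import random

def reflect(x):
    """Map a free position x onto [0, 1] with reflecting boundaries."""
    y = x % 2.0                       # period-2 "unfolding" of the motion
    return y if y <= 1.0 else 2.0 - y

class DynamicBanditEnv:
    """Each arm's expected reward follows a reflected Gaussian random walk
    with volatility sigma_i; the realized reward is Bernoulli with that
    expectation (the click / no-click case mentioned above)."""

    def __init__(self, sigmas, seed=0):
        self.sigmas = list(sigmas)
        self.rng = random.Random(seed)
        # initial states drawn uniformly at random, as assumed in the paper
        self.states = [self.rng.random() for _ in self.sigmas]

    def step(self):
        """Advance every arm's state by one round."""
        self.states = [reflect(mu + self.rng.gauss(0.0, s))
                       for mu, s in zip(self.states, self.sigmas)]

    def play(self, i, state_informed=False):
        """Play arm i: return (reward, state) in the state-informed version
        and (reward, None) in the state-oblivious one."""
        reward = 1.0 if self.rng.random() < self.states[i] else 0.0
        return (reward, self.states[i] if state_informed else None)
```

For example, `env = DynamicBanditEnv([0.01] * 5)` followed by alternating `env.step()` and `env.play(i, state_informed=True)` reproduces the round structure of Section 1.1; the per-round benchmark is simply `max(env.states)`.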
This formulation generalizes the static MAB problem to stochastically changing expected rewards.

1.1 The Dynamic MAB problem

Let {D(µ) : µ ∈ [0; 1]} be a fixed family of probability distributions on [0; 1] such that D(µ) has expectation µ. Time proceeds in rounds. Each arm i at each round t has a state µ_i(t) ∈ [0; 1] such that the reward from playing arm i in round t is an independent random sample from D(µ_i(t)). At each round t an algorithm chooses one of the k alternative strategies ("arms") and receives a reward. In the state-oblivious version, the reward is the only information that the algorithm receives in a given round. In the state-informed version, the algorithm also finds out the current state of the arm that it has chosen. The distributions D(·) are not revealed to the algorithm (and are not essential to the analysis).

The state µ_i(·) varies in an interval with reflecting boundaries. To clarify the concept of reflecting boundaries, consider an object that starts moving on an interval I = [0; 1], reversing direction every time it hits a boundary. If the object starts at 0 and traverses distance x ≥ 0, its position is

    f_I(x) = \begin{cases} \bar{x}, & \bar{x} \le 1, \\ 1 - (\bar{x} - 1), & \bar{x} > 1, \end{cases}    (1)

where \bar{x} = x (mod 2) = x - 2\lfloor x/2 \rfloor. Similarly, for x < 0 we define f_I(x) as the position of an object that starts moving from 1 and traverses distance |x|.

For concreteness we focus here on the case when each arm's state follows a Brownian motion. Similar results hold for related stochastic processes such as discrete random walks (see the Extensions Section). (Since we only sample the arms at integer time points, we can equivalently describe the state as a sum of t i.i.d. normal increments; in fact, we allow the increments to come from a somewhat more general class of distributions.) The state of each arm i undergoes an independent Brownian motion on an interval with reflecting boundaries. Specifically, we define µ_i(t) = f_I(B_i(t)), where I = [0; 1] is the fundamental interval and B_i is an independent Brownian motion with volatility σ_i. Since we only sample µ_i(·) at integer times, we can also define it as a Markov chain:

    \mu_i(t) = f_I( \mu_i(t-1) + X_i(t) ),    (2)

where each X_i(t) is an i.i.d. sample from N(0, σ_i). The stochastic rate of change is thus given by σ_i, which we term the volatility of arm i.

We assume that for each arm i the initial state µ_i(0) is an independent uniformly random sample from I. This is a reasonable assumption given our goal to study the stationary behavior of the system. Indeed, the uniform distribution on I is the stationary distribution of the Markov chain (2), to which this Markov chain eventually converges. (The convergence follows from the ergodic theorem. It should be noted that the rate of convergence for Markov chains with infinite state spaces is a rather delicate matter, e.g. see Rosenthal [28]. In this paper the rate of convergence is non-essential. Moreover, the convergence itself does not appear in the proofs: it is used only as intuition and an additional justification for assuming the uniform distribution of the initial state.)

In the dynamic MAB problem, we measure the performance of a MAB algorithm with respect to a policy that at every step chooses a strategy with the highest expected reward. This policy changes in time, and thus it is a more demanding benchmark than the best time-invariant policy that is often used in the MAB literature.

Definition 1.1. Consider an instance of the dynamic MAB problem. For a given MAB algorithm A, let W_A(t) be the reward received by algorithm A in round t. Let Ø be an algorithm that in every round chooses a strategy with the highest expected reward. The dynamic regret in round t is R_A(t) = W_Ø(t) - W_A(t). Define the steady-state regret as

    \bar{R}_A = \limsup_{t \to \infty} \; \sup_{t_0} \; E\Big[ \frac{1}{t} \sum_{s = t_0 + 1}^{t_0 + t} R_A(s) \Big].    (3)

Thus, for any fixed R > \bar{R}_A the expected average dynamic regret of algorithm A over any sufficiently large interval is at most R, and this is the best possible upper bound of this form. Our goal is to bound \bar{R}_A in terms of the arms' volatility.

We use the following notation throughout the paper. The state of arm i at time t is µ_i(t). The maximal state at time t is µ*(t) = max_{i ∈ [k]} µ_i(t). An arm i is maximal in round t if µ_i(t) = µ*(t).

1.2 Results: the state-informed case

We present an algorithm whose steady-state regret is optimal up to a poly-log factor.

Theorem 1.2. Consider the state-informed dynamic MAB problem with k arms, each with volatility at most σ. Assume that k < σ^{-β} for some β < 1/2. Then there exists a MAB algorithm whose steady-state regret is at most Õ(k σ²).

The algorithm is very intuitive. An arm with the highest last-observed state is called a leader and is played often, e.g. at least every other round. Suppose the last time some other arm i was observed was Δt rounds ago. By the Azuma inequality, the state of this arm has changed by at most Δµ = Õ(σ √Δt) since then, with high probability. If the last-observed state of arm i plus Δµ is smaller than the state of the leader, then there is no point yet in trying arm i again. Else, we mark this arm suspicious and enqueue it to be played soon.

The main technical contribution here is the analysis, which is quite delicate since we need to deal with the complicated dependencies in the algorithm's behavior induced by the stochastically changing environment. Essentially, we manage to reduce the stochastic aspect of the problem to simple events in the state space. We achieve it as follows. Every time an arm is played, we spread the corresponding dynamic regret evenly over the corresponding idle time. This way we express the cumulative dynamic regret as a sum over the contributions of each arm in each round. We prove a uniform bound on the expectation of each such contribution. To this end, we identify a useful high-probability behavior of the system, derive deterministic guarantees conditional on this behavior (which is the tricky part), and then argue in terms of the corresponding conditional expectations.

Surprisingly, the steady-state regret of our algorithm essentially matches a lower bound based on a very simple idea: if in a given round the states of the best two arms are within σ of one another, then in the next round, with constant probability, either one of them can be above the other, so any algorithm incurs expected dynamic regret Ω(σ). (The former event happens with probability Ω(kσ), so the steady-state regret is Ω(kσ²). This is the entire proof!)

Theorem 1.3. Consider the state-informed dynamic MAB problem with k arms of volatility σ. Then the steady-state regret of any MAB algorithm is at least Ω(kσ²).

1.3 Results: the state-oblivious case

Our algorithm for the state-oblivious case builds on an algorithm from [2] for the static MAB problem. That algorithm implicitly uses a simple "padding" function that for a given arm bounds the drift of an average reward from its (static) expected value.
We design a new algorithm U C Bf which relies on a novel "padding" function f that accounts for the changing expected rewards. The analysis is quite technical: the specific results from [2] do not directly apply to our setting; instead, we need to "open up the hood" and combine the technique from [2] with some new ideas. Theorem 1.4. Consider the state-oblivious dynamic MAB problem with k arms such that each arm i has volatility at most i . Then there exists a MAB algorithm whose steadyk 1 2 2 ~ state regret is O(k av ), where av = k i=1 i . Note that (unlike the guarantee in Theorem 1.2), the guarantee here is in terms of an average volatility rather than the maximal one. 1.4 Using off-the-shelf MAB algorithms? be improved, possibly via a more refined mechanism for discounting information with time. Another open question is whether one can obtain the op~ timal O(k 2 ) steady-state regret for the state-informed version in the case when k -1/2 . Note that the greedy algorithm mentioned in Section 1.4 achieves steady-state regret ~ O(k ) which is non-trivial for any k -1 . 1.6 Organization of the paper In Sections 2 and 3 we present our main results for the stateinformed and the state-oblivious versions, respectively. Section 4 discusses using off-the-shelf MAB algorithms. Section 5 covers the extensions. 2 The state-informed dynamic MAB problem We consider the state-informed dynamic MAB problem where the volatility of each arm is at most . Recall that the state of arm i at time t is denoted µi (t). For arm i and time t, the last-seen time i (t) is the last time this arm has been played strictly before time t; the lastseen state i (t) = µi (i (t)) is the corresponding state. Definition 2.1. The leader in round t is the arm with a larger last-seen state, among the arms played in rounds t - 1 and t - 2; break ties in favor of the arm played in round t - 1. In our algorithm, the leader is our running estimate for an arm with the maximal state. We alternate rounds in which we always exploit ­ play the leader, with rounds in which we may explore other options. Since we define the leader in terms of the last two rounds only, our knowledge of its state is essentially up-to-date. Let (t) be the last-seen state of the leader in round t. 1 Let csusp = (log )1/2 be the factor to be defined later. Definition 2.2. An arm i is called suspicious at time t if t - i (t). (4) (t) - i (t) csusp If an arm i is not suspicious at time t, then with high probability its current reward is less than (t). If no arm is suspicious then, intuitively, the best bet is to play the leader. Roughly, our algorithm behaves as follows: if the time is even it plays the current leader, and if the time is odd it plays a suspicious arm if one exists, and the leader otherwise. To complete the description of the algorithm, we need to specify what it does when there are multiple suspicious arms. In particular, we need to guarantee that after an arm becomes suspicious, it is played eventually (and preferably soon). Definition 2.3. An arm i is active at time t if it is not the leader and it has been suspicious at some time t > i (t). The activation time iact (t) is the earliest such time t . An arm becomes active when it becomes suspicious. It stays active until it is played. The idea is to play an active arm with the earliest activation time. Algorithm 2.4. For bootstrapping, each arm is played once. At any later time t do the following. If t is even, play the current leader. 
If t is odd play an active arm (with the earliest activation time) if one exists, else play the leader. We ask whether similar results can be obtained using offthe-shelf MAB algorithms. Specifically, we investigate the following idea: take an off-the-shelf algorithm, run it and restart it every fixed number of rounds. For the state-informed version we consider the obvious "greedy" approach: probe each arm, choose the best one, play it for a fixed number m of rounds, restart. The greedy algorithm is parameterized by the phase length m which can be tuned depending on the number of arms and their volatility. We show that the greedy algorithm is indeed suboptimal as compared to Theorem 1.2: the dependence on volatility (which is smaller than one) is linear rather than quadratic; we provide both upper and lower bounds. For the state-oblivious version one can leverage on the existing work for the adversarial MAB problem [3]. This work assumes no restrictions on the state evolution, but provides guarantees only with respect to the best time-invariant policy, or a policy that switches arms a bounded number of times. We consider the following algorithm: run a fresh instance of algorithm E X P 3 from [3] for a fixed number m of rounds, then restart. Using the off-the-shelf performance guarantees for E X P 3 and fine-tuning m, one can (only) bound ~ the steady-state regret by O((k av )2/3 ), which is inferior to the result in Theorem 1.4. It is an open question whether one can obtain improved guarantees by tailoring the analysis in [3] to our setting. 1.5 Extensions and open questions We extend our results in several directions. First, we generalize the Markov-chain formulation (2) to allow the random increments Xi (t) to come from other distributions which has a certain "light-tailed" property, such as the discrete random walk. Second, we consider the setting in which each arm has a distinct fundamental interval. Third, we relax the assumption that the upper bound(s) on volatilities are known to the algorithm. The main question left open by this paper is to close the gap between the upper and lower bounds for the stateoblivious dynamic MAB problem. The only lower bound we have is Theorem 1.3. We conjecture that one may obtain a better bound based on the relative entropy-based technique from [3]. It is also possible that the algorithmic result can We will use a slightly more refined algorithm which allows for a more efficient analysis. Essentially, we give priority to arms whose state is close to the leader's. Definition 2.5. Arm i is high-priority at time t if it is active at this time and moreover iact (t) - i (t) 4k . Algorithm 2.6. For bootstrapping, each arm is played once. At any later time t do the following. If t is even, play the current leader. If t is odd play an active arm if one exists, else play the leader. If there are multiple active arms: · if t 1 (mod 4) then play an active arm with the earliest activation time; break ties arbitrarily · if t 3 (mod 4) then play a high-priority arm with the earliest activation time if one exists; else, play any active arm; break ties arbitrarily. The analysis of these two algorithms are very similar, except that Algorithm 2.4 has inefficiencies which lead to an extra k 2 factor in its regret. We focus on Algorithm 2.6. Theorem 2.7. Consider the state-informed dynamic MAB problem with k arms, each with volatility at most . Assume 1 that k < - for some < 2 . Then Algorithm 2.6 achieves 2 steady-state regret O(k log2 1/ ). 
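The following Python sketch implements Algorithm 2.6 as described above (leader, suspicious, active and high-priority arms). It is a sketch, not the paper's exact specification: the constant c_susp is set to an illustrative value rather than the Θ((log 1/σ)^{1/2}) factor used in the analysis, the 4k high-priority window is taken from Definition 2.5 at face value, and the caller is assumed to feed back the observed state of the played arm each round (the state-informed feedback model).

```python
import math

class StateInformedPolicy:
    """Sketch of Algorithm 2.6 for the state-informed dynamic MAB problem."""

    def __init__(self, k, sigma, c_susp=3.0):
        self.k, self.sigma, self.c_susp = k, sigma, c_susp
        self.last_play = [None] * k    # tau_i: last round arm i was played
        self.last_seen = [0.0] * k     # nu_i: state observed at that round
        self.activated = [None] * k    # activation time, or None if inactive
        self.history = []              # arm played in each round so far

    def _leader(self, t):
        # arm with the larger last-seen state among those played in rounds
        # t-1 and t-2; ties are broken in favor of the arm played in t-1
        return max([self.history[t - 1], self.history[t - 2]],
                   key=lambda i: self.last_seen[i])

    def choose(self, t):
        if t < self.k:                 # bootstrapping: play each arm once
            return t
        leader = self._leader(t)
        lead_state = self.last_seen[leader]
        # an arm becomes active as soon as it becomes suspicious (Defs. 2.2-2.3)
        for i in range(self.k):
            if i == leader or self.activated[i] is not None:
                continue
            slack = self.c_susp * self.sigma * math.sqrt(t - self.last_play[i])
            if lead_state - self.last_seen[i] <= slack:
                self.activated[i] = t
        active = [i for i in range(self.k) if self.activated[i] is not None]
        if t % 2 == 0 or not active:
            return leader              # even rounds (and idle odd rounds): exploit
        if t % 4 == 3:
            # prefer high-priority arms: activated within 4k rounds of last play
            high = [i for i in active
                    if self.activated[i] - self.last_play[i] <= 4 * self.k]
            if high:
                active = high
        return min(active, key=lambda i: self.activated[i])

    def observe(self, t, arm, state):
        """Record the feedback after playing `arm` in round t."""
        self.history.append(arm)
        self.last_play[arm] = t
        self.last_seen[arm] = state
        self.activated[arm] = None     # an active arm stays active until played
```

A driver loop would call `arm = policy.choose(t)`, obtain `(reward, state)` from a state-informed environment such as the one sketched in Section 1, and then call `policy.observe(t, arm, state)`.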
In the rest of this section we prove Theorem 2.7. ¯ Let RA (t) be the average dynamic regret up to time t. Then, letting Ti (t) be the set of times arm i was played before and including time t, we have t ¯ = 1i E [µ (t ) - µi (t )] . (5) E RA (t) t [k] Ti (t) Definition 2.8. A real-valued function f is well-behaved on an interval [t1 ; t2 ] if for any t, t + t [t1 ; t2 ] we have |f (t + t) - f (t)| < cwell t. (8) 1 where cwell = (log )1/2 will be chosen later. Definition 2.9. An instance of the dynamic MAB problem is well-behaved on a time interval I if (i) functions µ1 (t), . . . , µk (t) are well-behaved on I ; (ii) at each time t I there are at ost cnear = O(1) arms i m such that µi (t) < (8k + 15 k ) cwell . 7 A problem instance is well-behaved near time t it is wellbehaved on the time interval [t - 3 -2 ; t + -2 ]. Choosing the parameters. The factors cwell and cnear are chosen so that for any fixed t a problem instance is well-behaved near time t with probability at least 1 - -3 . In Definition 2.1, define csusp = 5 cwell . Our conditionally deterministic guarantees (conditional on the problem instance being well-behaved) are expressed by the following lemma. Lemma 2.10 (The Deterministic Lemma). Suppose a problem instance is well-behaved near time t. Fix arm i and let = µi (t). Then: (a) If = 0 and Ci (t) > 0 then 1 Ci (t) O( log )/ t - i (t), (9) Let us spread contributions of individual arms evenly over the corresponding idle time. Specifically, let us define µi (t) i (t) and moreover for some arm j = i we have t 1 µj (t) < O ( log ) - i (t). (b) If > 0 then Ci (t) O( 2 / ) 1 log2 . (10) = µ (t) - µi (t), = i+ (t) - i (t), Let us use Lemma 2.10 to derive the main result. Proof of Theorem 2.7: It suffices to prove (7). Let E (t) denote the event that the problem instance is well-behaved near t - i (t) and suptime t. By Lemma 2.10(a), letting x = 1 ~ pressing the log factors under the O(·) notation, E [Ci (t) | µi (t) = 0, E (t)] ~ O( /x) Pr[j = i : µj (t) where i+ (t) is the next time arm i is played after time i (t).6 Then we can re-write (5) as follows: . t = 1i ¯ µi (i (t )) E RA (t) E (6) t i (t ) [k] [t] We define the contribution of arm i in round t as µi (i (t)) . Ci (t) = i (t) A crucial idea is that we upper-bound E [Ci (t)] for each round t separately. Namely, we will prove that 1 E [Ci (t)] < O( 2 log2 ). ~ < O( x)] ~ O( 2 ). (11) By Lemma 2.10(b) for any > 0 we have ~ E [Ci (t) | µi (t) , E (t)] O( 2 / ) ~ Pr[µi (t) | µi (t) > 0, E (t)] O( ). Now (7) follows from (11-13) via a simple computation. In the rest of this section we prove Lemma 2.10. 7 This expression is tailored to (16) in the subsequent analysis. The event in question happens with probability at least 1 - O(cwell k2 )cnear . Recall that we assume k < - for some < 1 . Thus, a (large enough) constant cnear suffices to guarantee a 2 sufficiently low failure probability. (12) (13) (7) We identify the high-probability behavior of the processes ~ {µi (t)}i[k] . Specifically, we consider the O( t) bound on deviations, and an O(1) bound on the number of nearoptimal arms. A large portion of our analysis is deterministic conditional on such behavior. In other words, i (t) and i+ (t) are the two consecutive times arm i is played such that i (t) < t i+ (t). 6 2.1 Deterministic bounds for the leader We will argue deterministically assuming that the problem instance is well-behaved. We split our argument into a chain of claims and lemmas. The proofs are quite detailed; one can skip them for the first reading. 
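Before proceeding with the deterministic claims, the regret-spreading step behind (5)-(6) can be checked numerically. The sketch below takes arbitrary state trajectories and a play sequence (illustrative inputs, not objects defined in the paper), computes the per-play regret sum and the spread per-round contributions C_i(t); the two totals agree up to the regret incurred at each arm's final play, a boundary effect.

```python
def contribution_decomposition(mu, plays):
    """mu[i][t]: state of arm i in round t; plays[t]: arm played in round t.
    Returns the per-play regret sum (as in (5)) and the sum of the spread
    per-round contributions (as in (6))."""
    k, T = len(mu), len(plays)
    mu_star = [max(mu[i][t] for i in range(k)) for t in range(T)]

    # regret accumulated at the rounds each arm is actually played
    direct = sum(mu_star[t] - mu[plays[t]][t] for t in range(T))

    # spread the regret incurred at a play time tau evenly over the idle
    # rounds tau+1, ..., tau_plus up to (and including) the next play tau_plus
    contrib = [0.0] * T
    for i in range(k):
        times = [t for t in range(T) if plays[t] == i]
        for tau, tau_plus in zip(times, times[1:]):
            per_round = (mu_star[tau] - mu[i][tau]) / (tau_plus - tau)
            for t in range(tau + 1, tau_plus + 1):
                contrib[t] += per_round      # this is C_i(t) for these rounds
    spread = sum(contrib)

    # `direct` and `spread` differ only by the regret at each arm's final play
    return direct, spread
```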
For shorthand, let E [t1 ; t2 ] denote the event that the (fixed) problem instance is wellbehaved on the time interval [t1 ; t2 ]. First, we argue that the leader's last-seen state, (·), does not decrease too much in one round. Claim 2.11. If E [t - 2; t] then (t + 1) (t) - 2 cwell . Proof. Assume that t is even (if t is odd the proof proceeds similarly). Recall that the leader in round t is some arm i played in one of the previous two rounds. It follows that (t) = i (t) µi (t) + 2 cwell . Moreover, the leader (i.e. arm i) is played in round t and therefore (t + 1) µi (t), claim proved. Second, each arm becomes active eventually. Claim 2.12. Any arm i becomes active at most -2 rounds after it is played: iact (t) - i (t) -2 for any time t. Proof. If t - i (t) -2 then (4) is trivially true. Third, we show that a currently maximal arm has been activated within 4k rounds from its last-seen time, and therefore it has been played in the previous 8k rounds. The proof of this lemma is one of the crucial arguments in our analysis. Lemma 2.13. Suppose E [t - at time t. Then -2 Proof. Let µ (t) = µi (t) for some arm i, and let = i (t) be the last time this arm was played. By Lemma 2.13 we have t - 8k . Therefore ( + 1) µi ( ) µi (t) - cwell 8k , and the claim follows by Claim 2.11. Fifth, we show that high-priority arms are played very soon after they become active. Claim 2.15. Suppose arm i is a high-priority active arm at time t. Assume E [t - -2 ; t]. Then t - iact (t) 4 cnear . Proof. Fix time t and let t = iact (t) be the activation time of arm i. Then by Definition 2.4 and Definition 2.3 (t ) - i (t ) csusp t - t csusp 4k . Using Claim 2.14 to relate (t ) and µ (t ), and using the fact that i (t ) = µi ( ) and that µi (·) is well-behaved, we obtain ) µi (t (8k + 15 k ) cwell . (16) Lemma follows by Definition 2.9(ii) which is, in fact, tailored to (16). Now we have the tools needed to prove a stronger version ~ of Claim 2.14: µ (t) - (t) O( ). Lemma 2.16. If the problem instance is well-behaved on [t - -2 ; t] then µ (t) - (t) O(cwell ). Proof. Let i be an active arm at time t. By Lemma 2.13 iact (t) - i (t) 4k , so at time iact (t) arm i is a high-priority active arm. By Claim 2.15 t - iact (t) 4 cnear = O(1). By Lemma 2.13 it follows that t - i (t) O(1). Now ( + 1) µi ( ) by definition of the leader; (t) ( + 1) - O(cwell ) by Claim 2.11; and also µ (t) µ ( ) + O(cwell ) since the problem instance is well-behaved. Putting it together, we obtain the lemma. 2.2 Proof of The Deterministic Lemma Let = i (t) and recall that we denote = µi (t). By Lemma 2.13 we have t - 8k . Since the problem instance is well-behaved on [t - 8k ; t], it follows that µ (·) is well-behaved, too, and therefore (17) |µi (t) - µi ( )| 2 cwell t - , which immediately implies (9). To obtain (10) note that (17) in fact applies to any arm j , in particular to an arm j that is maximal at time . To prove Lemma 2.10(b), it suffices to prove the following two inequalities: i (t) µi ( ) 1 ( / )2 / log , ; t] and arm i is maximal iact (t) - i (t) t - iact (t) 4k . Proof. Note that t - iact (t) 4k , since otherwise after becoming active at time iact (t) arm i would have been played strictly before round t, contradiction. Let = i (t). For the sake of contradiction assume that iact (t) - >t- iact (t). = (14) iact (t) . - 1, by (15) Since arm i is not suspicious at time t Definition 2.2 we have (t ) - i (t ) csusp t - By Claim 2.12 the problem instance is well-behaved on [ ; t]. 
It follows that i (t ) = µi ( ) µi (t) - cwell t - (t ) = µj (t ) µj (t) + cwell t - t , where arm j is the leader in round t , and t is one of the two rounds preceding t . Plugging this into (15) and using (14), we see that µj (t) > µi (t), contradiction. Fourth, we show that the leader's last-seen state is not much worse than the maximal state. Claim 2.14. If E [t - -2 ; t] then µ (t) - (t) (8k + 8k ) cwell . (18) (19) O( + log 1 ). Proof of (18): We consider two cases. First, if we have µi ( ) < /2 then by (17) we obtain 2 cwell t - |µi (t) - µi ( )| /2, and (18) follows since i (t) t - . Second, assume µi ( ) /2. Then by Lemma 2.16 for any time t ( ; t + -2 ) we have (t ) - µi ( ) µ (t ) - O(cwell ) + µi ( ) - µ ( ) t - + O (1). /2 - cwell This is at least csusp t - as long as it is the case that t - (12 cwell / )-2 . So for any such t arm i is not suspicious, proving (18). Proof of (19): First, note that if iact (t) - i (t) 4k then by Definition 2.5 arm i is a high-priority active arm at time iact (t), so by Claim 2.15 we have t - iact (t) O(1) and so t - i (t) O(1) by Lemma 2.13. It follows by (17) that µi ( ) · for any time t, letting ti = Ni (t) we have < | O(t-4 ). Pr W i (t) - µi (0)| > fi (t, ti ) (22) The family {fi }i[k] is a padding for the problem instance. We build on an algorithm U C B 1 from [2] for the static MAB problem. We define a generalization of U C B 1, which we call U C Bf , which is parameterized by a padding f = {fi }i[k] . Algorithm 3.3 (U C Bf ). In each round t play any arm W . i argmax i (t) + fi (t, Ni (t)) i[k] µi (t) + O( ), and we are done. In what follows we will assume that iact (t) - i (t) > 4k . Note that for any time t ) ) w (20) - e have - (t max(µ (t = 1), µ (t 2)) The original U C B 1 algorithm is defined for a specific padding f , and in fact does not explicitly uses the notion of a padding. We introduce this notion here in order to extend the ideas from [2] to our setting. We incorporate the analysis from [2] via the following lemma which, essentially, bounds the number of times a suboptimal arm is played by the algorithm. Lemma 3.4 (Auer et al. [2]). Consider an instance of the state-oblivious MAB problem with a padding f = {fi }i[k] . Consider the behavior of algorithm U C Bf in the first t rounds. Then for each arm i and any ti < t we have fi (t, ti ) 1 2 µi (t) µ (t + 2 cwell . Let t - 1 be the round immediately preceding the activation time. Since arm i is not suspicious at time t , csusp t - (t ) - µi ( ) µ (t - µi ( ) + 2 cwell . µi (t ) + cwell (2 + t - ). Since csusp = 5 cwell , it follows that µi (t ) ) iact (t) E [Ni (t)] ti + O(1). (23) + 2 cwell 4 cwell t µi (t ) + 2 cwell t 3 2 µi (t ) + 2 cwell . - . (21) Combining (17) and (21), we obtain µi ( ) - This lemma is implicit in Auer et al. [2], where it is the crux of the main proof. That proof considers the static MAB problem and (implicitly) a specific padding f . We will use U C Bf where f = {fi }i[k] is defined by 2 8 fi (t, ti ) = ln(t)/ti + i t log t. (24) Define the saverage dynamic regret of an algorithm A ¯ RA (t) = 1 [t] RA (s). We prove the following guarant tee for algorithm U C Bf : Theorem 3.5. Consider the state-oblivious dynamic MAB problem with k arms. Suppose the volatility of each arm i is at most i . Then there exists time t0 such that - ¯ E [RU C Bf (t0 )] O(k av ) log3/2 (av1 ), k 1 2 2 where av = k i=1 i . Finally, by (17), (20) and (21) we obtain ) µi (t µi (t) + 2 cwell t - t ) 1 µi (t) + 2 µi (t + 2 cwell . 2 µi (t) + 4 cwell . ) µi ( 3 µi (t) + 6 cwell . 
µi (t ) (25) 3 The state-oblivious dynamic MAB problem We consider the state-oblivious dynamic MAB problem with k arms where the volatility of each arm i is at most i . Definition 3.1. For each arm i, Ni (t) is the number of times it has been played in the first t - 1 rounds, and W i (t) is the corresponding average reward. Let W i (0) = 0 if Ni (t) = 0. For shorthand, let µi = µi (0) be the initial state. Definition 3.2. Consider an instance of the state-oblivious dynamic MAB problem. A function fi : N × N R+ is a padding for arm i if the following two properties hold: · fi (t, ti ) is increasing in t and decreasing in ti , To obtain Theorem 1.4 from Theorem 3.5, we start a fresh instance of algorithm U C Bf after every t0 steps. We take advantage of the facts that (i) the "restarting times" are deterministic and, in particular, independent of the past history, and (ii) in any fixed round each µi (t) is distributed independently and uniformly on [0; 1]. In the rest of this section we prove Theorem 3.5. We start with a very useful fact about the state evolution µi (t). In 1 general, if µi (0) > 2 then due to the influence of the upper boundary the expected state E [µi (·)] drifts down from its initial value. The following claim upper-bounds such drift. Let us use a shorthand for the second summand in (24): 8 i (t) = i t log t. Claim 3.6. Fix arm i and integer times t t . Then Pr[ |µi (t) - µi | > i ] < t-3 where µi = µi (0) and i = i (t ), and therefore E [µi (t) | µi ] min(µi , 1 - i ) - t-2 . (27) (26) The Chernoff-Hoeffding bounds (applied to the probability ¯ space induced by conditioning on F ) say precisely that the condition (29) implies the following tail inequality: | 2 ^ ¯ Pr S - µ m| > m + a | F 2 e-2a /m for any a 0. We obtain (28) by taking a = 2m ln T . Proof. Recall that the state µi (t) is defined as fI (Bi (t)) where Bi is a Brownian motion with volatility i , and fI is the "projection" (1) into the interval I = [0; 1] with reflective boundaries. Note that µi = Bi (0). It follows that |µi (t) - µi | > i only if |Bi (t) - µi | > i . We know that for any c > 1 we have 2 Pr[ |Bi (t) - µi | > c i t] < 2 e-c /2 . We obtain (26) setting c = 6 log t . Now let us prove (27). Define f (µ) = E [ µi (t) | µi ]. Note that if µ < 1 then f (µ) > µ. Also, note that f (µ) is 2 increasing and f ( 1 ) = 1 by symmetry. Therefore, it suffices 2 2 to prove (27) under the assumption that 1 < µi 1 - i . 2 Consider T = min(t, TB ), where TB = min{s N : Bi (s) (0; 1)}. Then Zs = µi (min(s, T )) is a martingale such that Z0 = µ and T is a bounded stopping time. By the Optional Stopping Theorem it follows that E [ZT ] = µ. By (26) we have TB t with probability at least 1 - t-2 , in which case T = t and µ = ZT = µi (t). Thus (27) follows. Using Claim 3.6, let us argue that (24) is indeed a padding. Essentially, the first summand in (24) is tuned for an application of Chernoff-Hoeffding Bounds, whereas the second one corrects for the drift. Lemma 3.7. The family f defined by (24) is a padding. Proof. We need to prove (22). Fix arm i and time t. Let {tj } 1 be the enumeration of all times when arm i is played. j= ^ Let Xj = µi (tj ) be the state of arm i in round t. Let Xj be the actual reward collected by the algorithm from arm i j in round tj . Let us define the sums S = [n] Xj and j ^ j , where n = Ni (t) is the number of times S= X [n] To argue about algorithm U C Bf , we will use the following notation: Definition 3.8. 
We will use the following notation: µi = µi (0), i (t) = min(µi , 1 - i (t)), i = µ - µi , µ = µ (0) S (t) = {arms i : i 4i (t)}. Lemma 3.9. Consider any algorithm for the state-oblivious dynamic MAB problem. Then for each arm i and time t k N E i (t) W i (t) | µi i (t) E [Ni (t)] - t-2 . (30) The left-hand side of (30) is the total winnings collected by arm i up to time t. If the bandit algorithm always plays arm i, then s i (t) = t and the left-hand side of (30) is simply N equal to E [µi (s)], so the lemma follows from Claim 3.6. In this sense, Lemma 3.9 is an extension of Claim 3.6. The proof of (30) is a rather intricate exercise in conditional expectations and martingales. We defer it to Section 3.1. We combine Lemma 3.9 and Lemma 3.4 to derive a con¯ ditional bound on RU C Bf (t): Corollary 3.10. For any time t we have ¯ E RU C Bf (t) | µ1 , . . . , µk i k + O(1) µ - i (t) t2 S(t) i 1 + O( 1 log t) . (31) t i S (t) arm i is played before time t. Let µ = µi (0) and = i (t). We can rewrite (22) as follows: (28) Pr[|S - µ n| > 2n ln t + n] < O(t-4 ). Let F be the failure event when |µi (s) - µ| > for some s [t]. Recall that by Claim 3.6 the probability of F is at most t-4 . In the probability space induced by conditioning ^ ^ ¯ on X1 , . . . , Xj -1 and the event F , we have E = ^ ^ E [Xj ] = E [Xj |tj , Xj ] E [ E [Xj |tj ] ] = E [Xj ] [µ - , µ + ]. Going back to the original probability space, ^^ ^ ¯ E [Xj | X1 . . . Xj -1 , F ] [µ - , µ + ]. (29) Proof. Fix time t and let W i = W i (t), i = i (t) and Ni = Ni (t). Let R(t) be the left-hand side of (31). Using (30), i t R(t) = E [(µ - W i ) Ni ] [k] i [k] E [Ni ] (µ - i ) + t-2 . For each i S (t) we have µ - i 2i and by Lemma 3.4 E [Ni (t)] 32 ln(m)/2 + O(1). i We obtain Theorem 3.5 by integrating both sides of (31) with respect to µ1 . . . µk . Proof of Theorem 3.5: Fix time t and let i = i (t) and i = i (t). Note that (31) is, essentially, the sum over all arms. We partition the arms into three sets and bound the three corresponding sums separately. Note that the following holds for any fixed > 0: given µ and the event {i > }, the random variable µi is distributed uniformly on the interval [0; µ - ). We will use this property in the forthcoming integrations. First, we consider the set S = S (t). Conditional on µ , i = i P -1 E i E -1 | i > 4i r[i > 4i ] i S [k] Taking expectations on both sides, we obtain E [s Xs ] = E [s Ys ], which proves (35). Going from (35) to (36) is somtewhat more complicated. In what follows we denote S = [m] s Ys . Claim 3.11. If µ 1 - then E [S ] µ E [N ] - t-2 . Proof. As in Claim 3.6, we recall the definition µi (s) = fI (Bi (s)) where Bi is a Brownian motion with volatility i , and fI is the "projection" (1) into the interval I = [0; 1] with reflective boundaries. Note that µi = Bi (0). For brevity, denote Ys = Bi (s), and define the corres Y sponding shorthand S = [t] s s . Let F be the failure event when Ys 1 for some t m. Note that if this event does not occur, then Ys Ys for every time t [m] and therefore S S . We use this observation to express E [S ] in terms of E [S ]. Let p := Pr[F ] and note that it is at most m-4 . Then: E [S ] = (1 - p) E [S | not F ] + p E [S |F ] (1 - p) E [S | not F ] + p(µ + ti ) E [S ] (1 - p) E [S | not F ] + p E [S |F ] (1 - p) E [S | not F ] E [S ] - pti - p. To prove the claim, it remains to bound E [S ]. Let {sj } 1 be the enumeration of all times when arm i j= is played. Note that N = max{j : sj t}. Define Zj = Ysj for each j . 
We would like to argue that {Zj } 1 is a martinj= gale and N is a stopping time. More precisely, claim that this is true for some common filtration. Indeed, one way to define such filtration {Fj } 1 is to define Fj as the -algebra j= generated by sj +1 and all tuples (sl , Zl , Zl , Zl ) such that l j . Now using the Optional Stopping Theorem one can show that j Z E [S ] = [N ] Zj = E [N ] E [ 0 ], which proves the claim since Z0 = µ. To prove (36), it remains to consider the case µ > 1 - . Claim 3.12. if µ > 1 - then E [S ] (1 - ) E [N ] - t-2 . Proof. Let T be the smallest time s such that Ys 1 - . Let {sj } 1 be the enumeration of all times when arm i is j= played, and let J = max j : tj T . Conditioning on T and J , consider the entire problem starting from time T + 1. Then by Claim 3.11 we have: m s E s Ys |T , J (1 - )(E [N ] - J ) - t-2 . =T +1 i [k] - ln i 1 - O(k ln av1 ). (32) Second, let us consider the set S + of all arms i such that 0 < i < 4i . Conditional on µ , we obtain i i E µ - i O(i ) Pr[i < 4i | i > 0] S + [k] i [k ] O(i ) min(1, i /µ ). i 2 O(i ) Integrating over µ , we obtain i E S + µ - i [k] 2 O(k av t log t). (33) Third, we consider the set S of all maximal arms, i.e. the set of all arms i such that i = 0. We show the main steps of the argument, omitting the details of some straightforward integrations: Zi := I{i =0} (µ - i ) E [Zi ] = E [E [Zi |µ ]] = i E S 1 k 2 E [µ - i ] = O(i ) = µ - i i [k ] 2 E [Zi ] O(k av )(t log t). (34) Finally, using( 32-34), we take expectations in (31): = ¯ - O( k log t)((av t)2 + log av1 ). E RU C Bf (t) t l - The theorem follows if we take t0 = av og av1 . 3.1 Proof of Lemma 3.9: conditional expectations Fix arm i and time t. Let us introduce a more concise notation which gets rid of the subscript i. Let µ = µi (0) and = i (t), and denote N = Ni (t). For every time s, let Ys = µi (s), and let Xs be the winnings from arm i at time s if it is played by the algorithm.8 Let s be equal to 1 if arm i is played at time s, and 0 otherwise. To prove (30), we will show that s = s ( E s Xs E s Ys 35) [t] [t] min(µ, 1 - ) E [N ] + t-2 . (36) Note that s and Xs are conditionally independent given Ys . It follows that E [s Xs |Ys ] 8 = E [s |Ys ] E [Xs |Ys ] = E [s |Ys ] Ys = E [s Ys |Ys ] . t Let ST = s=T +1 s Ys . It follows that t S = ST + [T ] s Ys ST + (1 - ) J E [S ] = E [ST ] + (1 - )E [J ] (1 - ) E [N - J ] - t-2 + (1 - ) E [J ] (1 - ) E [N ] - t-2 . That is, Xs is an independent random sample from distribution D(Ys ), as defined in Section 1.1. 4 Using off-the-shelf algorithms In this section we investigate the following idea: take an offthe-shelf MAB algorithm, run it, and restart it every fixed number of rounds. We consider both the state-informed and state-oblivious versions of the dynamic MAB problem. We use the following notation: there are k arms, each arm i has volatility i , and the average volatility av is dek 1 2 2 fined by av = k i=1 i . We rely on the following lemma: Lemma 4.1. Let µ = µ (0) and let i argmax µi (0), ties broken arbitrarily. Then for any times t t 2 E [µ - µi (t)] O(k )(t-4 + av t log t ). We provide a matching lower bound. Theorem 4.3. Consider the setting in Theorem 4.2. Then ~ the steady-state regret of the greedy algorithm is (k av ). Proof Sketch. For simplicity assume i . It is known that in time t a Brownian motion with volatility drifts by at ~ least = ( t) with high probability. Thus for each arm i with high probability µi (t) 1 - /2, regardless of the initial value µi (0). 
Now we can obtain a lower bound that corresponds to Lemma 4.1: letting µ = max µi (i) and i argmax µi (i) be the arm chosen by the greedy algorithm, ~ E [µ - µi (t)] (k 2 t), (38) (37) More generally, we can consider arbitrary fixed times 0 t1 t2 . . . tk t t and define µ = max µi (ti ) and i argmax µi (ti ). The lemma is obtained, essentially, by combining Claim 3.6 and (34); we omit the details of the proof. Remark. The intuition is that each arm i is probed in round ti , so that µi (ti ) is the expected value of the corresponding probe. This lemma is similar to Claim 3.6 in that it bounds the downwards drift of E [µi (·)] which is caused by the proximity of the upper boundary. The difference is that here we specifically consider a "maximal" arm, e.g. when ti 0 we consider an arm which is maximal at time 0. 4.1 State-informed version: greedy algorithm For the state-informed version we consider a very simple, "greedy" approach: probe each arm once, choose one with the largest state, play it for a fixed number m - k of rounds, restart. Call this a greedy algorithm with phase length m. Theorem 4.2. Consider the state-informed dynamic MAB problem with k arms such that the volatility of each arm i is l - i . With phase length m = av og av1 , the steady-state regret of the greedy algorithm is at most k - 2 1 2 O(k av log av1 ), where av = k i=1 i . Proof. For the algorithmic result, fix phase length m > k and consider a single phase of the greedy algorithm. Assume without loss of generality that in the first k rounds of the phase our algorithm plays arm i in step i. Let µi = µi (i) be the corresponding rewards, and let µ be the largest of them. Then the greedy algorithm chooses arm i argmaxi[k] µi and plays it for m - k rounds. Consider the t-th of these m - k rounds and let Yt = µi (t + k ) be the state of arm i in this round. By Lemma 4.1 we have E [Yt ] E [µ ] - z , where z is the right-hand side of (37). Therefore, letting W be the per-round average reward in this phase, we have m-k 1 - E [W ] m t=1 Yt mm k (E [µ ] - z ) E [µ - W ] z + k m for any t > k . Now consider a given phase of the greedy algorithm. In the first k rounds the algorithm accumulates regret (k ), and in each subsequent round t the regret is the left-hand side of (38). The theorem follows easily. 4.2 State-oblivious version via adversarial MAB For the state-oblivious dynamic MAB problem, we use a very general result of Auer et al. [3] for the adversarial MAB problem. For simplicity, here we only state this result in terms of the present setting. Let W A (t) be the average reward collected by algorithm A during the time interval [1; t]. Theorem 4.4 (Auer et al.[3]). Consider the state-oblivious dynamic MAB problem with k arms. Let Ai be an algorithm that plays arm i at every step. Then there exists an algorithm, call it E X P 3, such that for any arm i and any time t E [W E X P 3 (t)] E (W Ai (t)) - O( k log t)1/2 . t For our problem, we restart E X P 3 every m steps, for some fixed m; call this algorithm E X P 3(m). Theorem 4.5. Consider the state-informed dynamic MAB problem with k arms such that the volatility of each arm i is at most i . Then there exists m such that algorithm E X P 3(m) has steady-state regret k - 2 1 2 O(k av log av1 )2/3 , where av = k i=1 i . Proof. Let use shorthand A = E X P 3(m). Let µ be the maximal expected reward at time 0, and suppose it is achieved by some arm i . Let A be the algorithm that plays this arm at every step. Let Yt = µi (t) the state of arm i in round t. 
Then by Lemma 4.1 we have E [Yt ] E [µ ] - zm , where z (m) is the right-hand side of (37). Therefore: = E [W A (m) | Y1 , . . . , Ym ] E [W A (m)] = E m 1 t=1 Yt ] mE[ µ - z (m) + µ ¯ E [RA (m)] = E - W A (m) W E A (m) - W A (m) ow using (39) and Theorem 4.4 we obtain k ¯ E [RA (m)] z (m) + O( m log m)1/2 . (39) E [µ ] k m (1 N + 1 m) 2 O(k m av log m) + l - = O(k av ) og av1 (40) for m = av l og - av1 . We choose m that minimizes the right-hand side of (40). We note in passing that we can also get non-trivial (but worse) guarantees for the state-oblivious dynamic MAB problem using two other off-the-shelf approaches: · a version of the greedy algorithm which probes each arm a few times in the beginning of each phase, · a version of Theorem 4.4 in which the benchmark algorithm is allowed to switch arms a few times [3]. Essentially, the first approach is too primitive, while the second one makes overly pessimistic assumptions about the environment. In both cases we obtain guarantees of the form ~ O(k av ) , < 2 , which are inferior to Theorem 4.5. 3 5 Extensions Recall that the state evolution of arm i in the dynamic MAB problem is described by (2), where the i.i.d. increments µi (t) are distributed with respect to some fixed distribution Xi . Can we relax the assumption that Xi is normal? Definition 5.1. Random variable X is stochastically (, )bounded if its moment-generating function satisfies E [er(X -E [x]) ] er 2 should look like a weighted sum of contributions from different arms, where the weights depend (perhaps in rather complicated way) on the respective fundamental intervals. To illustrate this point, we worked out the guarantees for the two algorithms discussed in Section 4, see Appendix A for details. It is an open question to derive similar closed-form guarantees for the other algorithms in this paper. Recall that in all our results we assumed that the volatilities are known to the algorithm. In fact, this assumption is not necessary: we are interested in the stationary performance of our algorithms and, as it turns out, we can afford to learn the static parameters of the model. Roughly, the argument goes as follows. It suffices for our analysis if for each arm an algorithm knows a 2-approximate upper bound on volatility i , rather than the exact value. One can learn such bound by playing arm i for O(log2 i ) rounds, with failure - probability as low as O(i 10 ), and repeat this learning phase -1 every i rounds (we omit the details). Acknowledgments. The first author would like to thank Bobby Kleinberg for many stimulating conversations about multi-armed bandits. 2 /2 for |r| . References [1] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Machine Learning Research, 3:397­422, 2002. Preliminary version in 41st IEEE FOCS, 2000. [2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(23):235­256, 2002. Preliminary version in 15th ICML, 1998. [3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48­77, 2002. Preliminary version in 36th IEEE FOCS, 1995. [4] B. Awerbuch and R. D. Kleinberg. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In 36th ACM Symp. on Theory of Computing (STOC), pages 45­53, 2004. [5] D. A. Berry and B. Fristedt. Bandit problems: sequential allocation of experiments. Chapman and Hall, 1985. [6] D. Bertsimas and J. Nino-Mora. 
Conservation laws, extended polymatroids and multi-armed bandit problems: A unified polyhedral approach. Math. of Oper. Res, 21(2):257­306, 1996. [7] D. Bertsimas and J. Nino-Mora. Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Operations Research, 48(1):80­90, 2000. [8] V. Dani and T. P. Hayes. How to beat the adaptive multiarmed bandit. Technical report. Available from arXiv at http://arxiv.org/cs.DS/0602053, 2006. [9] V. Dani and T. P. Hayes. Robbing the bandit: less regret in online geometric optimization against an adaptive adversary. In 17th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 937­943, 2006. [10] A. Flaxman, A. Kalai, and H. B. McMahan. Online Convex Optimization in the Bandit Setting: Gradient Descent without a Gradient. In 16th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 385­394, 2005. [11] J. C. Gittins. Bandit processes and dynamic allocation indices (with discussion). J. Roy. Statist. Soc. Ser. B, 41:148­177, 1979. [12] J. C. Gittins. Multi-Armed Bandit Allocation Indices. John Wiley & Sons, 1989. This is precisely the condition needed to establish an Azumatype inequality: if S is the sum of t independent stochastically (, )-bounded random vari les with zero mean, then ab ~ ( t). Specifically, for any with high robability S O p 1 t we have 2 S Pr > t exp(-2 /2). (41) Note that a normal distribution N (0, ) is (, )-bounded, and any distribution with support [-, ] is (1, )-bounded. We can recover all of our algorithmic results if we assume that each distribution Xi has zero mean and is stochastically (, i )-bounded for some i , where > 0 is a fixed absolute constant. We re-define the volatility of arm i as the infimum of all such that Xi is (, )-bounded. It is appealing to tackle a more general setting when the only restriction on each distribution Xi is that it has mean 2 0 and variance i . We can extend our analysis (at the cost of somewhat weaker guarantees) if we further assume that, essentially, the absolute third moment of Xi is comparable 3 to i . Then instead of (41) we can use a weaker inequality called the non-uniform Berry-Esseen theorem [23]: ( t , Pr O i )3 t1-3 (42) s=1 µi (s) > i t i for any > 1/2, where 3 = E [ |µi (s)|3 ]. We omit further i discussion of this extension from the present version. Let us discuss one other direction in which our setting can be generalized. Recall that in the dynamic MAB problem the state of each arm evolves on the same interval I = [0; 1] (see Section 1.1) which we term the fundamental interval. What if we allow each arm to have a distinct fundamental interval? All our algorithms fit this extended setting with little or no modification. The performance guarantees [13] J. C. Gittins and D. M. Jones. A dynamic allocation index for the sequential design of experiments. In J. G. et al., editor, Progress in Statistics, pages 241­266. North-Holland, 1974. [14] S. Guha and K. Munagala. Approximation algorithms for partial-information based stochastic control with Markovian rewards. In 48th Symp. on Foundations of Computer Science (FOCS), 2007. [15] S. Guha, K. Munagala, and P. Shi. On Index Policies for Restless Bandit Problems. arXiv:0711.3861v1 [cs.DS], 2007. [16] F. Heidari, S. Mannor, and L. Mason. Reinforcement learning-based load shared sequential routing. In IFIP Networking, 2007. [17] R. D. Kleinberg. Nearly tight bounds for the continuumarmed bandit problem. In 18th Advances in Neural Information Processing Systems (NIPS), 2004. 
Full version appeared as Chapters 4-5 in [18]. [18] R. D. Kleinberg. Online Decision Problems with Large Strategy Sets. PhD thesis, MIT, Boston, MA, 2005. [19] R. D. Kleinberg. Anytime algorithms for multi-armed bandit problems. In 17th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 928­936, 2006. Full version appeared as Chapter 6 in [18]. [20] R. D. Kleinberg and F. T. Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In 44th Symp. on Foundations of Computer Science (FOCS), pages 594­605, 2003. [21] T. Lai and H. Robbins. Asymptotically efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 6:4­22, 1985. [22] H. B. McMahan and A. Blum. Online Geometric Optimization in the Bandit Setting Against an Adaptive Adversary. In 17th Conference on Learning Theory (COLT), pages 109­ 123, 2004. [23] K. Neammanee. On the constant in the nonuniform version of the Berry-Esseen theorem. Intl. J. of Mathematics and Mathematical Sciences, 2005:12:1951­1967, 2005. [24] J. Nino-Mora. Restless bandits, partial conservation laws and indexability. Advances in Applied Probability, 33:76­98, 2001. [25] S. Pandey, D. Agarwal, D. Chakrabarti, and V. Josifovski. Bandits for Taxonomies: A Model-based Approach. In SIAM Intl. Conf. on Data Mining (SDM), 2007. [26] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of optimal queueing network control. In Structure in Complexity Theory Conference, pages 318­322, 1994. [27] H. Robbins. Some Aspects of the Sequential Design of Experiments. Bull. Amer. Math. Soc., 58:527­535, 1952. [28] J. S. Rosenthal. Markov chain convergence: From finite to infinite. Stochastic Processes Appl., 62(1):55­72, 1996. [29] R. K. Sundaram. Generalized Bandit Problems. In D. AustenSmith and J. Duggan, editors, Social Choice and Strategic Decisions: Essays in Honor of Jeffrey S. Banks (Studies in Choice and Welfare), pages 131­162. Springer, 2005. First appeared as Working Paper, Stern School of Business, 2003. [30] J. N. Tsitsiklis. A short proof of the Gittins index theorem. Annals of Applied Probability, 4(1):194­199, 1994. [31] G. Weiss. Branching bandit processes. Probab. Engng. Inform. Sci., 2:269­278, 1988. [32] P. Whittle. Arm acquiring bandits. Ann. Probab., 9:284­292, 1981. [33] P. Whittle. Restless bandits: Activity allocation in a changing world. J. of Appl. Prob., 25A:287­298, 1988. which we term the fundamental interval. In this section we consider a generalization in which we allow each arm to have a distinct fundamental interval. We work out the guarantees for the two algorithms discussed in Section1.4. The main contribution of this appendix is that we find a way to upper-bound the steady-state regret of the respective algorithms in terms of reasonably defined averages of the arms' properties. The actual derivations are rather tedious but not that illuminating; we omit them from this version. A.1 The setting and notation We consider the following setting. There are k arms. Each arm has volatility i and fundamental interval [ai ; bi ]. Without loss of generality we assume that b1 . . . bk and that max ai < min bi . (If the latter fails then we can always ignore the arm with the smallest upper boundary bi .) To sim1 plify the derivation we assume that max i 3 . Define the weight of arm i as wi = l k bi - al , bl - al =i Define the average volatility av by i 2 [k] wi (bi - ai ) i 2 i av = [k] wi (bi - ai ) Define the average length as i 1 dav = k [k] wi (bi - ai ). 
To see that the quantities we defined above are reasonable as averages, note that if all arms have the same fundamental interval [a; b] then all weights are 1 and dav = b - a and, moreover, the average volatility av coincides with the one defined in the body of the paper. A.2 Results We present two results that extend, respectively, Theorem 4.2 and Theorem 4.5 to the setting from Section A.1. In both cases the algorithms are exactly the same. The main tool is a version of Lemma 37, where the guarantee (37) looks exactly the same in our notation, except the right-hand side is multiplied by dav . Theorem A.1. Consider the deterministic dynamic MAB problem in the setting from Section A.1. Let amin = min ai . Then for phase length ( - - m = av1 bk - amin )/ log av1 the greedy algorithm has steady-state regret ( - O(k av ) bk - amin ) dav log av1 . Theorem A.2. Consider the state-informed dynamic MAB problem in the setting from Section A.1. Then there exists m such that algorithm E X P 3(m) has steady-state regret - O(dav )1/3 (k av log av1 )2/3 . A Distinct fundamental intervals Recall that in the dynamic MAB problem the state of each arm evolves on the same interval I = [0; 1] (see Section 1.1)