Piecewise-stationary Bandit Problems with Side Observations

Jia Yuan Yu (jia.yu@mcgill.ca), Department of Electrical and Computer Engineering, McGill University, Montréal, Québec, Canada.
Shie Mannor (shie.mannor@mcgill.ca; shie@ee.technion.ac.il), Department of Electrical and Computer Engineering, McGill University, Montréal, Québec, Canada; Department of Electrical Engineering, Technion, Haifa, Israel.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We consider a sequential decision problem where the rewards are generated by a piecewise-stationary distribution. However, the different reward distributions are unknown and may change at unknown instants. Our approach uses a limited number of side observations on past rewards, but does not require prior knowledge of the frequency of changes. In spite of the adversarial nature of the reward process, we provide an algorithm whose regret, with respect to the baseline with perfect knowledge of the distributions and the changes, is O(k log(T)), where k is the number of changes up to time T. This is in contrast to the case where side observations are not available, and where the regret is at least Ω(√T).

1. Introduction

In some learning scenarios, the agent is confronted with an adversarial opponent that can be very general and difficult to model, and is therefore modelled as an arbitrary non-stochastic process. In other scenarios, the opponent is stochastic, and can be characterized and adapted to. What about opponents that fall between these two extremes?

An instance of the adversarial scenario is the expert problem (Littlestone & Warmuth, 1994), where the agent observes sequentially the performance of a number of experts, and (choosing one expert at each time step) tries to match the performance of the best expert in retrospect. An instance of the stochastic scenario is the multi-armed bandit problem (Lai & Robbins, 1985), where each of n arms has a fixed reward distribution, and where the agent tries to obtain the performance of the best arm by picking and observing one arm at each time step, without observing the reward of any other arm.

We consider a model that combines the bandit and expert models, and shall refer to the arms of the bandit and the experts interchangeably. The reward process of the arms is non-stationary on the whole, but stationary on intervals. This piecewise-stationary reward process is similar to that of the non-stationary bandit problem of (Hartland et al., 2006; Garivier & Moulines, 2008), or that of the multiple change-point detection problem of (Akakpo, 2008). In our variant of the non-stationary bandit problem, the agent has the benefit of querying and observing some of the past outcomes of arms that have not been picked. This is the same benefit available to the agent in the expert problem (cf. Herbster & Warmuth, 1998). The following examples motivate our model.

Example 1.1 (Investment options). Consider the problem of choosing every day one of n investment options, say mutual funds. Our model assumes that the outcomes of these investments undergo changes reflecting changes in market conditions. Otherwise, the outcomes remain stationary over the periods between two changes, e.g., they follow bearish or bullish trends. Suppose that the outcomes of the previous day's investment options are revealed today, e.g., in the newspaper. Suppose also that observing the outcome of each option requires a query (looking up a price history), which incurs a querying cost. By limiting the number of queries allowed at each step, we can model the trade-off between the cost of observations and the regret due to insufficient observations.

Example 1.2 (Dynamic pricing with feedback). As a second example, we consider a vendor whose task is to sell commodity X. Potential customers arrive sequentially, one after the other, and the demand for commodity X (at various prices) is modelled as a stationary process that may nonetheless change abruptly at unknown instants. To each customer, the vendor offers one of n possible prices. If the customer accepts, a corresponding profit is made. Bargaining is not an option, but after each transaction, the vendor has the leisure to ask the customer whether the outcome would have been different had a different price been offered (e.g., through a short survey). A first goal is to achieve as much profit as if the distribution of the demand were known at all times (even though unknown changes occur at unknown instants). A second goal is to minimize the cost associated with conducting surveys for feedback. A similar problem of dynamic pricing with partial monitoring is described in (Cesa-Bianchi & Lugosi, 2006).

We present the setting in Section 2, followed by a survey of related works in Section 3. We present a solution and its guarantee in Section 4. In Section 5, we compare our solution with other solutions via simulation. In Section 6, we conclude with a discussion.
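Before formalizing the model, the short Python sketch below (our illustration, not code from the paper) shows the kind of reward source the examples above describe: a small set of arms whose Bernoulli reward means stay constant between change-points and jump at unknown instants. All function names and parameter values here are illustrative assumptions.

```python
import numpy as np

def piecewise_stationary_rewards(means_per_segment, change_points, horizon, seed=0):
    """Generate a (horizon x n) matrix of Bernoulli rewards.

    means_per_segment: one mean vector per stationary segment.
    change_points: sorted segment start times; the first entry must be 1.
    """
    rng = np.random.default_rng(seed)
    n = len(means_per_segment[0])
    rewards = np.empty((horizon, n))
    segment = 0
    for t in range(1, horizon + 1):
        # Advance to the next segment when a change-point is reached.
        if segment + 1 < len(change_points) and t >= change_points[segment + 1]:
            segment += 1
        rewards[t - 1] = rng.binomial(1, means_per_segment[segment])
    return rewards

# Example: 4 arms, two abrupt changes at t = 3000 and t = 7000 (illustrative values).
means = [[0.5, 0.6, 0.8, 0.4], [0.9, 0.6, 0.5, 0.4], [0.5, 0.6, 0.5, 0.85]]
b = piecewise_stationary_rewards(means, change_points=[1, 3000, 7000], horizon=10000)
print(b.shape)  # (10000, 4)
```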
2. Setting

We consider the following sequential decision problem. Let {A_1, ..., A_n} denote the n arms of a multi-armed bandit, or the n experts of an online learning problem. Let b_1, b_2, ... be a sequence of reward vectors in R^n. The element b_t(i) of b_t, for i = 1, ..., n and t = 1, 2, ..., represents the reward associated with the i-th arm A_i at time t. With an abuse of notation, we shall write b_t(A_i) interchangeably with b_t(i). We assume that the rewards take values in the unit interval [0, 1], i.e., b_t(i) ∈ [0, 1] for all i and t.

2.1. Reward Process

In our model, the source of rewards is piecewise-stationary: it changes its distribution arbitrarily and at arbitrary time instants, but otherwise remains stationary. The reward process b_1, b_2, ... is an independent sequence of random variables that undergoes abrupt changes in distribution at unknown time instants ν_1, ν_2, ..., which are called change-points. By convention, we let ν_1 = 1. Let f_t denote the distribution (probability density function) of b_t. Hence, b_1, ..., b_{ν_2 - 1} are i.i.d. with common distribution f_1, as is the case for stochastic learning problems (cf. Lai & Robbins, 1985). Likewise, b_{ν_j}, b_{ν_j + 1}, ..., b_{ν_{j+1} - 1} are i.i.d. with distribution f_j, for j = 1, 2, .... The intervals are illustrated as follows:

    \underbrace{b_1, \ldots, b_{\nu_2 - 1}}_{\text{distribution } f_1}, \; \underbrace{b_{\nu_2}, \ldots, b_{\nu_3 - 1}}_{\text{distribution } f_2}, \; \ldots, \; \underbrace{b_{\nu_j}, \ldots, b_{\nu_{j+1} - 1}}_{\text{distribution } f_j}, \; \ldots

Similarly to adversarial learning problems (cf. Cesa-Bianchi & Lugosi, 2006), both the change-points ν_1, ν_2, ... and the distributions f_1, f_2, ... are unknown. We can think of an opponent deciding the time instants (and frequency) of the changes, as well as the distribution after each change.

Remark 1. It is important that the changes occur at arbitrary instants. Otherwise, we would only need to reset an algorithm for the multi-armed bandit problem at the appropriate instants.

The model of piecewise-stationary rewards combines two important models. If there are no changes, we recover the stochastic source of the multi-armed bandit problem. If there is no constraint on the number of changes, we obtain the source of rewards adopted by the adversarial model of prediction with expert advice. We consider the interesting case where the frequency of changes is between these two extremes, i.e., where the number of change-points

    k(T) \triangleq \sum_{t=1}^{T-1} \mathbf{1}[f_t \neq f_{t+1}]

up to time T increases with T. To simplify notation, we shall simply write k in place of k(T).

2.2. Decision-maker

At each time step t > 1, the agent picks an arm a_t ∈ {A_1, ..., A_n} and makes ℓ (where 1 ≤ ℓ ≤ n) observations on the individual arm-rewards b_{t-1}(1), ..., b_{t-1}(n). This is captured in the following assumption.

Assumption 2.1 (Partial observation). At time 1, the agent chooses an action a_1 and an ℓ-subset S_1 of the arms {A_1, ..., A_n}. At every time step t > 1, the agent chooses (deterministically) an ℓ-subset S_t and takes an action a_t that is a function of the reward observations {b_j(i) | j = 1, ..., t - 1, A_i ∈ S_j}.

Partial observation allows us to capture querying costs associated with observations, and to quantify the total query budget.

2.3. Notion of Regret

At each time instant t, the agent chooses and activates an arm a_t ∈ {A_1, ..., A_n} and receives the corresponding reward b_t(a_t). Let μ_t denote the mean of the reward vector b_t. The agent's baseline, or objective, is the reward accumulated by picking at each instant t an arm with the maximal expected reward. Recalling that k denotes the number of changes in reward distribution up to time T, the baseline is

    \sum_{t=1}^{T} \max_{i=1,\ldots,n} \mu_t(i) = \max_{\alpha_1,\ldots,\alpha_T:\, k \text{ changes}} \; \mathbb{E}\left[\sum_{t=1}^{T} b_t(\alpha_t)\right],

where the maximum is taken over sequences of arms with only as many changes as there are change-points in the reward sequence b_1, ..., b_T, i.e., over the set {α_1, ..., α_T | α_{ν_j} = α_{ν_j + 1} = ... = α_{ν_{j+1} - 1} for j = 1, ..., k}.

Despite the appearance, this objective is reasonable when the number of changes k is small; it is also the same objective as in the classical stochastic multi-armed bandit problems. Hence, for a given reward process b_1, b_2, ..., we define the expected regret of the agent at time T as

    R_T \triangleq \sum_{t=1}^{T} \max_{i=1,\ldots,n} \mu_t(i) - \mathbb{E}\left[\sum_{t=1}^{T} b_t(a_t)\right],    (1)

where the expectation E is taken with respect to the sequence b_1, b_2, ....
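As a concrete reading of definition (1), the sketch below (our illustration, not code from the paper) computes the baseline Σ_t max_i μ_t(i) and the resulting regret for one recorded trajectory, given the sequence of mean vectors and the arms actually played; the single-run reward sum stands in for the expectation.

```python
import numpy as np

def expected_regret(mu, rewards_collected):
    """Regret in the sense of (1), estimated from a single trajectory.

    mu: (T x n) array, mu[t, i] is the mean reward of arm i at time t.
    rewards_collected: length-T array of rewards actually received.
    """
    baseline = mu.max(axis=1).sum()                    # sum_t max_i mu_t(i)
    collected = np.asarray(rewards_collected).sum()    # estimate of E[sum_t b_t(a_t)]
    return baseline - collected

# Toy example with one change after t = 2: arm 0 is best, then arm 1.
mu = np.array([[0.9, 0.1], [0.9, 0.1], [0.2, 0.7], [0.2, 0.7]])
rewards = [1, 1, 0, 1]                                 # one possible realization
print(expected_regret(mu, rewards))                    # baseline 3.2 minus 3 collected, about 0.2
```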
3. Related Works

In this section, we survey results concerning related models. The different models are distinguished by the source of the reward process, the observability of the rewards, and the baseline for the notion of regret.

3.1. Stochastic Multi-armed Bandit

In stochastic multi-armed bandit problems (Lai & Robbins, 1985; Auer et al., 2002a), the reward sequence b_1, b_2, ... is a sequence of i.i.d. random vectors from a common unknown distribution, so that μ_1 = μ_2 = .... The reward observations are limited to the rewards b_1(a_1), b_2(a_2), ... corresponding to the arms chosen by the agent. This invites the agent to trade off exploring the different arms to estimate their distributions against exploiting the arms with the highest empirical reward. The notion of regret is the same as ours (1). However, the optimal reward of the baseline can be obtained by a single fixed arm. In such problems, the optimal expected regret is of the order of O(n log(T)), which may be obtained by a number of algorithms, e.g., (Lai & Robbins, 1985; Auer et al., 2002a; Kocsis & Szepesvári, 2006).

3.2. Adversarial Expert Problem

Many learning problems take the adversarial setting, e.g., prediction with expert advice; see (Cesa-Bianchi & Lugosi, 2006) for a comprehensive review. The sequence of rewards achieved by the experts is arbitrary; i.e., no assumption is made regarding the joint distribution of b_1, b_2, .... This approach essentially makes provisions for the worst-case sequence of rewards. At time t, the past reward vectors b_1, ..., b_{t-1} are observable by the agent. In this case, the notion of adversarial regret is adopted, whose baseline is the reward accumulated by the best fixed expert, i.e., max_{i=1,...,n} Σ_{t=1}^{T} b_t(i). For every sequence b_1, b_2, ..., the (expected) adversarial regret

    \max_{i=1,\ldots,n} \sum_{t=1}^{T} b_t(i) - \mathbb{E}\left[\sum_{t=1}^{T} b_t(a_t)\right]

is of the order of O(√(T log(n))); see (Cesa-Bianchi & Lugosi, 2006) for a detailed account. A similar bound holds when the observations are limited to the chosen arms b_1(a_1), b_2(a_2), ... (Auer et al., 2002b).

The baseline in the adversarial case is limited to a single fixed expert, whereas our baseline in (1) is the optimal expected reward. Our baseline, which contains as many switches as there are changes in distribution, is similar to the baseline defined by appropriately chosen shifting policies in (Herbster & Warmuth, 1998). The fixed-share algorithm or one of its variants (Herbster & Warmuth, 1998; Auer et al., 2002b) can be applied to our setting, if the number of changes k(T) is given in advance, yielding a regret of O(√(nkT log(nT))). We present an algorithm with a regret of O(nk log(T)) without prior knowledge of k(T). It should be noted that when k(T) is of the same order as T, it is hopeless to minimize the regret of (1): consider an adversary that picks a new distribution after each change-point.
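To make the difference between the two baselines discussed above concrete, the following sketch (our illustration, with arbitrary numbers) compares the best-fixed-arm baseline of the adversarial setting with the per-segment baseline of (1) on the same matrix of mean rewards; the piecewise baseline is never smaller.

```python
import numpy as np

def fixed_arm_baseline(mu):
    """Best single arm in hindsight: max_i sum_t mu[t, i]."""
    return mu.sum(axis=0).max()

def piecewise_baseline(mu):
    """Baseline of (1): best arm at every step, sum_t max_i mu[t, i]."""
    return mu.max(axis=1).sum()

# Two arms, one change half-way through the horizon (illustrative numbers).
mu = np.vstack([np.tile([0.8, 0.3], (500, 1)),
                np.tile([0.2, 0.9], (500, 1))])
print(fixed_arm_baseline(mu))   # 600.0 (arm 2: 0.3*500 + 0.9*500)
print(piecewise_baseline(mu))   # 850.0 (0.8*500 + 0.9*500)
```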
3.3. Non-stationary Bandits

Our problem is reminiscent of the non-stationary bandit problem of (Hartland et al., 2006; Garivier & Moulines, 2008). The reward process and the notion of regret are defined similarly, as in Section 2. However, in those works, observation of the past rewards is limited to the chosen arms; hence, at time t, the agent's choice a_t is a function of b_1(a_1), b_2(a_2), .... Using a statistical change detection test, Hartland et al. present a partial solution for instances where the best arm is not superseded by another arm following a change. In the event that an oracle reveals a priori the number of changes k(T) up to time T, Garivier and Moulines provide solutions that achieve a regret of O(n√(kT log(T))); a lower bound of Ω(√T) is also shown. With respect to the above non-stationary bandit model, the distinguishing feature of our model is that, in addition to activating an arm at each time instant, the agent may query the past reward of one or more arms. We show that with ℓT queries in total, the regret is bounded by O(nk log(T)). Hence, queries significantly reduce the regret with respect to the results of (Garivier & Moulines, 2008).

3.4. Change Detection

Another problem related to ours is that of fault detection-isolation (Lai, 2001). In that problem, the goal is to detect the change and to classify the post-change distribution within a finite set of possibilities. In our problem, the distribution after the change can be arbitrarily different. Moreover, we do not classify; instead, we apply a minimum-regret learning algorithm. In contrast to the change-detection literature, we consider a more complex setting, where the sequence b_1, b_2, ... may have multiple (arbitrarily many) changes. The problem of joint detection of multiple change-points is addressed in (Akakpo, 2008) and references therein.

4. Multi-armed Bandits with Queries

In this section, we present an algorithm for our setting and provide its performance guarantee. We begin with two assumptions. We shall use as a component of our solution a typical multi-armed bandit algorithm, described in the first assumption. The second assumption describes a limitation of our algorithm.

Assumption 4.1 (MAB algorithm for k = 1). Consider a multi-armed bandit where there are no distribution changes (except at time 1). Let the i.i.d. reward sequence b_1, b_2, ... have a common distribution with mean vector μ. Let A_{i(1)} and A_{i(2)} denote, respectively, the arms with the highest and the second-highest mean. Let Δ denote their mean difference: Δ = μ(i(1)) − μ(i(2)). Let A be an algorithm that guarantees a regret of at most Cn log(T)/Δ, for some constant C. At each step t > 1, algorithm A receives as input the reward b_{t-1}(a_{t-1}) obtained in the previous step, and outputs a new arm choice a_t. Examples of candidate algorithms include those of (Lai & Robbins, 1985; Auer et al., 2002a).

In this paper, we are concerned with detecting abrupt changes bounded from below by some threshold; we exclude infinitesimal changes in the following assumption.

Assumption 4.2. Recall that μ_j(i) and μ_{j+1}(i) denote the pre-change and post-change mean rewards of the arm A_i at the change-point ν_{j+1}. There exists a known value ε > 0 such that, for each j = 1, 2, ..., there exists an arm A_i such that |μ_j(i) − μ_{j+1}(i)| > 2ε.

4.1. The WMD Algorithm

Our algorithm (Figure 1) detects changes in the mean of a process, in the spirit of statistical methods for detecting an abrupt change of distribution in an otherwise i.i.d. sequence of random variables (see (Lai, 2001) for a survey). The algorithm partitions the time horizon into intervals of equal length τ. Hence, for m = 1, 2, ..., the m-th interval comprises the time instants (m − 1)τ + 1, (m − 1)τ + 2, ..., mτ. The algorithm iteratively computes empirical mean vectors \bar{b}^1, \bar{b}^2, ... over intervals (windows) of length τ, in the following fashion:

    \underbrace{b_1, \ldots, b_{\tau}}_{\bar{b}^1}, \; \underbrace{b_{\tau+1}, \ldots, b_{2\tau}}_{\bar{b}^2}, \; \ldots, \; \underbrace{b_{(m-1)\tau+1}, \ldots, b_{m\tau}}_{\bar{b}^m}, \; \ldots

The algorithm follows a multi-armed bandit algorithm A with a regret guarantee in the absence of changes (Assumption 4.1). When it detects a mean shift with respect to a threshold given by Assumption 4.2, it resets the sub-algorithm A.

4.2. WMD Regret

The following theorem bounds the expected regret of the WMD algorithm.

Theorem 4.1 (WMD regret). Suppose that Assumption 2.1 holds. Suppose that the agent employs the WMD algorithm with a sub-algorithm satisfying Assumption 4.1, a threshold ε satisfying Assumption 4.2, and intervals of length τ = (n/ℓ)⌈log(T)/(2ε²)⌉. Then, for every sequence of change-points ν_1, ν_2, ... and every choice of post-change distributions f_1, f_2, ..., the expected regret is bounded as follows:

    R_T \leq \frac{C}{\Delta} kn\log(T) + \frac{7}{2\ell\epsilon^2} kn\log(T) + \frac{6C\ell\epsilon^2}{\Delta} n,    (2)

where C and Δ are as in Assumption 4.1.

Remark 2. The WMD algorithm does not require prior knowledge of the number of distribution changes k(T).
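Assumption 4.1 only asks for a standard stochastic bandit routine to play the role of the sub-algorithm A; a UCB1-style index policy (Auer et al., 2002a) is one such candidate. The sketch below is a minimal illustrative implementation (our code, not the paper's), kept stateful so that an outer change-detection loop can reset it.

```python
import math

class UCB1:
    """Minimal UCB1-style index policy; a candidate sub-algorithm A."""

    def __init__(self, n_arms):
        self.n = n_arms
        self.reset()

    def reset(self):
        # Forget all statistics, as required after a detected change.
        self.counts = [0] * self.n
        self.sums = [0.0] * self.n
        self.t = 0

    def select_arm(self):
        self.t += 1
        # Play each arm once before using the index.
        for i in range(self.n):
            if self.counts[i] == 0:
                return i
        # Upper-confidence index: empirical mean plus an exploration bonus.
        def index(i):
            mean = self.sums[i] / self.counts[i]
            return mean + math.sqrt(2.0 * math.log(self.t) / self.counts[i])
        return max(range(self.n), key=index)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
```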
Remark 3 (Query-regret trade-off). The bound of Theorem 4.1 indicates a way to trade off the number of queries per step against the expected regret per step. Suppose that an increasing function C_Q assigns a cost, in the same unit as the rewards and the regret, to the rate of queries ℓ. The corresponding new objective thus becomes the sum of two components: query cost and regret. This overall expected cost per step at time T is C_Q(ℓ) + R_T/T. With the implicit assumption that the bound (2) is tight in the duration T, the number of changes k, and the query rate ℓ, this cost can be optimized with respect to ℓ. If each query is assigned a constant cost c_q, i.e., C_Q(ℓ) = c_q · ℓ, then the (non-discrete) optimal query rate is ℓ* = √(7kn log(T)/(2ε² c_q T)). This is the type of optimization problem that has to be resolved in Example 1.2.

The WMD algorithm is summarized in Figure 1.

Input: interval length τ > 0, threshold ε > 0, and ℓ queries per step. Initialize r := 1. At each step t:
1. (Follow A.) Follow the action of an algorithm A satisfying Assumption 4.1.
2. (Querying policy.) If t belongs to the m-th interval but is not its first step, i.e., if t ∈ {(m − 1)τ + 2, ..., mτ}, let q_{t-1}(i) denote the number of queries arm A_i has received from the start of the m-th interval up to step t − 1. Order the arms {A_1, ..., A_n} according to q_{t-1}(1), ..., q_{t-1}(n). Query the set S_t of the ℓ arms that have received the fewest queries. Update the following elements of the empirical mean \bar{b}^m:
    \bar{b}^m(i) := \frac{q_{t-1}(i)\, \bar{b}^m(i) + b_{t-1}(i)}{q_{t-1}(i) + 1}, for every i ∈ S_t.
3. (Detect change.) At the start of the m-th interval, i.e., if t = (m − 1)τ + 1 for some m = 3, 4, ...: if ∥\bar{b}^m − \bar{b}^r∥_∞ > ε, reset (i.e., re-instantiate) algorithm A and set r := m.
The index r denotes the last interval at which the algorithm A was reset.

Figure 1. Windowed mean-shift detection (WMD) algorithm.
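The sketch below is one possible reading of Figure 1 in Python (our reconstruction, with illustrative parameter values and a generic sub_algorithm object such as the UCB1 sketch above): it follows the sub-algorithm, spreads ℓ queries per step over the least-queried arms of the current window, and resets the sub-algorithm when the windowed mean drifts from the reference mean by more than the threshold.

```python
import numpy as np

def wmd(reward_matrix, sub_algorithm, tau, epsilon, ell):
    """Windowed mean-shift detection wrapper around a bandit sub-algorithm.

    reward_matrix: (T x n) array; row t holds b_t(1), ..., b_t(n).
    tau: interval (window) length; epsilon: detection threshold; ell: queries per step.
    """
    T, n = reward_matrix.shape
    window_sum = np.zeros(n)   # running sums for the current window's mean
    window_cnt = np.zeros(n)   # per-arm query counts in the current window
    reference = None           # mean of the last window completed after a reset
    collected = 0.0

    for t in range(T):
        # Step 3 (detect change): at a window boundary, compare the completed
        # window mean to the reference mean and reset A if they differ by > epsilon.
        if t > 0 and t % tau == 0 and window_cnt.min() > 0:
            current = window_sum / window_cnt
            if reference is None:
                reference = current
            elif np.max(np.abs(current - reference)) > epsilon:
                sub_algorithm.reset()
                reference = current
            window_sum[:] = 0.0
            window_cnt[:] = 0.0

        # Step 1 (follow A): play the arm chosen by the sub-algorithm.
        arm = sub_algorithm.select_arm()
        reward = reward_matrix[t, arm]
        sub_algorithm.update(arm, reward)
        collected += reward

        # Step 2 (querying policy): query the previous step's rewards of the
        # ell least-queried arms and fold them into the window mean.
        if t >= 1:
            queried = np.argsort(window_cnt)[:ell]
            window_sum[queried] += reward_matrix[t - 1, queried]
            window_cnt[queried] += 1

    return collected
```

With the piecewise-stationary Bernoulli source and the UCB1 class sketched earlier, a call such as wmd(b, UCB1(4), tau=200, epsilon=0.3, ell=1) reproduces the flavour of the WMD-UCB experiments of Section 5; the parameter values are ours, not the paper's.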
Proof of Theorem 4.1. The proof is composed of five steps. In the first step, we identify the components of the regret. In the second step, we analyze the empirical means computed by the WMD algorithm. In the subsequent steps, we bound the components of the regret. The components are combined in the final step.

Step 1 (Components of the regret). Let L(T) denote the expected number of intervals that elapse after a change-point ν_j occurs before it is detected by the WMD algorithm (i.e., before algorithm A is reset). Let N(T) be the expected number of false detections up to time T, i.e., instances when the algorithm A resets although no change-point has occurred since the last time A was reset. Observe that the total number of times the algorithm resets is bounded from above by k + N(T). Hence, over a T-step horizon, there are at most k + N(T) periods during which the algorithm resets once and the source distribution does not change. By Assumption 4.1, during each such period, the expected regret is of the order of Cn log(λ)/Δ, where λ is the length of the period. Since log is a concave function and the period lengths sum to at most T, the expected regret over all such periods is at most (Cn/Δ)(k + N(T)) log(T). The algorithm may also incur regret during the delay between a distribution change and its detection: there are k distribution changes, each detected at most L(T) intervals after it occurs. Hence, the total regret of this algorithm is at most

    k\tau(L(T) + 1) + k\tau + \frac{Cn}{\Delta}(k + N(T)) \log(T).    (3)

The additional term kτ accounts for the regret within the intervals during which a change occurs. Next, we bound N(T) and L(T), starting with N(T). For the remainder of the proof, we consider only the intervals that do not contain a distribution change.

Step 2 (Empirical means). Consider the empirical mean over each interval that does not contain a change. Let δ = ⌊τℓ/n⌋ = ⌈log(T)/(2ε²)⌉. Observe that, by the construction of the WMD algorithm, by the end of the m-th interval, spanning time steps (m − 1)τ + 1, ..., mτ, every arm has been queried either δ or δ + 1 times. In the former case, the empirical mean for an arm A_i is the mean of δ i.i.d. random variables:

    \bar{b}^m(i) = \frac{1}{\delta} \sum_{t=(m-1)\tau+1}^{m\tau} b_t(i) \cdot \mathbf{1}[i \in S_t].

The expression for the latter case (of δ + 1 queries) is similar, and is omitted in this proof.

Step 3 (Number of false detections). Suppose that the reward sequences during the m-th and r-th intervals are generated from the same distribution, with expected value denoted, with an abuse of notation, by μ_m. Observe that

    N(T) \leq \lceil T/\tau \rceil \, P(\|\bar{b}^m - \bar{b}^r\|_\infty > \epsilon) \leq n \lceil T/\tau \rceil \max_{i=1,\ldots,n} P(|\bar{b}^m(i) - \bar{b}^r(i)| > \epsilon),

since there are at most ⌈T/τ⌉ intervals. Observe that, for every i = 1, ..., n,

    P(|\bar{b}^m(i) - \bar{b}^r(i)| > \epsilon)
      = P(|\bar{b}^m(i) - \bar{b}^r(i)| > \epsilon, \; |\bar{b}^m(i) - \mu_m(i)| \leq \epsilon) + P(|\bar{b}^m(i) - \bar{b}^r(i)| > \epsilon, \; |\bar{b}^m(i) - \mu_m(i)| > \epsilon)
      \leq P(|\bar{b}^m(i) - \bar{b}^r(i)| > \epsilon \mid |\bar{b}^m(i) - \mu_m(i)| \leq \epsilon) + P(|\bar{b}^m(i) - \mu_m(i)| > \epsilon)
      \leq P(|\bar{b}^r(i) - \mu_m(i)| > 2\epsilon) + P(|\bar{b}^m(i) - \mu_m(i)| > \epsilon)    (4)
      \leq \exp(-8\delta\epsilon^2) + \exp(-2\delta\epsilon^2),

where the last inequality follows by Step 2 and Hoeffding's inequality (recall that \bar{b}^r(i) and \bar{b}^m(i) have the same distribution). Hence, we have

    N(T) \leq 2n \exp(-2\delta\epsilon^2)(T/\tau + 1).    (5)

Step 4 (Delay in change detection). Next, we bound L(T). Suppose that there is a reset at the s-th interval, so that the WMD algorithm successively compares the empirical means \bar{b}^{s+1}, \bar{b}^{s+2}, ... to \bar{b}^s. Suppose that the following change occurs during the (m − 1)-th interval. Let μ_m and μ_s denote the expected reward vectors during the m-th and s-th intervals, respectively. By the standard argument for the first occurrence of an event in an i.i.d. sequence (the expected waiting time is the reciprocal of the event probability), we have

    L(T) \leq \frac{1}{P(\|\bar{b}^m - \bar{b}^s\|_\infty > \epsilon)}.    (6)

Observe that, for every i = 1, ..., n,

    P(\|\bar{b}^m - \bar{b}^s\|_\infty > \epsilon) \geq P(|\bar{b}^m(i) - \bar{b}^s(i)| > \epsilon)
      \geq P(\bar{b}^s(i) - \mu_m(i) > 3\epsilon/2, \; |\bar{b}^m(i) - \mu_m(i)| \leq \epsilon/2)
      = P(\bar{b}^s(i) - \mu_m(i) > 3\epsilon/2) \, P(|\bar{b}^m(i) - \mu_m(i)| \leq \epsilon/2)
      \geq P(\bar{b}^s(i) - \mu_m(i) > 3\epsilon/2)\left(1 - e^{-\delta\epsilon^2/2}\right),    (7)

where the equality is due to independence, and the final inequality follows by Hoeffding's inequality. Next, we bound the first term of (7). Suppose, without loss of generality, that μ_s(i) > μ_m(i), and let γ denote μ_s(i) − μ_m(i); we obtain, for some i,

    P(\bar{b}^s(i) - \mu_m(i) > 3\epsilon/2) = P(\bar{b}^s(i) - \mu_s(i) + \gamma > 3\epsilon/2)
      = 1 - P(\bar{b}^s(i) - \mu_s(i) \leq 3\epsilon/2 - \gamma)
      \geq 1 - \exp(-2\delta(\gamma - 3\epsilon/2)^2)
      \geq 1 - \exp(-\delta\epsilon^2/2),    (8)

where the last two inequalities follow from the fact that γ = μ_s(i) − μ_m(i) > 2ε for some i, by Assumption 4.2. Hence, (6), (7) and (8) give

    L(T) \leq \left[1 - \exp(-\delta\epsilon^2/2)\right]^{-2}.    (9)

Step 5 (Tying up). By combining (3) with (5) and (9), we find that the expected regret is at most

    2k\tau + k\tau\left[1 - \exp(-\delta\epsilon^2/2)\right]^{-2} + \frac{Cn}{\Delta}\left(k + 2n\exp(-2\delta\epsilon^2)(T/\tau + 1)\right)\log(T),

from which (2) follows by substituting the values τ = (n/ℓ)⌈log(T)/(2ε²)⌉ and δ = ⌈log(T)/(2ε²)⌉.
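The two quantities controlled in Steps 3 and 4, the false-detection probability and the chance of missing a true shift, are easy to estimate by simulation. The sketch below (our illustration, with arbitrary parameter values) does so for a single Bernoulli arm observed δ times per window.

```python
import numpy as np

def windowed_test_statistics(mu_before, mu_after, delta, epsilon, runs=20000, seed=1):
    """Monte Carlo estimates for a single arm.

    Returns (false-alarm probability when the mean does not change,
             miss probability when the mean jumps to mu_after).
    """
    rng = np.random.default_rng(seed)
    # Reference window and a second window drawn from the same Bernoulli mean.
    ref = rng.binomial(delta, mu_before, size=runs) / delta
    same = rng.binomial(delta, mu_before, size=runs) / delta
    # A window drawn from the post-change mean.
    shifted = rng.binomial(delta, mu_after, size=runs) / delta
    false_alarm = np.mean(np.abs(same - ref) > epsilon)
    miss = np.mean(np.abs(shifted - ref) <= epsilon)
    return false_alarm, miss

# Gap of 2*epsilon as in Assumption 4.2, with delta ~ log(T)/(2 epsilon^2) for T = 10000.
eps = 0.15
delta = int(np.ceil(np.log(10000) / (2 * eps**2)))
print(windowed_test_statistics(0.5, 0.5 + 2 * eps, delta, eps))
```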
5. Simulations

In this section, we present an empirical comparison of the WMD algorithm with other algorithms for multi-armed bandit problems. As references, we consider two algorithms based on upper confidence bounds: the UCB1 algorithm of (Auer et al., 2002a), and the Discounted-UCB algorithm of (Kocsis & Szepesvári, 2006; Garivier & Moulines, 2008). For comparison purposes, we employ the UCB1 algorithm as the component A of the WMD algorithm. The resulting combination is referred to as the WMD-UCB algorithm.

For our setting, we take a bandit with 4 arms, whose rewards are piecewise-stationary with Bernoulli distributions. The sequence of expected rewards μ_t(i) for each arm A_i is illustrated in Figure 2. The Discounted-UCB algorithm is provided with prior knowledge of the number k of changes to come in the reward sequence, and its parameters are accordingly set to optimal values (Garivier & Moulines, 2008). Neither the UCB1 nor the WMD-UCB algorithm requires this prior information. However, the WMD-UCB algorithm has the privilege of querying the previous rewards of some of the arms.

Figure 2. Expected reward of the arms of a 4-armed bandit (one curve per arm A_1, ..., A_4; x-axis: time, y-axis: expected reward).

Figure 3. Average reward of UCB1, Discounted-UCB, and WMD-UCB against the bandit of Figure 2 (x-axis: time, y-axis: average reward). The WMD-UCB uses the threshold 0.3 and makes 1 query per step. Changes in the reward distribution are indicated by vertical lines, whereas instants at which the WMD-UCB algorithm resets are indicated by diamonds. The baseline of our notion of regret (1) is also plotted.

Figure 4. Average reward of the WMD-UCB algorithm with ℓ = 1, 2, and 3 queries per step against a 4-armed bandit (x-axis: time, y-axis: average reward). The threshold parameter is 0.2.

Figure 3 shows the evolution of the average reward of the three algorithms. In this experiment, the WMD-UCB algorithm queries only one arm per step; its average reward is close to optimal with respect to the baseline of (1). Figure 4 illustrates the benefit of increasing the number of queries per step of the WMD-UCB algorithm (with the interval length held fixed).

6. Discussions

The WMD algorithm uses a very simple scheme to detect changes in the mean. In its place, we may employ more sophisticated change-detection schemes, e.g., CUSUM (Page, 1954) and the Shiryayev-Roberts rule (Shiryayev, 1963). Modifications are nonetheless required to make them applicable to our problem: the reward distributions must be parametrized, and the pre-change distribution is unknown and must be estimated (cf. Mei, 2006). There also exist schemes that detect changes when the reward process follows one of many Markovian processes (Fuh, 2004), as is the case for restless bandit problems. Despite the drawback of complexity, these schemes detect changes with optimal delay, and do not require prior knowledge of the parameter ε of Assumption 4.2. Yet, they also incur a regret of the order of log(T) due to an inevitable logarithmic delay to detection (Lorden, 1971; Pollak, 1987). This provides, in our model, a lower bound on the regret of Ω(k log(T)) for every algorithm that detects the unknown changes and reacts thereafter.

The side information obtained through queries can be applied to two purposes: detecting changes and improving the performance of the multi-armed bandit algorithm of Assumption 4.1. In this paper, the queries serve only the purpose of change detection. Because of the aforementioned lower regret bound intrinsic to change-detection schemes, we have neglected the question of accelerating the exploration of the sub-algorithm of Assumption 4.1. The action elimination method of (Even-Dar et al., 2006) presents another possible improvement to the sub-algorithm of Assumption 4.1.
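For the CUSUM rule mentioned above, a minimal sketch of a two-sided mean-shift variant is given below (our illustration; the drift and threshold values are arbitrary, and a practical version would also have to estimate the pre-change mean, as the discussion notes).

```python
import random

def cusum_detect(samples, target_mean, drift, threshold):
    """Two-sided CUSUM test for a shift of the mean away from target_mean.

    Returns the index at which an alarm is raised, or None if no alarm.
    """
    g_pos, g_neg = 0.0, 0.0
    for t, x in enumerate(samples):
        # Accumulate evidence of an upward or downward shift, minus a drift term.
        g_pos = max(0.0, g_pos + (x - target_mean) - drift)
        g_neg = max(0.0, g_neg - (x - target_mean) - drift)
        if g_pos > threshold or g_neg > threshold:
            return t
    return None

# Example: Bernoulli(0.5) rewards that jump to Bernoulli(0.8) at step 200.
random.seed(0)
stream = [float(random.random() < 0.5) for _ in range(200)] + \
         [float(random.random() < 0.8) for _ in range(200)]
print(cusum_detect(stream, target_mean=0.5, drift=0.1, threshold=5.0))  # alarm shortly after 200
```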
As a further improvement to the detection component of the WMD algorithm, it is sufficient, when the distribution changes are not adversarial, to limit detections to changes after which the current best arm is no longer the best. Finally, it would be interesting to consider different models of querying for side information: for instance, the case where queries may succeed or fail according to an i.i.d. random sequence, or the case where the agent queries two arms and then picks the better of the two.

Acknowledgments

This research was funded by the NSERC Postgraduate Scholarship, the FQRNT Doctoral Research Scholarship, the ISF under grant 890015, and the Horev Fellowship. We thank the anonymous reviewers for their comments.

References

Akakpo, N. (2008). Detecting change-points in a discrete distribution via model selection. Preprint. http://arxiv.org/abs/0801.0970.

Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002a). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235-256.

Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002b). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32, 48-77.

Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge University Press.

Even-Dar, E., Mannor, S., & Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7, 1079-1105.

Fuh, C. D. (2004). Asymptotic operating characteristics of an optimal change point detection in hidden Markov models. Annals of Statistics, 2305-2339.

Garivier, A., & Moulines, E. (2008). On upper-confidence bound policies for non-stationary bandit problems. Preprint. http://arxiv.org/abs/0805.3415.

Hartland, C., Gelly, S., Baskiotis, N., Teytaud, O., & Sebag, M. (2006). Multi-armed bandit, dynamic environments and meta-bandits. Preprint. http://hal.archives-ouvertes.fr/hal-00113668/en/.

Herbster, M., & Warmuth, M. K. (1998). Tracking the best expert. Machine Learning, 32, 151-178.

Kocsis, L., & Szepesvári, C. (2006). Discounted-UCB. 2nd PASCAL Challenges Workshop.

Lai, T. L. (2001). Sequential analysis: some classical problems and new challenges. Statistica Sinica, 11, 303-408.

Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4-22.

Littlestone, N., & Warmuth, M. (1994). The weighted majority algorithm. Information and Computation, 108, 212-261.

Lorden, G. (1971). Procedures for reacting to a change in distribution. Annals of Mathematical Statistics, 42, 1897-1908.

Mei, Y. J. (2006). Sequential change-point detection when unknown parameters are present in the pre-change distribution. Annals of Statistics, 34, 92-122.

Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41, 100-115.

Pollak, M. (1987). Average run lengths of an optimal method of detecting a change in distribution. Annals of Statistics, 15, 749-779.

Shiryayev, A. N. (1963). On optimum methods in quickest detection problems. Theory of Probability and Its Applications, 8, 22-46.