Stochastic Linear Optimization under Bandit Feedback

Varsha Dani, Thomas P. Hayes, and Sham M. Kakade

Department of Computer Science, University of Chicago, varsha@cs.uchicago.edu
Toyota Technological Institute at Chicago, {hayest,sham}@tti-c.org

Abstract

In the classical stochastic $k$-armed bandit problem, in each of a sequence of $T$ rounds, a decision maker chooses one of $k$ arms and incurs a cost chosen from an unknown distribution associated with that arm. The goal is to minimize the regret, defined as the difference between the cost incurred by the algorithm and the optimal cost. In the linear optimization version of this problem (first considered by Auer [2002]), we view the arms as vectors in $\mathbb{R}^n$, and require that the costs be linear functions of the chosen vector. As before, it is assumed that the cost functions are sampled independently from an unknown distribution. In this setting, the goal is to find algorithms whose running time and regret behave well as functions of the number of rounds $T$ and the dimensionality $n$ (rather than the number of arms, $k$, which may be exponential in $n$ or even infinite). We give a nearly complete characterization of this problem in terms of both upper and lower bounds for the regret. In certain special cases (such as when the decision region is a polytope), the regret is $\mathrm{polylog}(T)$. In general, though, the optimal regret is $\Theta^*(\sqrt{T})$ -- our lower bounds rule out the possibility of obtaining $\mathrm{polylog}(T)$ rates in general. We present two variants of an algorithm based on the idea of "upper confidence bounds." The first, due to Auer [2002], but not fully analyzed there, obtains regret whose dependence on $n$ and $T$ is essentially optimal, but which may be computationally intractable when the decision set is a polytope. The second variant can be efficiently implemented when the decision set is a polytope (given as an intersection of half-spaces), but gives up a factor of $\sqrt{n}$ in the regret bound. Our results also extend to the setting where the set of allowed decisions may change over time.

1 Introduction

The seminal work of Robbins [1952] introduced a formalism for studying the sequential design of experiments, which is now referred to as the multi-armed bandit problem. In this foundational paradigm, at each time step a decision maker chooses one of $K$ decisions or "arms" (e.g. treatments, job schedules, manufacturing processes, etc.) and observes the loss of only the chosen decision. In the most unadorned model, it is assumed that the cost of each decision is independently sampled from some fixed underlying (and unknown) distribution, different for each decision. The goal of the decision maker is to minimize the average loss over some time horizon. This basic model of decision making under uncertainty already typifies the conflict between minimizing the immediate loss and gathering information that will be useful in the long run. This sequential design problem -- often referred to as the stochastic multi-armed bandit problem -- and a long line of successor bandit problems have been extensively studied in the statistics community (see, e.g., [Berry and Fristedt, 1985]), with close attention paid to obtaining sharp convergence rates. While this paradigm offers a formalism for a host of natural decision problems (e.g. clinical treatment, manufacturing processes, job scheduling), a vital issue to address for applicability to modern problems is how to tackle a set of feasible decisions that is often large (or infinite).
For example, the classical bandit problem of clinical treatments (often considered in statistics) -- where each decision is a choice of one of $K$ treatments -- is often better modelled by choosing from some (potentially infinite) set of mixed treatments subject to a budget constraint (where there is a cost per unit amount of each drug). In manufacturing problems, the goal is often to maximize revenue subject to choosing among some large set of decisions that satisfy certain manufacturing constraints (where the revenue from each decision may be unknown). A modern variant of this problem that is receiving increasing attention is the routing problem, where the goal is to send packets from A to B and the cost of each route is unknown (see, e.g., [Awerbuch and Kleinberg, 2004]).

We study a natural extension of the stochastic multi-armed bandit problem to linear optimization -- a problem first considered in Auer [2002]. Here, we assume the decision space is an arbitrary subset $D \subseteq \mathbb{R}^n$ and that there is a fixed distribution over cost functions. At each round, the learner chooses a decision $x \in D$; then a cost function $f(\cdot) : D \to [0,1]$ is sampled from this distribution. Only the loss $f(x)$ is revealed to the learner (and not the function $f(\cdot)$). We assume that the expected loss is a fixed linear function, i.e., that $\mathbb{E}[f(x)] = \mu \cdot x$, where the expectation is with respect to $f$ sampled from the distribution (technically, we make a slightly weaker assumption, stated precisely in the next section). The goal is to minimize the total loss over $T$ steps. As is standard, success is measured by the regret -- the difference between the performance of the learner and that of the optimal algorithm which has knowledge of the distribution. Note that the optimal algorithm here simply chooses the best decision with respect to the linear mean vector $\mu$.

Perhaps the most important and natural example in this paradigm is the (stochastic) online linear programming problem. Here, $D$ is specified by linear inequality constraints. If the mean $\mu$ were known, this would simply be a linear programming problem. Instead, at each round, the learner only observes noisy feedback for the chosen decision, with respect to the underlying linear cost function.

1.1 Summary of Our Results and Related Work

Auer [2002] provides the first analysis of this problem. This paper builds upon and improves the work of Auer [2002] in a number of ways. A related model was considered by Abe and Long [1999], where the decision sets are allowed to vary as a function of time. Our results can be extended to this more general model, which we discuss in Section 7.

While Auer [2002] provides an elegant deterministic algorithm, based on upper confidence bounds for $\mu$, an analysis of the performance of this algorithm was not provided, due to rather subtle independence issues (though it was conjectured that this simple algorithm was sufficient). Instead, a more complicated master algorithm was analyzed -- this master algorithm called the simpler upper confidence algorithm as a subroutine. In this work, we directly analyze the simpler upper confidence algorithm. Unfortunately, implementing this algorithm may be inefficient in certain cases (when $D$ is large or infinite). However, we also provide a modification of this algorithm (using a different confidence region based on the $\ell_1$-norm), which can be implemented efficiently when $D$ is an (infinite) convex set, given certain oracle optimization access to $D$.
The analysis of Auer [2002] achieves a regret bound of $O^*\!\left((\log|D|)^{3/2}\,\mathrm{poly}(n)\,\sqrt{T}\right)$, where $n$ is the dimension of the decision space, $T$ is the time horizon, and $|D|$ is the number of feasible decisions. For the simpler upper confidence algorithm, we show that it enjoys a bound of $O^*(n\sqrt{T})$, which does not depend on the cardinality of the decision region, $|D|$. While this algorithm may be inefficient in some cases, we also provide an efficient algorithm (using a slightly different confidence region) which achieves a slightly worse bound of $O^*(n^{3/2}\sqrt{T})$. Using the result in Auer [2002], one can also derive a bound of the form $O^*(\mathrm{poly}(n)\sqrt{T})$ for infinite decision sets by appealing to a naive (inefficient) covering argument, where the algorithm is run on an appropriately fine cover of $D$. However, this argument results in a less sharp bound in terms of $n$,¹ though a better reduction to Auer [2002] may be possible.

¹ Using Auer [2002], one can derive the less sharp bound of $O^*(n^{5/2}\sqrt{T})$ for arbitrary compact decision sets with two observations. First, through a covering argument, we need only consider $|D|$ exponential in $n$. Second, Auer [2002] assumes that $D$ is a subset of the sphere, which leads to an additional $\sqrt{n}$ factor. To see this, note that the comments at the beginning of Section 5 essentially show that a general decision region can be thought of as living in a hypercube (due to the barycentric spanner property), so the additional $\sqrt{n}$ factor comes from rescaling the cube into a sphere.

For the case of finite decision sets, such as the $K$-arm bandit case, a regret that is only logarithmic in the time horizon is achievable. In particular, in a different line of work, Auer et al. [2002] showed that the optimal regret for the $K$-arm bandit case is characterized as $\frac{K}{\Delta}\log T$, where $\Delta$ is the "gap" between the performance of the best arm and the second best arm. This result is stated in terms of the problem-dependent constant $\Delta$, so one can view it as the asymptotic regret for a given problem. In fact, historically, there is a long line of work in the $K$-arm bandit literature (e.g. [Lai and Robbins, 1985, Agrawal, 1995]) concerned with obtaining optimal rates for a fixed problem, which are often logarithmic in $T$ when stated in terms of some problem-dependent constant. Hence, in our setting, in the case where $|D|$ is finite, we know that a rate logarithmic in the time horizon is achievable by a direct reduction to the $K$-arm bandit case (though this naive reduction results in an exponentially worse dependence, in terms of $|D|$). Our work shows that a regret of $\frac{n^2}{\Delta}\,\mathrm{polylog}(T)$ can be achieved, where $\Delta$ is a generalized definition of the gap that is appropriate for a potentially infinite $D$. Hence, a polylogarithmic rate in $T$ is achievable with a constant that is only polynomial in $n$ and has no dependence on the size of the (potentially infinite) decision region. Here, $\Delta$ can be thought of as the gap between the values of the best and second best extremal points of the decision set (which we define precisely later). For example, if $D$ is a polytope, then $\Delta$ is the gap in value between the first and second best corner decisions. For the case where $D$ is finite, $\Delta$ is exactly the same as in the $K$-arm case. However, for some natural decision regions, such as a sphere, $\Delta$ is 0, so this (problem-dependent) bound is not applicable. Note that $\Delta$ is never 0 in the $K$-arm case (unless there is effectively one arm), so a logarithmic rate in $T$ is always possible in the $K$-arm case.

Note that this set of results still raises the question of whether there is an algorithm achieving polylogarithmic regret (as a function of $T$) for the case when $\Delta = 0$, which could be characterized in terms of some different, more appropriate problem-dependent constant. Our final contribution answers this question in the negative. We provide a lower bound showing that the regret of any algorithm on a particular problem (which we construct with $\Delta = 0$) is $\Omega(n\sqrt{T})$. In addition to showing that a polylogarithmic rate is not achievable in general, this also shows that our upper bound is tight in terms of $n$ and $T$. Note this result is in stark contrast to the $K$-arm case, where the optimal asymptotic regret for any given problem is always logarithmic in $T$.

We should also note that the lower bound in this paper is significantly stronger than the bound provided in Dani et al. [2008], which is also $\Omega(n\sqrt{T})$. In this latter lower bound, the decision problem the algorithm faces is chosen as a function of the time $T$. In particular, the construction in Dani et al. [2008] used a decision region which was a hypercube (so $\Delta > 0$, as this is a polytope) -- in fact, $\Delta$ actually scaled as $1/\sqrt{T}$. In order to negate the possibility of a polylogarithmic rate for a particular problem, we must hold $\Delta = 0$ fixed as we scale the time, which we accomplish in this paper with a more delicate construction using an $n$-dimensional decision space constructed out of a Cartesian product of 2-dimensional spheres.

1.2 The Price of Bandit Information

It is natural to ask how much worse the regret is in the bandit setting as compared to a setting where we receive full information about the complete loss function $f(\cdot)$ at the end of each round. In other words, what is the price of bandit information? For the full information case, Dani et al. [2008] showed the regret is $O^*(\sqrt{nT})$ (which is tight up to log factors). In fact, in the stochastic case considered here, it is not too difficult to show that, in the full information case, the simple algorithm of "doing the best on the past" achieves this rate. Hence, as the regret is $O^*(n\sqrt{T})$ in the bandit case and $O^*(\sqrt{nT})$ in the full information case (both of which are tight up to log factors), we have characterized the price of bandit information as $\sqrt{n}$, which is a rather mild dependence on $n$ for having such limited feedback.

We should also note that the work in Dani et al. [2008] considers the adversarial case, where the cost functions are chosen in an arbitrary manner rather than stochastically. There, it was shown that the regret in the bandit setting is $O^*(n^{3/2}\sqrt{T})$ (ignoring polylogarithmic factors), though it was conjectured that this bound was loose and that the optimal rate should be identical to the rate for the stochastic case considered here. It is striking that the convergence rate for the bandit setting is only a factor of $\sqrt{n}$ worse than in the full information case -- in stark contrast to the $K$-arm bandit setting, where the gap in the dependence on $K$ is exponential ($\sqrt{TK}$ vs. $\sqrt{T\log K}$). See Dani et al. [2008] for further discussion.

2 Preliminaries

Let $D \subseteq \mathbb{R}^n$ be a compact (but otherwise arbitrary) set of decisions. Without loss of generality, assume this set is of full rank. On each round, we must choose a decision $x_t \in D$. Each such choice results in a cost $\ell_t = c_t(x_t) \in [-1,1]$. We assume that, regardless of the history $H_t$, the conditional expectation of $c_t$ is a fixed linear function, i.e., for all $x \in D$,
$$\mathbb{E}\left(c_t(x) \mid H_t\right) = \mu \cdot x = \mu^\top x \in [-1,1],$$
where $x \in D$ is arbitrary, and we denote the transpose of any column vector $v$ by $v^\top$. (Naturally, the vector $\mu$ is unknown, though fixed.) Under these assumptions, the noise sequence
$$\eta_t = c_t(x_t) - \mu \cdot x_t$$
is a martingale difference sequence. [We remark here that our earlier assumption that $D$ is compact is actually unnecessary, in light of our further assumptions that the cost functions are bounded and linear in expectation.]

A special case of particular interest is when the cost functions $c_t$ are themselves linear functions sampled independently from some fixed distribution. Note, however, that our assumptions are also met under the addition of any time-dependent unbiased random noise function.

In this paper we address the bandit version of the geometric optimization problem, where the decision maker's feedback on each round is only the actual cost $\ell_t = c_t(x_t)$ received on that round, not the entire cost function $c_t(\cdot)$. If $x_1, \dots, x_T$ are the decisions made in the game, then define the cumulative regret by
$$R_T = \sum_{t=1}^T \left(\mu^\top x_t - \mu^\top x^*\right),$$
where $x^* \in D$ is an optimal decision for $\mu$, i.e.,
$$x^* \in \operatorname*{argmin}_{x \in D} \mu^\top x,$$
which exists since $D$ is compact. Observe that if the mean $\mu$ were known, then the optimal strategy would be to play $x^*$ every round. Since the expected loss for each decision $x$ equals $\mu^\top x$, the cumulative regret is just the difference between the expected loss of the optimal algorithm and the expected loss for the actual decisions $x_t$. Since the sequence of decisions $x_1, \dots, x_T$ may depend on the particular sequence of random noise encountered, $R_T$ is a random variable. Our goal in designing an algorithm is to keep $R_T$ as small as possible.

It is also important for us to make use of a barycentric spanner for $D$, as defined in Awerbuch and Kleinberg [2004]. A barycentric spanner for $D$ is a set of vectors $b_1, \dots, b_n$, all contained in $D$, such that every vector in $D$ can be expressed as a linear combination of the spanner with coefficients in $[-1,1]$. Awerbuch and Kleinberg [2004] showed that such a set exists for compact sets $D$. We assume we have access to such a spanner of the decision region, though an approximate spanner would suffice for our purposes (Awerbuch and Kleinberg [2004] provide an efficient algorithm for computing an approximate spanner).

Let $A$ be a positive definite $n \times n$ matrix, and let $\nu \in \mathbb{R}^n$. We will use the following notation for the 1- and 2-norms based on $A$:
$$\|\nu\|_{2,A} := \|A^{1/2}\nu\|_2 = \sqrt{\nu^\top A\, \nu}, \qquad \|\nu\|_{1,A} := \|A^{1/2}\nu\|_1 = \sum_{i=1}^n \left|(A^{1/2}\nu)_i\right|.$$
Here $A^{1/2}$ is the unique positive definite $n \times n$ matrix whose square is $A$.
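To make the preceding definitions concrete, here is a short Python sketch (our illustration, not part of the original text; NumPy is assumed, and the helper names are ours) of the two $A$-based norms and of the membership tests for the confidence balls $B_t^2$ and $B_t^1$ used by the algorithms of Section 3:

import numpy as np

def matrix_sqrt(A):
    # The unique positive definite square root A^{1/2} of a positive definite A.
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(np.sqrt(w)) @ Q.T

def norm_2A(v, A):
    # ||v||_{2,A} = ||A^{1/2} v||_2 = sqrt(v^T A v)
    return float(np.sqrt(v @ A @ v))

def norm_1A(v, A):
    # ||v||_{1,A} = ||A^{1/2} v||_1 = sum_i |(A^{1/2} v)_i|
    return float(np.abs(matrix_sqrt(A) @ v).sum())

def in_ball_2(nu, mu_hat, A, beta):
    # B_t^2 membership: ||nu - mu_hat||_{2,A} <= sqrt(beta)
    return norm_2A(nu - mu_hat, A) <= np.sqrt(beta)

def in_ball_1(nu, mu_hat, A, beta):
    # B_t^1 membership: ||nu - mu_hat||_{1,A} <= sqrt(n * beta)
    return norm_1A(nu - mu_hat, A) <= np.sqrt(len(nu) * beta)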
3 Main Results

3.1 Algorithms

We now present our main algorithms, ConfidenceBall$_2$ and ConfidenceBall$_1$. The subscripts on the names refer to the type of norm used in the algorithm; apart from scaling the radius differently, which we do only for convenience, this is the sole difference between the algorithm statements. As we shall discuss later, we are able to prove better regret guarantees for ConfidenceBall$_2$, matching the lower bound up to log factors. Both algorithms can be efficiently implemented in the simplistic case when the decision set is a small finite set. However, in the important special case when the decision set is a polytope presented as the intersection of halfspaces, ConfidenceBall$_1$ can be implemented in polynomial time, while ConfidenceBall$_2$ is NP-hard to implement, at least for some decision sets. More generally, ConfidenceBall$_1$ can be implemented efficiently given oracle access to an algorithm which can find a decision in $\operatorname*{argmin}_{x \in D} \nu \cdot x$ (where $\nu$ is the input). We discuss these issues further in Subsection 3.4.

Algorithm 3.1: ConfidenceBall$_2$($D$, $\delta$)
  Initialization: find a barycentric spanner $b_1, \dots, b_n$ for $D$;
    $A_1 = \sum_{i=1}^n b_i b_i^\top$;  $\hat\mu_1 = 0$.
  for $t = 1, 2, \dots$:
    $\beta_t = \max\left(128\, n \ln t \,\ln(t^2/\delta),\ \left(\frac{8}{3}\ln\frac{t^2}{\delta}\right)^2\right)$
    $B_t^2 = \left\{\nu : \|\nu - \hat\mu_t\|_{2,A_t} \le \sqrt{\beta_t}\right\}$
    $x_t = \operatorname*{argmin}_{x \in D} \min_{\nu \in B_t^2} \nu^\top x$
    Incur and observe loss $\ell_t := c_t(x_t)$
    $A_{t+1} = A_t + x_t x_t^\top$
    $\hat\mu_{t+1} = A_{t+1}^{-1} \sum_{\tau=1}^t \ell_\tau x_\tau$

Algorithm 3.2: ConfidenceBall$_1$($D$, $\delta$)
  Identical to Algorithm 3.1, except that the confidence region is the ball
    $B_t^1 = \left\{\nu : \|\nu - \hat\mu_t\|_{1,A_t} \le \sqrt{n\beta_t}\right\}$
  and the decision is $x_t = \operatorname*{argmin}_{x \in D} \min_{\nu \in B_t^1} \nu^\top x$.

ConfidenceBall$_2$ (Algorithm 3.1) is due to Auer [2002], who called it the LinRel algorithm. We have generalized the statement slightly so that it can be applied in settings where $D$ is not necessarily stored in enumerated form, and indeed, may not even be finite. We have renamed the algorithm ConfidenceBall$_2$ to emphasize its key feature of maintaining an $\ell_2$ ball, $B_t^2$, which contains $\mu$ with high probability.

The algorithm is motivated as follows. Suppose decisions $x_1, \dots, x_{t-1}$ have been made, incurring corresponding losses $\ell_1, \dots, \ell_{t-1}$. Then a reasonable estimate $\hat\mu$ of the true mean cost vector $\mu$ can be constructed by minimizing the square loss:
$$\hat\mu := \operatorname*{argmin}_{\nu} L(\nu), \quad \text{where} \quad L(\nu) := \sum_{\tau=1}^{t-1} \left(\nu \cdot x_\tau - \ell_\tau\right)^2.$$

3.2 Upper Bounds

Let $E$ denote the set of extremal points of $D$ (those points that cannot be expressed as proper convex combinations of other points of $D$), and define the set of suboptimal extremal points
$$E^- = \{x \in E : \mu \cdot x > \mu \cdot x^*\},$$
and note that $E^-$ is non-empty (unless $\mu = 0$, in which case there is nothing to prove). Define the gap, $\Delta$, as
$$\Delta = \inf_{x \in E^-} \mu \cdot x \;-\; \mu \cdot x^*,$$
so $\Delta$ is just the difference in costs between the optimal and next-to-optimal decision among the extremal points. Note that if $D$ is a fixed polytope then $\Delta > 0$. However, if $D$ is a ball then $\Delta = 0$, as all points on the surface (a sphere) are extremal -- so $\inf_{x \in E^-} \mu \cdot x = \mu \cdot x^*$ (and no point in $E^-$ achieves this value).

We now state the first upper bound, which is a problem-dependent bound stated in terms of $\Delta$.

Theorem 1 (Problem Dependent Upper Bound) Recall that
$$\beta_T = \max\left(128\, n \ln T \,\ln(T^2/\delta),\ \left(\frac{8}{3}\ln\frac{T^2}{\delta}\right)^2\right).$$
Let $0 < \delta < 1$. Suppose the decision set $D$ and the true mean $\mu$ have a gap $\Delta > 0$. We have:

- ConfidenceBall$_2$: For all sufficiently large $T$, the cumulative regret $R_T$ of ConfidenceBall$_2$($D$, $\delta$) is with high probability at most $O\!\left(\frac{n^2}{\Delta}\log^3 T\right)$. More precisely,
$$\mathrm{Prob}\left(\forall T,\ R_T \le \frac{8\, n\, \beta_T \ln T}{\Delta}\right) \ge 1 - \delta.$$

- ConfidenceBall$_1$: If ConfidenceBall$_1$ is implemented so that it only chooses extremal points $x_t \in D$ (which is always possible) then, for all sufficiently large $T$, the cumulative regret $R_T$ of ConfidenceBall$_1$($D$, $\delta$) is with high probability at most $O\!\left(\frac{n^3}{\Delta}\log^3 T\right)$. More precisely,
$$\mathrm{Prob}\left(\forall T,\ R_T \le \frac{8\, n^2 \beta_T \ln T}{\Delta}\right) \ge 1 - \delta.$$

Analogous to the $K$-arm case, when $\Delta > 0$, a polylogarithmic rate in $T$ is achievable with a constant that is only polynomial in $n$ and has no dependence on the size of the decision region. The following upper bound is stated without regard to the specific $\Delta$ for a given problem. Furthermore, it also holds for the case when $\Delta = 0$.

Theorem 2 (Problem Independent Upper Bound) Recall that
$$\beta_T = \max\left(128\, n \ln T \,\ln(T^2/\delta),\ \left(\frac{8}{3}\ln\frac{T^2}{\delta}\right)^2\right).$$
Let $0 < \delta < 1$. We have:

- ConfidenceBall$_2$: For all sufficiently large $T$, the cumulative regret $R_T$ of ConfidenceBall$_2$($D$, $\delta$) is with high probability at most $O^*(n\sqrt{T})$, where the $O^*$ notation hides a polylogarithmic dependence on $T$. More precisely,
$$\mathrm{Prob}\left(\forall T,\ R_T \le \sqrt{8\, n\, \beta_T\, T \ln T}\right) \ge 1 - \delta.$$

- ConfidenceBall$_1$: For all sufficiently large $T$, the cumulative regret $R_T$ of ConfidenceBall$_1$($D$, $\delta$) is with high probability at most $O^*(n^{3/2}\sqrt{T})$, where the $O^*$ notation hides a polylogarithmic dependence on $T$. More precisely,
$$\mathrm{Prob}\left(\forall T,\ R_T \le \sqrt{8\, n^2 \beta_T\, T \ln T}\right) \ge 1 - \delta.$$

The following subsection shows our bound of $O^*(n\sqrt{T})$ is tight, in terms of both $n$ and $T$. Also, as mentioned in the Introduction, tightly characterizing the dimensionality dependence allows us to show that the price of bandit information is $\Theta(\sqrt{n})$.

3.3 Lower Bounds

Note that our upper bounds still leave open the possibility that there is a polylogarithmic regret (as a function of $T$) for the case when $\Delta = 0$, which could be characterized in terms of some different, more appropriate problem-dependent constant. Our next result is a lower bound of $\Omega(n\sqrt{T})$ on the expected regret, showing that no such improvement is possible. For the lower bound, we must consider a decision region with $\Delta = 0$, which rules out polytopes and finite sets (so the decision region of a hypercube, used by Dani et al. [2008], is not appropriate here; see the Introduction for further discussion).

The decision region is constructed as follows. Assume $n$ is even. Let $D_n = (S^1)^{n/2}$ be the Cartesian product of $n/2$ circles. That is,
$$D_n = \left\{(x_1, \dots, x_n) : x_1^2 + x_2^2 = x_3^2 + x_4^2 = \cdots = x_{n-1}^2 + x_n^2 = 1\right\}.$$
Observe that $D_n$ is a subset of the intersection of the cube $[-1,1]^n$ with the sphere of radius $\sqrt{n/2}$ centered at the origin.

Our cost functions take values in $\{-1,+1\}$, and for every $x \in D_n$, the expected cost is $\mu \cdot x$, where $n\mu \in D_n$. Since each cost function is only evaluated at one point, any two distributions over $\{-1,+1\}$-valued cost functions with the same value of $\mu$ are equivalent for the purposes of our model.

Theorem 3 (Lower Bound) If $\mu$ is chosen uniformly at random from the set $D_n/n$, and the cost for each $x \in D_n$ is in $\{-1,+1\}$ with mean $\mu \cdot x$, then, for every algorithm, for every $T \ge 1$,
$$\mathbb{E}\, R = \mathbb{E}_\mu\, \mathbb{E}(R \mid \mu) \ge \frac{n\sqrt{T}}{10},$$
where the inner expectation is with respect to the observed costs.

In addition to showing that a polylogarithmic rate is not achievable in general, this bound shows our upper bound is tight in terms of $n$ and $T$. Again, contrast this with the $K$-arm case, where the optimal asymptotic regret for any given problem is always logarithmic in $T$.

3.4 Computational Efficiency

We now turn our attention to the computational complexity of implementing the ConfidenceBall algorithms. As discussed in Section 2, an approximate barycentric spanner can be found efficiently. Of all the other steps in the algorithm, the only one which poses serious difficulties is the selection of the decision $x_t$:
$$x_t := \operatorname*{argmin}_{x \in D} \min_{\nu \in B_t} \nu^\top x,$$
where $B_t$ is the confidence ball. Now, if $|D|$ is small, we can enumerate all choices for $x$, and the inner minimization is easy for both norms. This shows that an implementation in time $\mathrm{poly}(n)\,|D|$ is possible. There are also some special cases, such as when $D$ is the unit ball, when the algorithm can be implemented in time $\mathrm{poly}(n)$ using a little calculus, despite $|D|$ being infinite. We leave the details as an exercise for the interested reader.

The most practically relevant setting is when $D$ is (the vertex set of) a polytope defined by a system of linear inequalities (or equivalently, the intersection of a given set of halfspaces). In this case, the number of vertices of $D$ may be exponential in the number of inequalities. In this setting (and others), we can assume oracle access to an algorithm which can efficiently find a decision in $\operatorname*{argmin}_{x \in D} \nu \cdot x$ (where $\nu$ is the input). Here, in the case of ConfidenceBall$_1$, we can enumerate over the $2n$ vertices of $B_t^1$ to find the optimum: for each such vertex $\nu$, we call the oracle to find the optimal $x \in D$, and then we choose the best of these $2n$ candidate decisions. Thus, the decision can be found with $O(n)$ calls to this oracle, as illustrated in the sketch below.

On the other hand, for ConfidenceBall$_2$, the minimization problem can easily be seen to be polynomial-time equivalent to the negative definite linearly constrained quadratic programming problem
$$\text{minimize } -\|x - \hat\mu_t\|_{2,A_t}^2 \quad \text{subject to } Mx \le b \text{ and } \hat\mu_t^\top x \le C,$$
where $Mx \le b$ is the system defining the decision set $D$, and $C$ is a real parameter. Since Sahni [1974] proved that solving such programs is NP-hard, ConfidenceBall$_2$ may not be computationally practical for large $n$.
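To illustrate the oracle-based implementation of ConfidenceBall$_1$ just described, the following Python sketch (ours, under the stated assumptions; `oracle` is a hypothetical linear-minimization subroutine over $D$, e.g. an LP solver when $D$ is a polytope, and NumPy is assumed) selects the decision $x_t$ with $2n$ oracle calls by enumerating the vertices of $B_t^1$:

import numpy as np

def inv_sqrt(A):
    # A^{-1/2} for a positive definite matrix A, via eigendecomposition.
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(w ** -0.5) @ Q.T

def confidence_ball_1_decision(mu_hat, A, beta, oracle):
    # The ball B_t^1 = {nu : ||nu - mu_hat||_{1,A} <= sqrt(n*beta)} is a
    # polytope with 2n vertices mu_hat +/- sqrt(n*beta) * A^{-1/2} e_i.
    # For each fixed x, min_{nu in B_t^1} nu.x is attained at a vertex, so
    # min_{x in D} min_{nu in B_t^1} nu.x is found by calling the
    # linear-minimization oracle once per vertex and keeping the best pair.
    n = len(mu_hat)
    A_inv_sqrt = inv_sqrt(A)
    radius = np.sqrt(n * beta)
    best_val, best_x = np.inf, None
    for i in range(n):
        for sign in (1.0, -1.0):
            nu = mu_hat + sign * radius * A_inv_sqrt[:, i]  # a vertex of B_t^1
            x = oracle(nu)            # some x in argmin_{x in D} nu . x
            val = float(nu @ x)
            if val < best_val:
                best_val, best_x = val, x
    return best_x

Exchanging the order of the two minimizations is what makes this work: the outer search need only range over the $2n$ vertices of the confidence ball, and each inner minimization over $D$ is delegated to the oracle.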
4 Concentration of Martingales

In our analysis, we use the following Bernstein-type concentration inequality for martingale difference sequences, due to Freedman [1975] (see also [McDiarmid, 1998, Theorem 3.15]).

Theorem 4 (Freedman) Suppose $X_1, \dots, X_T$ is a martingale difference sequence, and $b$ is a uniform upper bound on the steps $X_i$. Let $V$ denote the sum of conditional variances,
$$V = \sum_{i=1}^T \mathrm{Var}\left(X_i \mid X_1, \dots, X_{i-1}\right).$$
Then, for every $a, v > 0$,
$$\mathrm{Prob}\left(\sum X_i \ge a \text{ and } V \le v\right) \le \exp\left(\frac{-a^2}{2v + 2ab/3}\right).$$

5 Upper Bound Analysis

Throughout the proof, without loss of generality, we assume that the barycentric spanner is the standard basis $e_1, \dots, e_n$ (this just amounts to a choice of coordinate system, where we identify the spanner with the standard basis). Hence, the decision set $D$ is a subset of the cube $[-1,1]^n$. In particular, this implies $\|x\| \le \sqrt{n}$ for all $x \in D$. This is really only a notational convenience; the problem is stated in terms of decisions in an abstract vector space, and expected costs in its dual, with no implicit standard basis.

The upper bounds follow from two main theorems. The first shows that the confidence region is appropriate. Let $\mathcal{E}$ be the event that for every time $t \le T$, the true mean $\mu$ lies in the confidence region, $B_t^2$ or $B_t^1$. The following shows that the event $\mathcal{E}$ occurs with high probability.

Theorem 5 (Confidence) Let $\delta > 0$.
- For ConfidenceBall$_2$, $\mathrm{Prob}\left(\forall t,\ \mu \in B_t^2\right) \ge 1 - \delta$.
- For ConfidenceBall$_1$, $\mathrm{Prob}\left(\forall t,\ \mu \in B_t^1\right) \ge 1 - \delta$.

Section 5.2 is devoted to establishing this confidence bound. In essence, the proof seeks to understand the growth of the quantity $(\hat\mu_t - \mu)^\top A_t (\hat\mu_t - \mu)$, which involves a rather technical construction of a martingale (using the matrix inversion lemma) along with a careful application of Freedman's inequality (Theorem 4).

The second main step in analyzing ConfidenceBall$_2$ is to show that as long as the aforementioned high-probability event holds, we have some control on the growth of the regret. Let
$$r_t = \mu \cdot x_t - \mu \cdot x^*$$
denote the instantaneous regret acquired by the algorithm on round $t$. The following bounds the sum of the squares of the instantaneous regrets.

Theorem 6 (Sum of Squares Regret Bound)
- For ConfidenceBall$_2$, if $\mu \in B_t^2$ for all $t \le T$, then $\sum_{t=1}^T r_t^2 \le 8 n \beta_T \ln T$.
- For ConfidenceBall$_1$, if $\mu \in B_t^1$ for all $t \le T$, then $\sum_{t=1}^T r_t^2 \le 8 n^2 \beta_T \ln T$.

This is proven in Section 5.1. The idea of the proof involves a potential function argument on the log volume (i.e. the log determinant) of the "precision matrix" $A_t$ (which tracks how accurate our estimates of $\mu$ are in each direction). The proof involves relating the growth of this volume to the regret.

At this point the proofs of Theorems 1 and 2 diverge. To show the former, we use the gap $\Delta$ to bound the regret in terms of $\sum_{t=1}^T r_t^2$. For the latter, we simply appeal to the Cauchy-Schwarz inequality. Using these two results we are able to prove our upper bounds as follows.

Proof (of Theorem 1): We only prove the result for ConfidenceBall$_2$, as the proof for ConfidenceBall$_1$ is analogous. Let us analyze $r_t = \mu \cdot x_t - \mu \cdot x^*$, the regret on round $t$. Since ConfidenceBall$_2$ always chooses a decision from $E$, either $\mu \cdot x_t = \mu \cdot x^*$ or $x_t \in E^-$, so that $\mu \cdot x_t - \mu \cdot x^* \ge \Delta$. Since $\Delta > 0$, it follows that either $r_t = 0$ or $r_t/\Delta \ge 1$, and in either case
$$r_t \le \frac{r_t^2}{\Delta}.$$
By Theorem 6, we see that if $\mu \in B_t^2$ for all $t \le T$, then
$$R_T = \sum_{t=1}^T r_t \le \frac{1}{\Delta}\sum_{t=1}^T r_t^2 \le \frac{8 n \beta_T \ln T}{\Delta}.$$
Applying Theorem 5, we see that this occurs with probability at least $1 - \delta$, which completes the proof.

Proof (of Theorem 2): We only prove the result for ConfidenceBall$_2$, as the proof for ConfidenceBall$_1$ is analogous. By Theorems 5 and 6, we know that with probability at least $1 - \delta$, $\sum_{t=1}^T r_t^2 \le 8 n \beta_T \ln T$. Applying the Cauchy-Schwarz inequality, we have, with probability at least $1 - \delta$,
$$R_T = \sum_{t=1}^T r_t \le \sqrt{T}\left(\sum_{t=1}^T r_t^2\right)^{1/2} \le \sqrt{8 n \beta_T T \ln T},$$
which completes the proof.

We now provide the proofs of these two theorems.

5.1 Proof of Theorem 6

In this section, we prove Theorem 6, which says that the sum of the squares of the instantaneous regrets of the algorithm is small, assuming the evolving confidence balls always contain the true mean $\mu$. A key insight is that on any round $t$ in which $\mu \in B_t$, the instantaneous regret is at most the "width" of the ellipsoid in the direction of the chosen decision. Moreover, the algorithm's choice of decisions forces the ellipsoids to shrink at a rate that ensures that the sum of the squares of the widths is small. We now formalize this.

Lemma 7 Let $x \in D$.
- For ConfidenceBall$_2$, if $\nu \in B_t^2$, then $|(\nu - \hat\mu_t)^\top x| \le \sqrt{\beta_t}\sqrt{x^\top A_t^{-1} x}$.
- For ConfidenceBall$_1$, if $\nu \in B_t^1$, then $|(\nu - \hat\mu_t)^\top x| \le \sqrt{n\beta_t}\sqrt{x^\top A_t^{-1} x}$.

Proof: Unless explicitly stated, all norms refer to the $\ell_2$ norm. For ConfidenceBall$_2$,
$$|(\nu - \hat\mu_t)^\top x| = \left|\left(A_t^{1/2}(\nu - \hat\mu_t)\right)^\top A_t^{-1/2} x\right| \le \left\|A_t^{1/2}(\nu - \hat\mu_t)\right\|\,\left\|A_t^{-1/2} x\right\| \le \sqrt{\beta_t}\sqrt{x^\top A_t^{-1} x},$$
where the first inequality is by Cauchy-Schwarz and the last holds since $\nu \in B_t^2$. For ConfidenceBall$_1$, by Hölder's inequality,
$$|(\nu - \hat\mu_t)^\top x| \le \left\|A_t^{1/2}(\nu - \hat\mu_t)\right\|_1 \left\|A_t^{-1/2} x\right\|_\infty \le \left\|A_t^{1/2}(\nu - \hat\mu_t)\right\|_1 \left\|A_t^{-1/2} x\right\| \le \sqrt{n\beta_t}\sqrt{x^\top A_t^{-1} x},$$
where the last inequality holds since $\nu \in B_t^1$. □

Define
$$w_t := \sqrt{x_t^\top A_t^{-1} x_t},$$
which we interpret as the "normalized width" at time $t$ in the direction of the chosen decision. The true width, $2\sqrt{\beta_t}\, w_t$, turns out to be an upper bound for the instantaneous regret.

Lemma 8 Fix $t$.
- For ConfidenceBall$_2$, if $\mu \in B_t^2$, then $r_t \le 2\min\left(\sqrt{\beta_t}\, w_t,\ 1\right)$.
- For ConfidenceBall$_1$, if $\mu \in B_t^1$, then $r_t \le 2\min\left(\sqrt{n\beta_t}\, w_t,\ 1\right)$.

Proof: Let $\tilde\mu \in B_t^2$ denote the vector which minimizes the dot product $\tilde\mu^\top x_t$. By the choice of $x_t$, we have
$$\tilde\mu^\top x_t = \min_{x \in D}\min_{\nu \in B_t^2} \nu^\top x \le \mu^\top x^*,$$
where the inequality uses the hypothesis $\mu \in B_t^2$. Hence
$$r_t = \mu^\top x_t - \mu^\top x^* \le \mu^\top x_t - \tilde\mu^\top x_t = (\mu - \hat\mu_t)^\top x_t + (\hat\mu_t - \tilde\mu)^\top x_t \le 2\sqrt{\beta_t}\, w_t,$$
applying Lemma 7 to both $\mu$ and $\tilde\mu$. Since the costs lie in $[-1,1]$, we also always have $r_t \le 2$. The argument for ConfidenceBall$_1$ is identical, with $\sqrt{n\beta_t}$ in place of $\sqrt{\beta_t}$. □

Lemma 11 For all $t$, $\det A_t \le t^n$.

Proof: The rank-one matrix $x_t x_t^\top$ has $x_t^\top x_t = \|x_t\|^2$ as its unique non-zero eigenvalue. Also, since we have identified the spanner with the standard basis, we have $A_1 = \sum_{i=1}^n b_i b_i^\top = I$. Since the trace is a linear operator, it follows that
$$\mathrm{trace}\, A_{t+1} = \mathrm{trace}\, A_t + \|x_t\|^2 \le \mathrm{trace}\, A_t + n,$$
since $\|x_t\|^2 \le n$ for every $x_t \in D \subseteq [-1,1]^n$. Hence $\mathrm{trace}\, A_t \le tn$. By the arithmetic-geometric mean inequality applied to the eigenvalues of $A_t$,
$$\det A_t \le \left(\frac{\mathrm{trace}\, A_t}{n}\right)^n \le t^n. \;\square$$
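As a quick numerical sanity check of Lemma 11 (a standalone illustration we add here, assuming NumPy; it is not part of the original analysis), the following simulation confirms that $\det A_t \le t^n$ for an arbitrary stream of decisions drawn from the cube $[-1,1]^n$:

import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 200
A = np.eye(n)  # A_1 = sum_i b_i b_i^T = I, once the spanner is the standard basis
for t in range(1, T + 1):
    # Lemma 11 asserts det A_t <= t^n, i.e. log det A_t <= n log t.
    sign, logdet = np.linalg.slogdet(A)
    assert sign > 0 and logdet <= n * np.log(t) + 1e-9
    x = rng.uniform(-1.0, 1.0, size=n)  # any decision in [-1,1]^n
    A += np.outer(x, x)                 # A_{t+1} = A_t + x_t x_t^T

The bound holds however the decisions are chosen, since it only uses $\mathrm{trace}\, A_{t+1} = \mathrm{trace}\, A_t + \|x_t\|^2 \le \mathrm{trace}\, A_t + n$ together with the arithmetic-geometric mean inequality.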
6 Lower Bound Analysis

We now prove Theorem 3, first for the two-dimensional case and then, in Section 6.2, in general.

6.1 The Case n = 2

Here the decision region is the circle $D_2 = S^1$, and $\mu$ is drawn uniformly from $S^1/2$. We condition on $\mu$ being one of two fixed vectors $\mu_1, \mu_2 \in S^1/2$, and write $\epsilon = \|\mu_1 - \mu_2\|$. Let $b_t$ denote the bias toward $\mu_1$ given the history $H_t$ of the game on rounds $1, \dots, t-1$:
$$b_t = \mathrm{Prob}(\mu = \mu_1 \mid H_t) - \mathrm{Prob}(\mu = \mu_2 \mid H_t).$$

Lemma 15 For all $\epsilon > 0$ and $t \ge 1$, for any sequence of decisions $x_1, \dots, x_t$ and outcomes $\ell_1, \dots, \ell_{t-1}$, the regret from round $t$ satisfies
$$\mathbb{E}_\mu\left(r_t \mid H_t\right) \ge \left(\frac{\epsilon^2}{16} + \frac{|b_{t+1} - b_t|^2}{16\,\epsilon^2}\right)\mathbf{1}\{|b_t| \le 1/2\}.$$

Proof: Let $v_1$ be the unit vector in the direction of $\mu_1 - \mu_2$, and let $v_2$ be the unit vector in the direction of $\mu_1 + \mu_2$. Note that $v_1, v_2$ is an orthonormal basis for the plane. Decompose $x_t = \alpha v_1 + \beta v_2$, and $\mathbb{E}(\mu \mid H_t) = \gamma v_1 + \delta v_2$. Since
$$\mathbb{E}(\mu \mid H_t) = \frac{\mu_1 + \mu_2}{2} + b_t\,\frac{\mu_1 - \mu_2}{2},$$
we have $\gamma = b_t\epsilon/2$ and $\delta = \sqrt{1 - \epsilon^2}/2$.

Let $p = \mathrm{Prob}(\mu = \mu_1 \mid H_t)$. Then $b_t = 2p - 1$. Observe that
$$b_{t+1} - b_t = \frac{p\left(1 + \ell_t \mu_1^\top x_t\right) - (1-p)\left(1 + \ell_t \mu_2^\top x_t\right)}{p\left(1 + \ell_t \mu_1^\top x_t\right) + (1-p)\left(1 + \ell_t \mu_2^\top x_t\right)} - (2p - 1) = \frac{2p(1-p)\,\ell_t\,(\mu_1 - \mu_2)^\top x_t}{1 + p\,\ell_t\,\mu_1^\top x_t + (1-p)\,\ell_t\,\mu_2^\top x_t}.$$
Since $|\mu_i^\top x_t| \le 1/2$, the denominator of the above expression is at least $1/2$. Since $p(1-p) \le 1/4$, it follows that
$$|b_{t+1} - b_t| \le |(\mu_1 - \mu_2)^\top x_t| = \epsilon|\alpha|. \qquad (3)$$

Assume the game history is such that $|b_t| \le 1/2$; otherwise, since the regret is non-negative, there is nothing to prove. Since the optimal decision for $\mu \in S^1/2$ has expected cost $-\|\mu\| = -1/2$,
$$\mathbb{E}_\mu(r_t \mid H_t) = \frac{1}{2} + x_t \cdot \mathbb{E}(\mu \mid H_t) = \frac{1}{2}\left(1 + \alpha b_t \epsilon + \beta\sqrt{1 - \epsilon^2}\right)$$
$$\ge \frac{1}{2}\left(1 + \alpha b_t \epsilon + \left(\frac{\alpha^2}{2} - 1\right)\sqrt{1 - \epsilon^2}\right) \qquad (4)$$
$$\ge \frac{1}{2}\left(1 + \alpha b_t \epsilon + \left(\frac{\alpha^2}{2} - 1\right)\left(1 - \frac{\epsilon^2}{2}\right)\right) = \frac{\alpha b_t \epsilon}{2} + \frac{\epsilon^2}{4} + \frac{\alpha^2}{4} - \frac{\alpha^2\epsilon^2}{8}$$
$$\ge \frac{1}{16}\left(\alpha^2 + \epsilon^2\right) \qquad (5)$$
$$\ge \frac{\epsilon^2}{16} + \frac{|b_{t+1} - b_t|^2}{16\,\epsilon^2}. \qquad (6)$$
Here (4) follows because $\alpha^2 + \beta^2 = 1$ implies that $\beta \ge -1 + \alpha^2/2$, with equality iff $\beta = -1$. Inequality (5) follows because $|b_t| \le 1/2$ and $|\alpha|, |\beta| \le 1$. Inequality (6) follows from (3), which completes the proof. □

We are now ready to prove Theorem 3 in the $n = 2$ case. We generalize the argument to $n$ dimensions in Section 6.2.

Proof (of Theorem 3 for n = 2): Let $\epsilon = T^{-1/4}$. First, observe that, by Fubini's theorem and linearity of expectation,
$$\mathbb{E}\,R = \sum_{t=1}^T \mathbb{E}_{H_t}\,\mathbb{E}_\mu(r_t \mid H_t) \ge \sum_{t=1}^T \mathbb{E}_{H_t}\left[\left(\frac{\epsilon^2}{16} + \frac{|b_{t+1} - b_t|^2}{16\epsilon^2}\right)\mathbf{1}\{|b_t| \le 1/2\}\right] \quad \text{(by Lemma 15)}$$
$$\ge \frac{\epsilon^2 T}{16}\,\mathrm{Prob}\left(\text{for all } t,\ |b_t| \le 1/2\right) + \frac{1}{16\epsilon^2}\sum_{t=1}^T \mathbb{E}_{H_t}\left[|b_{t+1} - b_t|^2\,\mathbf{1}\{|b_t| \le 1/2\}\right].$$
Thus, if $\mathrm{Prob}(\text{for all } t \le T,\ |b_t| \le 1/2) \ge 1/2 - 1/e$, then we are done by the first term on the right-hand side, since $\epsilon^2 T = \sqrt{T}$. Otherwise, with probability at least $1/2 + 1/e$, there exists $t \le T$ such that $|b_t| > 1/2$. By Freedman's Bernstein-type inequality for martingales (Theorem 4), applied to the stopped martingale $b_{\min(t,\tau)}$, where $\tau = \min\{t : |b_t| \ge 1/2\}$, we have
$$\mathrm{Prob}\left((\exists t \le T)\ |b_t| \ge \frac{1}{2} \text{ and } V \le \frac{1}{32}\right) \le 2\exp\left(\frac{-1/4}{1/16 + \epsilon/3}\right) \le \frac{2}{e^2},$$
where
$$V = \sum_{t=1}^T \mathbf{1}\{\forall \tau \le t,\ |b_\tau| \le 1/2\}\;\mathbb{E}\left(|b_{t+1} - b_t|^2 \mid H_t\right).$$
It follows that
$$\mathrm{Prob}\left(V > \frac{1}{32}\right) \ge \frac{1}{2} + \frac{1}{e} - \frac{2}{e^2} \ge \frac{1}{2}.$$
In particular,
$$\sum_{t=1}^T \mathbb{E}_{H_t}\left[|b_{t+1} - b_t|^2\,\mathbf{1}\{|b_t| \le 1/2\}\right] \ge \mathbb{E}\,V \ge \frac{1}{64},$$
so the second term on the right-hand side is at least $\frac{1}{16\epsilon^2}\cdot\frac{1}{64} = \Omega(\sqrt{T})$, completing the proof. □
6.2 General Case

Now suppose $n > 2$ is even. Fix an index $1 \le i \le n/2$, and consider the contribution to the total expected regret from the choice of $(x_{2i-1}, x_{2i})$, i.e., the component from the $i$'th circle. Analogously to the 2-dimensional case, we condition on the $i$'th component of $\mu$ being one of two vectors $\nu_1, \nu_2 \in S^1/n$. We further condition on the exact values of the other $n/2 - 1$ components of $\mu$. We denote $\epsilon = \|\nu_1 - \nu_2\|$, and let $b_t$ denote the bias toward $\nu_1$, given the history $H_t$ of the game on rounds $1, \dots, t-1$. That is,
$$b_t = \mathrm{Prob}\left(\mu_i = \nu_1 \mid H_t\right) - \mathrm{Prob}\left(\mu_i = \nu_2 \mid H_t\right).$$
Then we have the following analog of Lemma 15.

Lemma 16 For all $t$, for any sequence of decisions $x_1, \dots, x_t$ and outcomes $\ell_1, \dots, \ell_{t-1}$, the regret from round $t$ due to the $i$'th component of $x_t$ satisfies
$$\mathbb{E}_\mu\left(r_t^{(i)} \mid H_t\right) \ge \left(\frac{\epsilon^2}{64} + \frac{|b_{t+1} - b_t|^2}{64\,\epsilon^2}\right)\mathbf{1}\{|b_t| \le 1/2\}.$$

It follows along the same lines as before that the expected total regret from the $i$'th component is $\Omega(\sqrt{T})$. Summing over the $n/2$ possible values of $i$ completes the proof.

7 Extension: Time-Varying Decision Sets

Our techniques also apply to the setting where only a subset of the full decision set $D$ is available in each round. Suppose, at time $t$, only a subset of decisions $D_t \subseteq D$ is available. In this case, the correct notion of regret is to compare each chosen decision $x_t$, not with the global optimum $x^*$, but with the best choice from the available subset $D_t$. Thus
$$R_T = \sum_{t=1}^T \left(\mu^\top x_t - \mu^\top x_t^*\right),$$
where $x_t^* \in D_t$ is an optimal decision for $\mu$, i.e.,
$$x_t^* \in \operatorname*{argmin}_{x \in D_t} \mu^\top x.$$
The only change that needs to be made to our algorithms is that now $x_t$ is chosen from $D_t$ instead of $D$. With these changes in definitions, all of our numbered Theorems and Lemmas still hold, with $D$ replaced by $D_t$ and $x^*$ replaced by $x_t^*$ where they appear. (This is trivial in the case of the lower bounds.) The changes to the proofs are minimal.

We note that a very similar model was considered by Abe and Long [1999], who proved a lower bound of $\Omega(T^{3/4})$ in their setting. However, this does not contradict our results, because their lower bound requires the dimension $n$ to be a function of $T$.

References

Naoki Abe and Philip M. Long. Associative reinforcement learning using linear probabilistic concepts. In Proc. 16th International Conf. on Machine Learning, pages 3-11. Morgan Kaufmann, San Francisco, CA, 1999.

R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27:1054-1078, 1995.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397-422, 2002.

B. Awerbuch and R. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the 36th ACM Symposium on Theory of Computing (STOC), 2004.

Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Experiments. Springer, 1985.

V. Dani, T. P. Hayes, and S. M. Kakade. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems 20 (NIPS 2007), 2008. Available online at http://books.nips.cc/.

David A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100-118, 1975.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22, 1985.

Colin McDiarmid. Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics. Springer, 1998.

H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527-535, 1952.

Sartaj Sahni. Computationally related problems. SIAM Journal on Computing, 3(4):262-279, 1974.