The Price of Bandit Information for Online Optimization

Varsha Dani
Department of Computer Science, University of Chicago, Chicago, IL 60637
varsha@cs.uchicago.edu

Thomas P. Hayes
Toyota Technological Institute, Chicago, IL 60637
hayest@tti-c.org

Sham M. Kakade
Toyota Technological Institute, Chicago, IL 60637
sham@tti-c.org

Abstract

In the online linear optimization problem, a learner must choose, in each round, a decision from a set $D \subset \mathbb{R}^n$ in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information: how much worse the regret is in the bandit case as compared to the full information case. For the full information case, the upper bound on the regret is $O^*(\sqrt{nT})$, where $n$ is the ambient dimension and $T$ is the time horizon (throughout, $O^*$ suppresses logarithmic factors). For the bandit case, we present an algorithm which achieves $O^*(n^{3/2}\sqrt{T})$ regret; all previous (nontrivial) bounds here were $O(\mathrm{poly}(n)\,T^{2/3})$ or worse. It is striking that the convergence rate for the bandit setting is only a factor of $n$ worse than in the full information case, in stark contrast to the $K$-arm bandit setting, where the gap in the dependence on $K$ is exponential ($\sqrt{TK}$ vs. $\sqrt{T \log K}$). We also present lower bounds showing that this gap is at least $\sqrt{n}$, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems.

1 Introduction

In the online linear optimization problem (as in Kalai and Vempala [2005]), at each timestep the learner chooses a decision $x_t$ from a decision space $D \subseteq \mathbb{R}^n$ and incurs a cost $L_t \cdot x_t$, where the loss vector $L_t$ is in $\mathbb{R}^n$. This paper considers the case where the sequence of loss vectors $L_1, \ldots, L_T$ is arbitrary; that is, no statistical assumptions are made about the data generation process. The goal of the learner is to minimize her regret, the difference between her incurred loss on the sequence and the loss of the best single decision in hindsight. After playing $x_t$ at time $t$, the two most natural sources of feedback the learner may receive are either complete information about the loss vector $L_t$ (referred to as the full information case) or only the scalar feedback of the incurred loss $L_t \cdot x_t$ (referred to as the partial feedback or "bandit" case).

The online linear optimization problem has been receiving increasing attention as a paradigm for structured decision making in dynamic environments, with potential applications to network routing, path planning, job scheduling, etc.

                    |        K-Arm         |                        Linear Optimization
                    |   Full    | Partial  |    Full   | Partial (I.I.D.) | Partial (Expectation) | Partial (High Probability)
     Lower Bound    | √(T ln K) |  √(TK)   |   √(nT)   |       n√T        |         n√T           |          n√T
     Upper Bound    | √(T ln K) |  √(TK)   |   √(nT)   |       n√T        |      n^{3/2}√T        |       n^{3/2}√T
     Efficient Algo |    N/A    |   N/A    | Sometimes |       Yes        |      Sometimes        |           ?

Table 1: Summary of regret bounds. Only the leading dependence on $n$ and $T$ is shown (so some log factors are dropped). The results in bold are provided in this paper. The results for the $K$-arm case are from Freund and Schapire [1997] and Auer et al. [1998]. The I.I.D. column is the stochastic setting (where the loss vectors are drawn from some fixed underlying distribution); those results are from Dani et al. [2007].
The Expectation column refers to the expected regret for an arbitrary sequence of loss vectors (considered in this paper). The High Probability column follows from a forthcoming paper [Bartlett et al., 2007]; those results also hold in the adaptive adversary setting, where the loss vectors may change in response to the learner's previous decisions. The Efficient Algo row indicates whether there is an efficient implementation: "yes" means there is a polytime algorithm (for the stated upper bound) which only uses access to a certain optimization oracle (as in Kalai and Vempala [2005]), and "sometimes" means the algorithm can be implemented efficiently only in special cases (such as Path Planning). See text for further details.

This paper focuses on the fundamental regrets achievable for the online linear optimization problem in both the full and partial information feedback settings, as functions of both the dimensionality $n$ and the time horizon $T$. In particular, this paper is concerned with what might be termed the price of bandit information: how much worse the regret is in the partial information case as compared to the full information case.

In the $K$-arm case (where $D$ is the set of $K$ choices), much work has gone into obtaining sharp regret bounds. These results are summarized in the left two columns of Table 1. For the full information case, the exponential weights algorithm, Hedge, of Freund and Schapire [1997] provides the regret listed. For the partial information case, there is a long history of sharp regret bounds in various settings (particularly in statistical settings where i.i.d. assumptions are made), dating back to Robbins [1952]. In the (non-statistical) adversarial case, the algorithm of Auer et al. [1998] provides the regret listed in Table 1 for the partial information setting. This case has a convergence rate that is exponentially worse than the full information case (as a function of $K$).

There are a number of issues that we must address in obtaining sharp convergence rates for the online linear optimization problem. The first issue is understanding which quantities are natural to state upper and lower bounds in terms of. It is natural to consider the case where the loss is uniformly bounded (say in $[0,1]$). Clearly, the dimensionality $n$ and the time horizon $T$ are fundamental quantities. For the full information case, all previous bounds (see, e.g., Kalai and Vempala [2005]) also have dependencies on the diameters of the decision and cost spaces. It turns out that these are extraneous quantities: with the bounded loss assumption, one need not explicitly consider the diameters of the decision and cost spaces. Hence, even in the full information case, to obtain a sharp upper bound we need a new argument that yields an upper bound stated only in terms of $n$ and $T$ (and we do this via a relatively straightforward appeal to Hedge).

The second (and more technically demanding) issue is to obtain a sharp bound for the partial information case. Here, for the $K$-arm bandit case, the regret is $O(\sqrt{KT})$. Trivially, we can appeal to this result in the linear optimization case to obtain an $O(\sqrt{|D|\,T})$ regret by setting $K$ to be the size of $D$. However, this regret could have a very poor dependence on $n$, as $|D|$ could be exponential in $n$ (or worse). In contrast, note that in the full information case, we could appeal to the $K$-arm result to obtain an $O(\sqrt{T \log |D|})$ regret, which in many cases is acceptable (such as when $|D|$ is exponential in $n$).
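To make the exponential weights strategy concrete, here is a minimal Python sketch of Hedge run over an explicitly enumerated finite decision set in the full information setting. This illustrates the generic update only; the interface, learning rate, and loop structure are our own choices, not a tuned implementation of the bounds discussed in this paper.

```python
import numpy as np

def hedge(decisions, loss_vectors, eta, seed=0):
    """Hedge (exponential weights) over a finite decision set.

    decisions:    (|D|, n) array, one decision per row
    loss_vectors: (T, n) array; L_t is revealed after round t (full information)
    eta:          learning rate
    Returns the learner's total (randomized) loss.
    """
    rng = np.random.default_rng(seed)
    cum_loss = np.zeros(len(decisions))  # cumulative loss of every decision
    total = 0.0
    for L in loss_vectors:
        w = np.exp(-eta * (cum_loss - cum_loss.min()))  # shift for numerical stability
        p = w / w.sum()
        x = decisions[rng.choice(len(decisions), p=p)]
        total += L @ x                   # incur loss L_t . x_t
        cum_loss += decisions @ L        # full information: update every decision
    return total
```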
The primary motivation for different algorithms in the full information case (e.g., Kalai and Vempala [2005]) was computational efficiency. In contrast, in the partial information case, we seek a new algorithm simply in order to obtain a sharper convergence rate (though of course we are still also interested in efficient implementations). The goal here is to obtain a regret of $O^*(\mathrm{poly}(n)\sqrt{T})$. In fact, the partial information case (for linear optimization) has been receiving increasing interest in the literature [Awerbuch and Kleinberg, 2004, McMahan and Blum, 2004, Dani and Hayes, 2006]. Here, all regrets provided are $O(\mathrm{poly}(n)\,T^{2/3})$ or worse. We should note that some of the results here [Awerbuch and Kleinberg, 2004, Dani and Hayes, 2006] are stated in terms of only $n$ and $T$ (without referring to the diameters of various spaces). There is only one (non-trivial) special case [Gyorgy et al., 2007] in the literature where an $O^*(\mathrm{poly}(n)\sqrt{T})$ regret has been established, and this case assumes significantly more feedback than in the partial information case: their result is for Path Planning (where $D$ is the set of paths on a graph and $n$ is the number of edges), and their feedback model assumes that the learner receives the weight along each edge that is traversed (significantly more information than just the scalar loss). The current paper provides the first $O^*(\mathrm{poly}(n)\sqrt{T})$ regret for the general online linear optimization problem with scalar feedback; in particular, our algorithm has an expected regret of $O^*(n^{3/2}\sqrt{T})$.

The final issue to address is lower bounds, which are absent from the literature. This paper provides lower bounds for both the full and partial information cases. We believe these lower bounds are tight, up to log factors.

We have attempted to summarize the extant results in the literature (along with the results in this paper) in Table 1. We believe that we have a near complete picture of the achievable rates. One striking result is that the price of bandit information is relatively small: the upper bound is only a factor of $n$ worse than in the full information case. In fact, the lower bounds suggest the partial feedback case is only worse by a factor of $\sqrt{n}$. Contrast this with the $K$-arm case, where the full information setting does exponentially better as a function of $K$. As we believe that the lower bounds are sharp, we conjecture that the price of bandit information is only $\sqrt{n}$. Part of our reasoning is due to our previous result [Dani et al., 2007] in the i.i.d. case (where the linear loss functions are sampled from a fixed, time-invariant distribution): there, we provided an upper bound on the regret of only $O^*(n\sqrt{T})$. That bound was achieved by a deterministic algorithm which generalizes the celebrated algorithm of Lai and Robbins [1985] for the $K$-arm case (in the i.i.d. setting).

Finally, we should note that this paper primarily focuses on the achievable regrets, not on efficient implementations. In much of the previous work in the literature (for both the full and partial information cases), the algorithms can be implemented efficiently given access to a certain optimization oracle. We are not certain whether our algorithms can be implemented efficiently, in general, with only this oracle access. However, as our algorithms use the Hedge algorithm of Freund and Schapire [1997], for certain important applications efficient implementations do exist, based on dynamic programming. Examples include problems such as Path Planning (for instance, in routing network traffic) and Markov Decision Problems, one of the fundamental models for long-term planning in AI. This idea was developed by Takimoto and Warmuth [2003] and also applied by Gyorgy et al. [2007] (mentioned earlier) for Path Planning; the extension to Markov Decision Problems is relatively straightforward (based on dynamic programming).
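To illustrate the dynamic-programming idea in the path planning case, here is a Python sketch of how Hedge's distribution over the (possibly exponentially many) source-sink paths of a DAG can be sampled in time polynomial in the number of edges. It follows the path-kernel idea of Takimoto and Warmuth [2003]; the graph representation and all identifiers are our own illustrative choices.

```python
import numpy as np

def sample_path(edges, n_nodes, source, sink, eta, cum_edge_loss, rng):
    """Sample a source->sink path with probability proportional to
    exp(-eta * total estimated loss of the path), without enumerating paths.

    edges:         list of (u, v) pairs forming a DAG, topologically
                   ordered so that u < v for every edge; sink has no out-edges
    cum_edge_loss: cumulative (estimated) loss of each edge so far
    """
    w = np.exp(-eta * np.asarray(cum_edge_loss))  # per-edge Hedge weights
    # Backward DP: Z[v] = total weight of all v->sink paths, so a path's
    # weight factorizes over its edges.
    Z = np.zeros(n_nodes)
    Z[sink] = 1.0
    for v in range(n_nodes - 1, -1, -1):
        if v == sink:
            continue
        for i, (a, b) in enumerate(edges):
            if a == v:
                Z[v] += w[i] * Z[b]
    # Forward sampling: leave node v via edge (v, b) with probability
    # w[i] * Z[b] / Z[v]; the product over the walk is exactly the Hedge
    # probability of the sampled path.
    path, v = [], source
    while v != sink:
        out = [(i, b) for i, (a, b) in enumerate(edges) if a == v]
        pr = np.array([w[i] * Z[b] for i, b in out])
        k = rng.choice(len(out), p=pr / pr.sum())
        path.append(out[k][0])
        v = out[k][1]
    return path  # list of edge indices
```

Since the per-edge weights can also be updated multiplicatively after each round, the full Hedge update takes only polynomial time in this setting.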
The paper is organized as follows. In Section 2, we give a formal description of the problem. In Section 3, we present upper bounds for both the full information and bandit settings. Finally, in Section 4, we present lower bounds for both settings. All results in this paper are summarized in Table 1 (along with previous work).

2 Preliminaries

Let $D \subseteq \mathbb{R}^n$ denote the decision space. The learner plays the following $T$-round game against an oblivious adversary. First, the adversary chooses a sequence $L_1, \ldots, L_T$ of loss vectors in $\mathbb{R}^n$. We assume that the loss vectors are admissible, meaning they satisfy the boundedness property that for each $t$ and for all $x \in D$, $0 \le L_t \cdot x = L_t^\top x \le 1$. On each round $t$, the learner must choose a decision $x_t$ in $D$, which results in a loss of $\ell_t = L_t^\top x_t$. Throughout the paper we represent $x \in D$ and $L_t$ as column vectors and use $v^\top$ to denote the transpose of a column vector $v$.

In the full information case, $L_t$ is revealed to the learner after time $t$. In the partial information case, only the incurred loss $\ell_t$ (and not the vector $L_t$) is revealed.

If $x_1, \ldots, x_T$ are the decisions the learner makes in the game, then the total loss is $\sum_{t=1}^T L_t^\top x_t$. The cumulative regret is defined by
$$ R = \sum_{t=1}^T L_t^\top x_t \; - \; \min_{x \in D} \sum_{t=1}^T L_t^\top x. $$
In other words, the learner's loss is compared to the loss of the best single decision in hindsight. The goal of the learner is to make a sequence of decisions that guarantees low regret. For the partial information case, our upper bounds on the regret are only statements that hold in expectation (with respect to the learner's randomness). The lower bounds provided hold with high probability.

This paper also assumes the learner has access to a barycentric spanner (as defined by Awerbuch and Kleinberg [2004]) of the decision region; such a spanner is useful for exploration. This is a subset of $n$ linearly independent vectors of the decision space, such that every vector in the decision space can be expressed as a linear combination of elements of the spanner with coefficients in $[-1,1]$. Awerbuch and Kleinberg [2004] showed that any full rank compact set in $\mathbb{R}^n$ has a barycentric spanner. Furthermore, an almost barycentric spanner (where the coefficients are in $[-2,2]$) can be found efficiently (with certain oracle access). In view of these remarks, we assume without loss of generality that $D$ contains the standard basis vectors $e_1, \ldots, e_n$ and that $D \subseteq [-1,1]^n$. We refer to the set $\{e_1, \ldots, e_n\}$ as the spanner. Note that with this assumption, $\|x\|_2 \le \sqrt{n}$ for all $x \in D$.
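For intuition about how such a spanner can be found, the following brute-force Python sketch computes a $C$-approximate barycentric spanner for a small, finite, full-rank decision set by the determinant-swapping idea of Awerbuch and Kleinberg [2004]. Their actual construction requires only an optimization oracle rather than enumeration of $D$; this sketch, including all identifiers, is our own.

```python
import numpy as np

def approx_barycentric_spanner(D, C=2.0):
    """Return indices of a C-approximate barycentric spanner of D.

    D: (m, n) array of decisions, assumed to span R^n. Every row of D is
    then a combination of the returned rows with coefficients in [-C, C].
    """
    m, n = D.shape
    # Greedily pick rows of D that keep the chosen set full rank.
    idx = []
    for j in range(m):
        if np.linalg.matrix_rank(D[idx + [j]]) > len(idx):
            idx.append(j)
        if len(idx) == n:
            break
    X = D[idx].T  # columns form the current candidate spanner
    # Swap in any point that grows |det| by more than a factor of C;
    # |det| increases geometrically, so this terminates.
    improved = True
    while improved:
        improved = False
        for i in range(n):
            for j in range(m):
                Y = X.copy()
                Y[:, i] = D[j]
                if abs(np.linalg.det(Y)) > C * abs(np.linalg.det(X)):
                    X, idx[i], improved = Y, j, True
    return idx
```

Why the $[-C, C]$ guarantee holds: at termination, replacing the $i$-th spanner element by any $x \in D$ changes $|\det|$ by at most a factor of $C$, and by Cramer's rule that factor equals the magnitude of the $i$-th coefficient of $x$ in the spanner basis.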
3 Upper Bounds

The decision set $D$ may be potentially large or even uncountably infinite. However, for the purposes of designing algorithms with sharp regret bounds, the following lemma shows that we need only concern ourselves with finite decision sets: any decision set may be approximated to sufficiently high accuracy by a suitably small set (which is a $1/\sqrt{T}$-net for $D$).

Lemma 3.1. Let $D \subseteq [-1,1]^n$ be an arbitrary decision set. Then there is a set $\tilde{D} \subseteq D$ of size at most $(4nT)^{n/2}$ such that, for every sequence of admissible loss vectors, the optimal loss for $\tilde{D}$ is within an additive $\sqrt{nT}$ of the optimal loss for $D$.

Proof sketch. For each $x \in D$, truncate each coordinate of $x$ to its first $\frac{1}{2}\log(nT)$ bits. From all $x \in D$ which result in the same truncated representation, select a single representative to be included in $\tilde{D}$. This results in a set $\tilde{D}$ of size at most $(4nT)^{n/2}$ which is a $1/\sqrt{T}$-net for $D$; that is, every $x \in D$ is at distance at most $1/\sqrt{T}$ from its nearest neighbor in $\tilde{D}$. Since an admissible loss vector has norm at most $\sqrt{n}$, summing over the $T$ rounds of the game, we see that the optimal loss for $\tilde{D}$ is within an additive $\sqrt{nT}$ of the optimal loss for $D$.

For implementation purposes, it may be impractical to store the decision set (or the covering net of the decision set) explicitly as a list of points. However, our algorithms only require the ability to sample from a specific distribution over the decision set. Furthermore, in many cases of interest the full decision set is finite and only exponential in $n$, so we can work directly with $D$ (rather than with a cover of $D$). As discussed in the Introduction, in many important cases this can actually be accomplished using time and space which are only logarithmic in $|D|$; this is because Hedge can be implemented efficiently in these special cases.

3.1 With Full Information

In the full information setting, the algorithm Hedge of Freund and Schapire [1997] guarantees a regret of at most $O(\sqrt{T \log |D|})$. Since we may modify $D$ so that $\log |D|$ is $O(n \log n \log T)$, this gives us regret $O^*(\sqrt{nT})$. Note that we are only concerned with the regret here; Hedge may in general be quite inefficient to implement. However, in many special cases of interest, efficient implementations are in fact possible, as discussed in the Introduction.

We also note that under the relatively minor assumption of the existence of an oracle for offline optimization, the algorithm of Kalai and Vempala [2005] is an efficient algorithm for this setting. However, it appears that their regret is $O(n\sqrt{T})$ rather than $O(\sqrt{nT})$: their regret bounds are stated in terms of the diameters of the decision and cost spaces, but we can bound these in terms of $n$, which leads to the $O(n\sqrt{T})$ regret for their algorithm.

3.2 With Bandit Information

We now present the Geometric Hedge algorithm (shown in Algorithm 3.1), which achieves low expected regret for the setting where only the observed loss $\ell_t = L_t \cdot x_t$ is received as feedback. This algorithm is motivated by the algorithms in Auer et al. [1998] (designed for the $K$-arm case), which use Hedge (with estimated losses) along with a probability of exploration.

Algorithm 3.1: GeometricHedge($D$, $\eta$, $\gamma$)

    $\forall x \in D$, $\hat{p}_1(x) \leftarrow 1/|D|$
    for $t \leftarrow 1$ to $T$:
        $\forall x \in D$, $p_t(x) = (1-\gamma)\,\hat{p}_t(x) + \frac{\gamma}{n}\,\mathbb{1}\{x \in \mathrm{spanner}\}$
        Sample $x_t$ according to distribution $p_t$
        Incur and observe loss $\ell_t := L_t^\top x_t$
        $C_t := \mathbb{E}_{p_t}[x x^\top]$
        $\hat{L}_t := \ell_t\, C_t^{-1} x_t$
        $\forall x \in D$, $\hat{p}_{t+1}(x) \propto \hat{p}_t(x)\, e^{-\eta\, \hat{L}_t \cdot x}$

In the Geometric Hedge algorithm, there is a probability of $\gamma$ of exploring with the spanner on each round (motivated by Awerbuch and Kleinberg [2004]). The estimated losses fed into Hedge are determined by the estimator $\hat{L}_t$ of $L_t$. Note that the algorithm is well defined, as $C_t$ is always non-singular.
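To make the sampling and estimation steps concrete, here is a Python sketch of Algorithm 3.1 for a small, explicitly enumerated decision set whose first $n$ rows are assumed to be the spanner $e_1, \ldots, e_n$. The interface (a loss_fn callback returning only the scalar loss) is our own illustrative choice.

```python
import numpy as np

def geometric_hedge(D, T, eta, gamma, loss_fn, seed=0):
    """Sketch of Algorithm 3.1 (Geometric Hedge) for a finite decision set.

    D:       (m, n) array of decisions; rows 0..n-1 are e_1, ..., e_n
    loss_fn: loss_fn(t, x) returns the scalar loss L_t . x (bandit feedback)
    """
    rng = np.random.default_rng(seed)
    m, n = D.shape
    p_hat = np.full(m, 1.0 / m)
    total = 0.0
    for t in range(T):
        p = (1 - gamma) * p_hat
        p[:n] += gamma / n                    # mix in spanner exploration
        i = rng.choice(m, p=p)
        x = D[i]
        loss = loss_fn(t, x)                  # only this scalar is observed
        total += loss
        C = (D * p[:, None]).T @ D            # C_t = E_p[x x^T]
        L_hat = loss * np.linalg.solve(C, x)  # L_hat_t = l_t C_t^{-1} x_t
        s = D @ L_hat                         # estimated loss of every decision
        p_hat *= np.exp(-eta * (s - s.min()))  # Hedge update (shifted for stability)
        p_hat /= p_hat.sum()
    return total
```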
The following lemma shows why this estimator is sensible.

Lemma 3.2. On each round $t$, $\hat{L}_t$ is an unbiased estimator for the true loss vector $L_t$.

Proof. Since $\hat{L}_t = \ell_t C_t^{-1} x_t = (L_t \cdot x_t)\, C_t^{-1} x_t = C_t^{-1} x_t (x_t^\top L_t)$, we have
$$ \mathbb{E}[\hat{L}_t] = \mathbb{E}[C_t^{-1} x_t (x_t^\top L_t)] = C_t^{-1}\, \mathbb{E}[x_t x_t^\top]\, L_t = C_t^{-1} C_t L_t = L_t, $$
where all the expectations are over the random choice of $x_t$ drawn from $p_t$.

In the $K$-arm case, where $n = K$ and $D = \{e_1, \ldots, e_K\}$, Algorithm 3.1 specializes to the Exp3 algorithm of Auer et al. [1998]. Note that if $|D|$ is exponential in the dimension $n$, then in general maintaining and sampling from the distributions $\hat{p}_t$ and $p_t$ is very expensive in terms of running time. However, in many special cases of interest this can actually be implemented efficiently. We now state the main technical result of the paper.

Theorem 3.3. Let $\gamma = n^{3/2}/\sqrt{T}$ and $\eta = 1/\sqrt{nT}$ in Algorithm 3.1. For any sequence $L_1, \ldots, L_T$ of admissible loss vectors, let $R$ denote the regret of Algorithm 3.1 on this sequence. Then
$$ \mathbb{E}\,R \ \le\ \ln|D|\,\sqrt{nT} + 2 n^{3/2} \sqrt{T}. $$

As before, since we may replace $D$ with a set of size $O((nT)^{n/2})$ at an additional regret cost of only $\sqrt{nT}$, the regret is $O^*(n^{3/2}\sqrt{T})$. Moreover, if $|D| \le c^n$ for some constant $c$, as is the case for the online shortest path problem, then $\mathbb{E}\,R = O(n^{3/2}\sqrt{T})$.

3.3 Analysis of Algorithm 3.1

In this section, we prove Theorem 3.3. We start by providing the following bound on the sizes of the estimated loss vectors used by Algorithm 3.1.

Lemma 3.4. For each $x \in D$ and $1 \le t \le T$, the estimated loss vector $\hat{L}_t$ satisfies
$$ |\hat{L}_t \cdot x| \ \le\ \frac{n^2}{\gamma}. $$

Proof. First, let us examine $C_t$. Let $\lambda_1, \ldots, \lambda_n$ be the eigenvalues of $C_t$, and $v_1, \ldots, v_n$ the corresponding (orthonormal) eigenvectors. Since $C_t := \mathbb{E}_{p_t}[x x^\top]$ and $\lambda_i = v_i^\top C_t v_i$, we have
$$ \lambda_i = v_i^\top\, \mathbb{E}_{p_t}[x x^\top]\, v_i = \sum_{x \in D} p_t(x)\, (x \cdot v_i)^2 \qquad (1) $$
and so
$$ \lambda_i = \sum_{x \in D} p_t(x)(x \cdot v_i)^2 \ \ge\ \sum_{x \in \mathrm{spanner}} p_t(x)(x \cdot v_i)^2 \ \ge\ \frac{\gamma}{n} \sum_{j=1}^n (e_j \cdot v_i)^2 = \frac{\gamma}{n}\,\|v_i\|^2 = \frac{\gamma}{n}. $$
It follows that the eigenvalues $\lambda_1^{-1}, \ldots, \lambda_n^{-1}$ of $C_t^{-1}$ are each at most $n/\gamma$. Hence, for each $x \in D$,
$$ |\hat{L}_t \cdot x| = |\ell_t\, C_t^{-1} x_t \cdot x| \ \le\ |\ell_t|\, \frac{n}{\gamma}\, \|x_t\|_2\, \|x\|_2 \ \le\ \frac{n^2}{\gamma}, $$
where we have used the upper bound on the eigenvalues of $C_t^{-1}$ and the bound $\|x\|_2 \le \sqrt{n}$ for $x \in D$.

The following proposition is Theorem 3.1 of Auer et al. [1998], restated in our notation (for losses instead of gains). We state it here without proof. Denote
$$ \Phi_M(\eta) := \frac{e^{\eta M} - 1 - \eta M}{\eta M^2}. $$

Proposition 3.5 (from Auer et al. [1998]). For every $x \in D$, the sequence of estimated loss vectors $\hat{L}_1, \ldots, \hat{L}_T$ and the probability distributions $\hat{p}_1, \ldots, \hat{p}_T$ satisfy
$$ \sum_{t=1}^T \sum_{x' \in D} \hat{p}_t(x')\, \hat{L}_t \cdot x' \ \le\ \sum_{t=1}^T \hat{L}_t \cdot x \ +\ \frac{\ln |D|}{\eta} \ +\ \Phi_M(\eta) \sum_{t=1}^T \sum_{x' \in D} \hat{p}_t(x')\, (\hat{L}_t \cdot x')^2, $$
where $M = n^2/\gamma$ is an upper bound on $|\hat{L}_t \cdot x'|$.

Before completing the proof, two technical lemmas are useful.

Lemma 3.6. For each $x \in D$ and $1 \le t \le T$,
$$ \mathbb{E}_{x_t \sim p_t}\big[(\hat{L}_t \cdot x)^2\big] \ \le\ x^\top C_t^{-1} x. $$

Proof. Using $(\hat{L}_t \cdot x)^2 = x^\top \hat{L}_t \hat{L}_t^\top x$, we have
$$ x^\top\, \mathbb{E}[\hat{L}_t \hat{L}_t^\top]\, x = x^\top\, \mathbb{E}[\ell_t^2\, C_t^{-1} x_t x_t^\top C_t^{-1}]\, x \ \le\ x^\top C_t^{-1}\, \mathbb{E}[x_t x_t^\top]\, C_t^{-1} x = x^\top C_t^{-1} x, $$
where the inequality uses $\ell_t^2 \le 1$.

Lemma 3.7. For each $1 \le t \le T$,
$$ \sum_{x \in D} p_t(x)\, x^\top C_t^{-1} x = n. $$

Proof. The singular value decomposition of $C_t^{-1}$ is $V B V^\top$, where $B$ is diagonal (with the inverse eigenvalues $\lambda_i^{-1}$ as its diagonal entries) and $V$ is orthogonal (with the eigenvectors $v_i$ as its columns). This implies that $x^\top C_t^{-1} x = \sum_i \lambda_i^{-1} (x \cdot v_i)^2$. Using Equation (1), it follows that
$$ \sum_{x \in D} p_t(x)\, x^\top C_t^{-1} x = \sum_{x \in D} p_t(x) \sum_{i=1}^n \lambda_i^{-1} (x \cdot v_i)^2 = \sum_{i=1}^n \lambda_i^{-1} \sum_{x \in D} p_t(x)(x \cdot v_i)^2 = \sum_{i=1}^n \lambda_i^{-1} \lambda_i = n. $$
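Both Lemma 3.2 and Lemma 3.7 are exact identities, so they are easy to confirm numerically; the following quick check (with an arbitrary toy decision set, distribution, and loss vector of our choosing) does so.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
D = np.vstack([np.eye(n), rng.uniform(-1, 1, size=(5, n))])  # spanner + extras
p = rng.dirichlet(np.ones(len(D)))       # an arbitrary full-support p_t
L = rng.uniform(0, 1 / n, size=n)        # an arbitrary loss vector

C = (D * p[:, None]).T @ D               # C_t = E_p[x x^T]
Cinv = np.linalg.inv(C)

# Lemma 3.2: E[L_hat_t] = sum_x p(x) (L . x) C^{-1} x equals L.
L_hat_mean = sum(p[i] * (L @ D[i]) * (Cinv @ D[i]) for i in range(len(D)))
assert np.allclose(L_hat_mean, L)

# Lemma 3.7: sum_x p(x) x^T C^{-1} x equals n (it is tr(C^{-1} C) = tr(I)).
quad = sum(p[i] * D[i] @ Cinv @ D[i] for i in range(len(D)))
assert np.isclose(quad, n)
```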
We are now ready to complete the proof of Theorem 3.3.

Proof. For any $x \in D$,
$$ \sum_{t,x'} p_t(x')\, \hat{L}_t \cdot x' \ =\ \sum_{t,x'} \Big[ (1-\gamma)\,\hat{p}_t(x') + \frac{\gamma}{n}\, \mathbb{1}\{\exists j : x' = e_j\} \Big]\, \hat{L}_t \cdot x' $$
$$ \le\ (1-\gamma)\Big[ \sum_{t} \hat{L}_t \cdot x + \frac{\ln|D|}{\eta} + \Phi_M(\eta) \sum_{t,x'} \hat{p}_t(x')\, (\hat{L}_t \cdot x')^2 \Big] \ +\ \frac{\gamma}{n} \sum_{t=1}^T \sum_{j=1}^n \hat{L}_t \cdot e_j $$
$$ \le\ \sum_{t} \hat{L}_t \cdot x + \frac{\ln|D|}{\eta} + \Phi_M(\eta) \sum_{t,x'} p_t(x')\, (\hat{L}_t \cdot x')^2 \ +\ \frac{\gamma}{n} \sum_{t=1}^T \sum_{j=1}^n \hat{L}_t \cdot e_j, $$
where the second step applies Proposition 3.5 and the last step uses $(1-\gamma)\,\hat{p}_t(x') \le p_t(x')$.

Taking expectations and using the unbiasedness of $\hat{L}_t$ (Lemma 3.2),
$$ \mathbb{E}\Big[\sum_{t,x'} p_t(x')\, \hat{L}_t \cdot x'\Big] \ \le\ \sum_t L_t \cdot x + \frac{\ln|D|}{\eta} + \Phi_M(\eta)\, \mathbb{E}\Big[\sum_{t,x'} p_t(x')\, \mathbb{E}_{x_t \sim p_t}\big[(\hat{L}_t \cdot x')^2\big]\Big] + \frac{\gamma}{n} \sum_{t=1}^T \sum_{j=1}^n L_t \cdot e_j $$
$$ \le\ \sum_t L_t \cdot x + \frac{\ln|D|}{\eta} + \Phi_M(\eta)\, nT + \gamma T, $$
where we have used Lemmas 3.6 and 3.7 in the last step, together with $L_t \cdot e_j \le 1$ for admissible losses.

Setting $\gamma = n^{3/2}/\sqrt{T}$ and $\eta = 1/\sqrt{nT}$ gives $\eta M = \eta n^2/\gamma \le 1$, which implies
$$ \Phi_M(\eta) = \frac{e^{\eta M} - 1 - \eta M}{\eta M^2} \ \le\ \frac{(\eta M)^2}{\eta M^2} = \eta, $$
where the inequality comes from the fact that $e^{a} \le 1 + a + a^2$ for $a \le 1$. With the above, we have
$$ \mathbb{E}\Big[\sum_{t,x'} p_t(x')\, \hat{L}_t \cdot x'\Big] \ \le\ \sum_t L_t \cdot x + \ln|D|\,\sqrt{nT} + 2 n^{3/2} \sqrt{T}. $$
The proof is completed by noting that
$$ \mathbb{E}\Big[\sum_{t,x'} p_t(x')\, \hat{L}_t \cdot x'\Big] = \mathbb{E}\Big[\sum_{t,x'} p_t(x')\, \mathbb{E}[\hat{L}_t \mid H_t] \cdot x'\Big] = \mathbb{E}\Big[\sum_{t,x'} p_t(x')\, L_t \cdot x'\Big] = \mathbb{E}\Big[\sum_t L_t \cdot x_t\Big] $$
is the expected total loss of the algorithm (here $H_t$ denotes the history up to the start of round $t$).

4 Lower Bounds

4.1 With Full Information

We now present a family of distributions which establishes an $\Omega(\sqrt{nT})$ lower bound for i.i.d. loss vectors in the full information setting. In the remainder of the paper, we assume for convenience that the incurred losses are in the interval $[-1,1]$ rather than $[0,1]$. (This changes the bounds by at most a factor of 2.)

Example 4.1. For a given $S \subseteq \{1, \ldots, n\}$ and $0 < \epsilon < 1$, we define a random loss vector $L$ as follows. Choose $i \in \{1, \ldots, n\}$ uniformly at random, and let $\sigma \in \{\pm 1\}$ be $1$ with probability $(1+\epsilon)/2$ and $-1$ otherwise. Set
$$ L = \begin{cases} \sigma e_i & \text{if } i \in S \\ -\sigma e_i & \text{if } i \notin S. \end{cases} $$
Let $D_{S,\epsilon}$ denote the distribution of $L$.
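A direct transcription of this construction into Python (ours) may help in parsing it; note that $\mathbb{E}[L] = \frac{\epsilon}{n}\big(\sum_{i \in S} e_i - \sum_{i \notin S} e_i\big)$, so each coordinate carries a bias of magnitude $\epsilon/n$ whose sign encodes membership in $S$.

```python
import numpy as np

def sample_loss(S, eps, n, rng):
    """Draw one loss vector L from D_{S,eps} as in Example 4.1."""
    i = int(rng.integers(n))                            # uniform coordinate
    sigma = 1.0 if rng.random() < (1 + eps) / 2 else -1.0
    L = np.zeros(n)
    L[i] = sigma if i in S else -sigma                  # flip the sign off S
    return L
```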
Theorem 4.2. Suppose the decision set $D$ is the hypercube $\{-1,1\}^n$. For any full-information linear optimization algorithm $A$ and any positive integer $T$, there exists $S \subseteq \{1, \ldots, n\}$ such that for loss vectors $L_1, \ldots, L_T$ sampled i.i.d. according to $D_{S,\epsilon}$ with $\epsilon = \sqrt{n/T}$, the expected regret of $A$ is $\Omega(\sqrt{nT})$.

Proof sketch. Clearly, for each $S$ and $\epsilon$, the optimal decision vector for loss vectors sampled i.i.d. according to $D_{S,\epsilon}$ is the vector $(x_1, \ldots, x_n)$ where $x_i = -1$ if $i \in S$ and $x_i = 1$ otherwise. Suppose $S$ is chosen uniformly at random. In this case, it is clear that the optimal algorithm chooses the decision $(x_1, \ldots, x_n)$ where, for each $i$, the sign of $x_i$ is the same as the minority of past occurrences of loss vectors $\pm e_i$ (in case of a tie, the value of $x_i$ does not matter). Note that at every time step at which the empirical minority incorrectly predicts the bias of coordinate $i$, the optimal algorithm incurs expected regret $\Omega(\epsilon/n)$. By a standard application of Stirling's estimates, one can show that until coordinate $i$ has been chosen $\Omega(1/\epsilon^2)$ times, the probability that the empirical majority disagrees with the long-run average is $\Omega(1)$. In expectation, this requires $\Omega(n/\epsilon^2)$ time steps. Summing over the $n$ arms, the overall expected regret is thus at least $\Omega(n \cdot (\epsilon/n) \cdot \min\{T, n/\epsilon^2\}) = \Omega(\min\{\epsilon T, n/\epsilon\})$. Setting $\epsilon = \sqrt{n/T}$ yields the desired bound.

4.2 With Bandit Information

Next we prove that the hypercube decision set $\{0,1\}^n$ and the same family of distributions $D_{S,\epsilon}$ can be used to establish an $\Omega(n\sqrt{T})$ lower bound in the bandit setting.

Theorem 4.3. Suppose the decision set $D$ is the hypercube $\{0,1\}^n$. For any bandit linear optimization algorithm $A$ and any positive integer $T$, there exists $S \subseteq \{1, \ldots, n\}$ such that for loss vectors $L_1, \ldots, L_T$ sampled i.i.d. according to $D_{S,\epsilon}$ with $\epsilon = n/\sqrt{T}$, the expected regret is $\Omega(n\sqrt{T})$.

Proof sketch. Again, for each $S$ and $\epsilon$, the optimal decision vector for loss vectors sampled i.i.d. according to $D_{S,\epsilon}$ is just the indicator vector of the complement of $S$. Suppose $S$ is chosen uniformly at random. Unlike the proof of Theorem 4.2, we do not attempt to characterize the optimal algorithm for this setting. Note that, for every $1 \le i \le n$, every time step at which the algorithm incorrectly sets $x_i = \mathbb{1}\{i \in S\}$ contributes $\Omega(\epsilon/n)$ to the expected regret. Let us fix $i \in \{1, \ldots, n\}$ and prove a lower bound on its expected contribution to the total regret. To simplify matters, consider the best algorithm conditioned on the value of $S \setminus \{i\}$. It is not hard to see that the problem of guessing the membership of $i$ in $S$ based on $t$ past measurements can be recast as the problem of deciding between two possible means which differ by $\epsilon/n$, given a sequence of $t$ i.i.d. Bernoulli random variables with one of the two unknown means, where each of the means is a priori equally likely. For this problem, the error probability is $\Omega(1)$ unless $t = \Omega((n/\epsilon)^2)$. Thus we have shown that the expected contribution of coordinate $i$ to the total regret is $\Omega(\min\{T, (n/\epsilon)^2\} \cdot \epsilon/n)$. Summing over the $n$ coordinates gives an overall expected regret of $\Omega(\min\{\epsilon T, n^2/\epsilon\})$. Setting $\epsilon = n/\sqrt{T}$ completes the proof.

References

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: the adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, CA, 1995. Extended version, dated June 8, 1998, available from R. Schapire's website.

B. Awerbuch and R. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the 36th ACM Symposium on Theory of Computing (STOC), 2004.

P. Bartlett, V. Dani, T. P. Hayes, S. M. Kakade, A. Rakhlin, and A. Tewari. High probability regret bounds for online optimization (working title). Manuscript, 2007.

V. Dani and T. P. Hayes. Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In Proceedings of the 17th ACM-SIAM Symposium on Discrete Algorithms (SODA), 2006.

V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic bandit linear optimization (working title). Manuscript, 2007.

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

A. Gyorgy, T. Linder, G. Lugosi, and G. Ottucsak. The on-line shortest path problem under partial monitoring. ArXiv e-prints, 2007.

A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291-307, 2005.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.

H. B. McMahan and A. Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In Proceedings of the 17th Annual Conference on Learning Theory (COLT), 2004.

H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527-535, 1952.

E. Takimoto and M. K. Warmuth. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4:773-818, 2003.