High-Probability Regret Bounds for Bandit Online Linear Optimization

Peter L. Bartlett, UC Berkeley (bartlett@cs.berkeley.edu)
Varsha Dani, University of Chicago (varsha@cs.uchicago.edu)
Thomas P. Hayes, TTI Chicago (hayest@tti-c.org)
Sham M. Kakade, TTI Chicago (sham@tti-c.org)
Alexander Rakhlin, UC Berkeley (rakhlin@cs.berkeley.edu)
Ambuj Tewari, TTI Chicago (tewari@tti-c.org)

PB, AR and AT gratefully acknowledge the support of DARPA under grant FA8750-05-2-0249.

Abstract

We present a modification of the algorithm of Dani et al. [8] for the online linear optimization problem in the bandit setting, which with high probability has regret at most $O^*(\sqrt{T})$ against an adaptive adversary. This improves on the previous algorithm [8], whose regret is bounded in expectation against an oblivious adversary. We obtain the same dependence on the dimension ($n^{3/2}$) as that exhibited by Dani et al. The results of this paper rest firmly on those of [8] and the remarkable technique of Auer et al. [2] for obtaining high-probability bounds via optimistic estimates. This paper answers an open question: it eliminates the gap between the high-probability bounds obtained in the full-information and bandit settings.

1 Introduction

In the online linear optimization problem, there is a fixed decision set $D \subseteq \mathbb{R}^n$, and the player (or decision maker) makes a decision $x_t \in D$ at each time $t \in \{1, \ldots, T\}$. Simultaneously, an adversary chooses a loss vector $L_t \in \mathbb{R}^n$ and the player suffers loss $L_t^\top x_t$. The goal is to minimize the regret, which measures how much worse the player did compared to any fixed decision, even one chosen with complete knowledge of the sequence $L_1, \ldots, L_T$:

$$R = \sum_{t=1}^{T} L_t^\top x_t \;-\; \min_{x \in D} \sum_{t=1}^{T} L_t^\top x.$$

An oblivious adversary chooses the entire sequence $L_1, \ldots, L_T$ in advance of the player's moves. An adaptive adversary can, however, choose $L_t$ based on the player's moves $x_1, \ldots, x_{t-1}$ up to that point.

In the full information version of the problem, the loss vector $L_t$ is revealed to the player at the end of round $t$. For this case, Kalai and Vempala [12] gave an efficient algorithm, assuming that the offline problem (given $L$, minimize $L^\top x$ over $x \in D$) can be solved efficiently. Note that the standard "experts" problem is a special case of this problem, since we can choose the set $D$ to be $\{e_1, \ldots, e_n\}$, the unit vectors forming the standard basis of $\mathbb{R}^n$. Kalai and Vempala separated the issue of the number of available decisions from the dimensionality of the problem and gave an algorithm with expected regret $O(\mathrm{poly}(n)\sqrt{T})$. In many important cases, for example the online shortest path problem [15], the size of the decision set can be exponential in the dimensionality, so it is important to design algorithms that have polynomial dependence on the dimension.

In the partial information or "bandit" version of the problem, the only feedback that the player receives at the end of round $t$ is its own loss $L_t^\top x_t$. The bandit version of the experts problem was considered by Auer et al. [2], who gave a number of algorithms for it. Their Exp3 algorithm achieves $O(\sqrt{T})$ expected regret against oblivious adversaries. However, due to the large variance of the estimates kept by Exp3, it fails to enjoy a similar regret bound with high probability. To address this issue, the authors used the idea of high-confidence upper bounds to derive the Exp3.P algorithm, which achieves $O(\sqrt{T})$ regret with high probability.
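To make the variance issue concrete, here is a minimal Python sketch (our illustration, not from [2]) of the importance-weighted loss estimates that Exp3-style algorithms keep. For simplicity the sampling distribution is held fixed; Exp3 itself updates it multiplicatively.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 10, 100000
p = np.full(n, 1.0 / n)        # a fixed sampling distribution, for illustration
L = rng.uniform(0.0, 1.0, n)   # a fixed loss vector

# Importance-weighted estimate: Lhat[i] = L[i] / p[i] if arm i was played, else 0.
arms = rng.choice(n, size=T, p=p)
Lhat = np.zeros((T, n))
Lhat[np.arange(T), arms] = L[arms] / p[arms]

print("true losses       :", np.round(L, 3))
print("mean of estimates :", np.round(Lhat.mean(axis=0), 3))  # unbiased
# The second moment scales like L[i]^2 / p[i], i.e. with the number of arms
# here, even though the losses themselves lie in [0, 1].
print("variance, arm 0   : %.2f" % Lhat[:, 0].var())
```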
The regret of these algorithms also has a $\sqrt{|D|}$ dependence on the number $|D|$ of available actions. Hence, they cannot be used directly if $|D|$ is large.

Awerbuch and Kleinberg [4] were the first to consider the general online linear optimization problem in the bandit setting. For oblivious adversaries, they proved a regret bound of $O^*(\mathrm{poly}(n)\,T^{2/3})$ (here and below, the $O^*$ notation hides logarithmic factors). The case of a general adaptive adversary was handled by McMahan and Blum [14], but they could only prove a regret bound of $O^*(\mathrm{poly}(n)\,T^{3/4})$. Dani and Hayes [7] later showed that McMahan and Blum's algorithm actually enjoys a regret bound of $O^*(\mathrm{poly}(n)\,T^{2/3})$. However, the known lower bound for the bandit problem was the same as that in the full information case, namely $\Omega(\sqrt{T})$. Therefore, it was an important open question whether there is an algorithm with a regret bound of $O(\mathrm{poly}(n)\sqrt{T})$ for the bandit online linear optimization problem. An affirmative answer was recently given by Dani et al. [8]. Their algorithm has expected regret at most $O^*(\mathrm{poly}(n)\sqrt{T})$ against an oblivious adversary. It was still not known whether the same bounds could be achieved with high probability, and against adaptive adversaries as well. In this paper, we show how to do this by combining Dani et al.'s techniques with those of Auer et al. [2]. Like Exp3.P, our GeometricHedge.P algorithm keeps biased estimates of the losses of different actions such that, with high probability, the sums of these estimates are lower bounds (because we use losses, not gains) on the actual unknown cumulative losses (Lemma 5).

The bandit version of the online shortest path problem has recently received a lot of attention. It can be used to model, for example, routing in ad hoc wireless networks. If we want to make our routing algorithm secure against adversarial attacks, it is necessary to design algorithms that work against adaptive adversaries [3, 13]. Therefore, obtaining low regret against adaptive adversaries is not only an important theoretical problem but also has practical implications. The algorithm with the best regret guarantee so far is by György et al. [11]. There the authors consider a number of feedback models. Our feedback model in this paper corresponds to what they call the "path-bandit" model. For this model, they give an efficient algorithm, specially designed for the bandit online shortest path problem, that achieves $O^*(\mathrm{poly}(n)\,T^{2/3})$ regret with high probability against an adaptive adversary, where $n$ is the number of edges in the graph. Our results imply that it is actually possible to achieve $O^*(n^{3/2}\sqrt{T})$ regret with high probability. However, since our algorithm is not efficient, the quest for an efficient algorithm with the same regret, even for this special problem, is still on.

The key tools from probability theory that we use in our proofs are Bernstein-type inequalities, such as Freedman's. These provide sharper concentration bounds for martingales in the presence of variance information. There is a simple corollary of Freedman's inequality that we think is useful not just in our setting but more generally. We state it as Lemma 2 in Section 4.

The present work closes the gap between full information and bandit online optimization against the adaptive adversary in terms of the growth of regret with $T$. As we said above, our algorithm is not necessarily efficient, because the decision space might need to be discretized to a fine level.
We mention that a parallel work by Abernethy, Hazan, and Rakhlin [1] provides an efficient algorithm for this setting; however, their result holds in expectation only (against an oblivious adversary). The present paper and [1] address disparate aspects of the problem, and neither result can be concluded from the other. It remains an open question whether there exists an efficient algorithm which enjoys high-probability bounds on the regret.

2 Preliminaries

Let $D \subseteq [-1,1]^n$ denote the decision space. At each of $T$ time steps, the environment selects a cost vector $L_t \in \mathbb{R}^n$, and simultaneously the player (decision maker) selects $x_t \in D$. The loss incurred by the decision maker for this prediction is $L_t^\top x_t$. Let

$$L_{\min} := \min_{x \in D} \sum_{t=1}^{T} L_t^\top x$$

be the loss of the best single decision in hindsight. The goal of the decision maker is to minimize the regret,

$$R = \sum_{t=1}^{T} L_t^\top x_t - L_{\min}.$$

We assume that $L_t^\top x \in [0,1]$ for all $x \in D$. We also assume that the environment is adaptive, i.e., the cost vector $L_t$ selected by the environment at time $t$ may depend arbitrarily on the history $(L_1, x_1, \ldots, L_{t-1}, x_{t-1})$ (note that without loss of generality this dependence may be assumed to be deterministic). We show that even against such a powerful environment, it is possible to ensure that $R$ is small with high probability.

As in [8], we will require a barycentric spanner for $D$. Recall that a barycentric spanner for $D$ is a set $\{y_1, \ldots, y_n\} \subseteq D$ such that every $x \in D$ can be written as a linear combination of the $y_i$'s with coefficients in $[-1,1]$. A $c$-barycentric spanner is defined similarly, where we allow coefficients in $[-c,c]$. For $c > 1$, $c$-barycentric spanners for $D$ may be found efficiently (see [4]; a sketch of the procedure appears at the end of this section). However, for ease of exposition we will assume that we have an actual barycentric spanner. (Using a $c$-barycentric spanner instead will only affect the constants.)

Finally, if the set $D$ is too large (for example, if it is infinite), we can replace it by a cover of size at most $(4nT)^{n/2}$, as the loss of the optimal decision in this cover is within an additive $\sqrt{nT}$ of the optimal loss in $D$; see [8, Lemma 3.1] for details. Accordingly, after performing this transformation if necessary, we may assume that $D$ is finite and $\ln|D| = O(n \ln T)$. Only the logarithm of the cardinality of the set will enter our bounds.
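For concreteness, here is a minimal Python sketch of the determinant-swapping procedure of Awerbuch and Kleinberg [4] for computing a $c$-barycentric spanner of a finite set. The function name and interface are our own, and the code assumes $D$ spans $\mathbb{R}^n$; it is an illustration rather than an optimized implementation.

```python
import numpy as np

def barycentric_spanner(D, c=2.0):
    """Sketch of the determinant-swapping procedure of Awerbuch and
    Kleinberg [4] for a c-barycentric spanner of a finite D spanning R^n.
    D: (m, n) array of decisions. Returns indices of the n chosen rows."""
    D = np.asarray(D, dtype=float)
    m, n = D.shape
    # Greedily pick n linearly independent rows of D as a starting basis.
    idx = []
    for j in range(m):
        if np.linalg.matrix_rank(D[idx + [j]]) == len(idx) + 1:
            idx.append(j)
        if len(idx) == n:
            break
    X = D[idx].copy()
    # Swap in any point that grows |det| by a factor > c.  For c > 1 the
    # number of swaps is polynomial, which is what makes this efficient.
    improved = True
    while improved:
        improved = False
        for i in range(n):
            for j in range(m):
                Y = X.copy()
                Y[i] = D[j]
                if abs(np.linalg.det(Y)) > c * abs(np.linalg.det(X)):
                    X, idx[i], improved = Y, j, True
    return idx

# Usage on the corners of the cube [-1, 1]^3:
pts = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)])
print(barycentric_spanner(pts))
```

At termination, expressing any $x \in D$ in the chosen basis gives coefficients in $[-c,c]$ (by Cramer's rule, a larger coefficient would exhibit a determinant-increasing swap).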
3 Algorithm and Main Result

The algorithm presented below is a modification of the algorithm in [8]. Note that the difference is in the way we update the weights $w_t$: we use lower confidence intervals. This idea of using confidence intervals is motivated by the Exp3.P algorithm of Auer et al. [2]. Feeding confidence bounds, as opposed to unbiased estimates of the losses, into the exponential updates is the crucial change we make to the algorithm of Dani et al. [8]. Lemma 5 below shows that, with high probability, for any $x \in D$, $\sum_t \tilde{L}_t(x)$ lower bounds $\sum_t L_t^\top x$ (up to an additive $O(\sqrt{T})$ term). Our algorithm reduces to Exp3.P in the special case of the $n$-armed bandit problem (when $D = \{e_1, \ldots, e_n\}$). As we point out in the next section, Auer et al.'s proof can be simplified by using the simple corollary of Freedman's inequality [10] that we state as Lemma 2 below.

Algorithm 3.1: GeometricHedge.P$(D, \gamma, \eta, \delta')$

  $\forall x \in D$, $w_1(x) := 1$
  $W_1 := |D|$
  for $t = 1$ to $T$:
    $\forall x \in D$, $p_t(x) := (1-\gamma)\,\frac{w_t(x)}{W_t} + \frac{\gamma}{n}\,\mathbb{I}\{x \in \text{spanner}\}$
    Sample $x_t$ according to the distribution $p_t$
    Incur and observe the loss $\ell_t := L_t^\top x_t$
    $C_t := \mathbb{E}_{p_t}[x x^\top]$
    $\hat{L}_t := \ell_t\, C_t^{-1} x_t$
    $\forall x \in D$, $\tilde{L}_t(x) := \hat{L}_t^\top x - 2\, x^\top C_t^{-1} x \sqrt{\frac{\ln(1/\delta')}{nT}}$
    $\forall x \in D$, $w_{t+1}(x) := w_t(x)\, \exp\{-\eta \tilde{L}_t(x)\}$
    $W_{t+1} := \sum_{x \in D} w_{t+1}(x)$

The main result of this paper is the following guarantee on the algorithm.

Theorem 1. Let $T \geq 4$ and $n \geq 2$. Let $\gamma = \frac{n^{3/2}}{\sqrt{T}} \leq \frac{1}{2}$, $\delta' = \frac{\delta}{|D| \log_2 T}$, and $\delta < \frac{1}{e}$. If we set $\eta = \frac{1}{\sqrt{nT} + 2\sqrt{\ln(1/\delta')}}$, then against any adaptive adversary, with probability at least $1 - 4\delta$,

$$R = O\!\left(n^{3/2}\sqrt{T \ln(nT/\delta)}\right).$$

The dependence on $T$ is optimal (up to logarithmic factors). We get the same dependence on $n$ as Dani et al. [8]. The known lower bound for this problem is $\Omega(n\sqrt{T})$ [8]. Recently, $O^*(n\sqrt{T})$ regret bounds have been obtained for the stochastic version of the problem [9]. This leads us to conjecture that the lower bound is tight, and it remains an open question to close the gap (in the dependence on $n$) between the upper and lower bounds.

We also note here that although the analysis we provide is for losses, essentially the same algorithm, with a similar analysis, works for gains. We just have to make a few obvious changes to the algorithm: instead of subtracting, we add the correction term to the gain estimates, and we replace $-\eta$ with $\eta$ in the exponential update.
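The following Python sketch mirrors Algorithm 3.1 line by line. The function signature, the `loss_fn` callback, and the final weight rescaling are our own scaffolding (the rescaling leaves $p_t$ unchanged and only guards against floating-point underflow); the updates themselves follow the pseudocode above.

```python
import numpy as np

def geometric_hedge_p(D, spanner_idx, T, gamma, eta, delta_prime, loss_fn,
                      rng=np.random.default_rng()):
    """Sketch of GeometricHedge.P (Algorithm 3.1).

    D           : (m, n) array whose rows are the decisions.
    spanner_idx : indices into D forming a barycentric spanner.
    loss_fn     : callback loss_fn(t, x) returning L_t . x in [0, 1];
                  it may depend adaptively on the history.
    """
    m, n = D.shape
    w = np.ones(m)
    conf = np.sqrt(np.log(1.0 / delta_prime) / (n * T))
    total = 0.0
    for t in range(T):
        p = (1.0 - gamma) * w / w.sum()
        p[spanner_idx] += gamma / n              # forced spanner exploration
        i = rng.choice(m, p=p)
        x_t = D[i]
        ell = loss_fn(t, x_t)                    # only L_t . x_t is observed
        total += ell
        C = (D * p[:, None]).T @ D               # C_t = E_{p_t}[x x^T]
        Cinv = np.linalg.inv(C)
        Lhat = ell * (Cinv @ x_t)                # estimate with E_t[Lhat] = L_t
        quad = np.einsum('ij,jk,ik->i', D, Cinv, D)   # x^T C_t^{-1} x per row
        Ltilde = D @ Lhat - 2.0 * quad * conf    # lower-confidence estimates
        w = w * np.exp(-eta * Ltilde)
        w = w / w.max()                          # rescale; p_t is unaffected
    return total
```

Note that the loop manipulates one weight per element of $D$; this explicit enumeration is exactly why the algorithm is not computationally efficient when $D$ is exponentially large.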
4 Concentration for Martingales

In this section we derive a concentration inequality for martingale difference sequences. It is a direct application of Freedman's inequality.

Lemma 2. Suppose $X_1, \ldots, X_T$ is a martingale difference sequence with $|X_t| \leq b$. Let

$$\mathrm{Var}_t X_t = \mathrm{Var}(X_t \mid X_1, \ldots, X_{t-1}),$$

let $V = \sum_{t=1}^T \mathrm{Var}_t X_t$ be the sum of conditional variances of the $X_t$'s, and let $\sigma = \sqrt{V}$. Then, for any $\delta < 1/e$ and $T \geq 4$, we have

$$\mathrm{Prob}\left(\sum_{t=1}^T X_t > 2\max\left\{2\sigma,\; b\sqrt{\ln(1/\delta)}\right\}\sqrt{\ln(1/\delta)}\right) \leq \log_2(T)\,\delta.$$

This inequality says that, roughly speaking, $\sum_t X_t$ is of the order of $\sigma\sqrt{\ln(1/\delta)}$, which is a central-limit-theorem-like behavior, except that here $\sigma$ is not fixed but is the actual sum of conditional variances, a random quantity. The overall constant in front of $\sigma$ is 4. This can be improved to 2 by a slightly more careful analysis.

We already know of two instances in the literature where Lemma 2 can be used to give shorter proofs of certain probabilistic upper bounds.

1. The first is in the proof of Exp3.P's regret bound itself. To show that the estimates are upper bounds on the actual losses of an action, the authors explicitly use the exponential moment method in the proof of their Lemma 6.1. Essentially the same lemma can be proved by a direct application of the lemma above.

2. The other instance is in Cesa-Bianchi and Gentile's paper [5] on online-to-batch conversions. When an online algorithm is run on i.i.d. data with a non-negative and bounded loss function, the conditional variance of the loss at time $t$ can immediately be bounded by the risk of the hypothesis at time $t-1$. The authors use this fact, along with an application of Freedman's inequality, to prove a sharp upper bound (Proposition 2 in their paper) on the average risk of the hypotheses generated by the online algorithm in terms of its actual cumulative loss. The same result can be quickly derived by an application of the lemma above.

Proof: Note that a crude upper bound on $\mathrm{Var}_t X_t$ is $b^2$. Thus, $\sigma \leq b\sqrt{T}$. We choose a discretization $0 = \alpha_{-1} < \alpha_0 < \cdots < \alpha_l$ such that $\alpha_{i+1} = 2\alpha_i$ for $i \geq 0$ and $\alpha_l \geq b\sqrt{T}$. We will specify the choice of $\alpha_0$ shortly. We then have

$$\mathrm{Prob}\left(\sum_t X_t > 2\max\{2\sigma, \alpha_0\}\sqrt{\ln(1/\delta)}\right) \leq \sum_{j=0}^{l} \mathrm{Prob}\left(\sum_t X_t > 2\max\{2\sigma, \alpha_0\}\sqrt{\ln(1/\delta)}\ \&\ \alpha_{j-1} < \sigma \leq \alpha_j\right)$$
$$\leq \sum_{j=0}^{l} \mathrm{Prob}\left(\sum_t X_t > 2\alpha_j\sqrt{\ln(1/\delta)}\ \&\ V \leq \alpha_j^2\right)$$
$$\stackrel{(*)}{\leq} \sum_{j=0}^{l} \exp\left(\frac{-4\alpha_j^2\ln(1/\delta)}{2\alpha_j^2 + \frac{4}{3}\alpha_j b\sqrt{\ln(1/\delta)}}\right) = \sum_{j=0}^{l} \exp\left(\frac{-2\alpha_j\ln(1/\delta)}{\alpha_j + \frac{2}{3}b\sqrt{\ln(1/\delta)}}\right),$$

where the inequality $(*)$ follows from Freedman's inequality (Theorem 9). If we now choose $\alpha_0 = b\sqrt{\ln(1/\delta)}$, then $\alpha_j \geq b\sqrt{\ln(1/\delta)}$ for all $j$, and hence every term in the above summation is bounded by

$$\exp\left(\frac{-2\ln(1/\delta)}{1 + 2/3}\right) \leq \delta.$$

Choosing $l = \lceil\log_2(\sqrt{T})\rceil$ ensures that $\alpha_l \geq b\sqrt{T}$ (here we use $\ln(1/\delta) \geq 1$, which holds since $\delta < 1/e$). Thus we have

$$\mathrm{Prob}\left(\sum_t X_t > 2\max\left\{2\sigma,\; b\sqrt{\ln(1/\delta)}\right\}\sqrt{\ln(1/\delta)}\right) \leq (l+1)\,\delta = \left(\lceil\log_2(\sqrt{T})\rceil + 1\right)\delta \leq \log_2(T)\,\delta. \qquad \blacksquare$$
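As a quick sanity check on Lemma 2 (ours, not part of the paper), the following Python snippet simulates a martingale difference sequence whose conditional variance adapts to the history, and compares the empirical exceedance frequency of the bound with $\log_2(T)\,\delta$.

```python
import numpy as np

rng = np.random.default_rng(1)
T, trials, delta, b = 1000, 500, 0.01, 1.0
lg = np.log(1.0 / delta)

exceed = 0
for _ in range(trials):
    total, V, scale = 0.0, 0.0, 1.0
    for t in range(T):
        x = scale * rng.choice((-1.0, 1.0))   # X_t = scale * (+/-1), |X_t| <= b
        total += x
        V += scale ** 2                       # conditional variance of X_t
        scale = min(b, 0.5 + abs(x) / 2.0)    # variance adapted to the history
    bound = 2.0 * max(2.0 * np.sqrt(V), b * np.sqrt(lg)) * np.sqrt(lg)
    exceed += (total > bound)

print("empirical exceedance %.4f  <=  log2(T)*delta = %.4f"
      % (exceed / trials, np.log2(T) * delta))
```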
5 Analysis

The remainder of the paper is devoted to the proof of Theorem 1. We first state several results obtained in Dani et al. [8] which will be important in our proofs.

Lemma 3. For any $x \in D$ and $t \in \{1, \ldots, T\}$, it holds that:

1. $|\hat{L}_t^\top x| \leq n^2/\gamma$;
2. $x^\top C_t^{-1} x \leq n^2/\gamma$;
3. $\sum_{x \in D} p_t(x)\, x^\top C_t^{-1} x = n$;
4. $\mathbb{E}_t\big[(\hat{L}_t^\top x)^2\big] \leq x^\top C_t^{-1} x$.

5.1 High Confidence Bounds

Let $\mathbb{E}_t[\cdot]$ denote $\mathbb{E}[\cdot \mid x_1, \ldots, x_{t-1}]$. Since we are considering adaptive (but deterministic) adversaries, $L_t$ is not random given $x_1, \ldots, x_{t-1}$. Observe that $\mathbb{E}_t[x_t x_t^\top] = \mathbb{E}_{x \sim p_t}[x x^\top] = C_t$, and thus $\mathbb{E}_t[\hat{L}_t] = L_t$. However, the fluctuations of the random variable $\hat{L}_t$ are very large. The following lemma provides a bound on these fluctuations.

Lemma 5. Assume $T \geq 4$ and let $\delta' = \frac{\delta}{|D|\log_2 T}$. Then with probability at least $1-\delta$, simultaneously for all $x \in D$,

$$\sum_{t=1}^T \tilde{L}_t(x) \leq \sum_{t=1}^T L_t^\top x + 2\left(1 + \sqrt{nT}\right)\sqrt{\ln(1/\delta')}.$$

Proof: Fix $x \in D$. Let $M_t = M_t(x) = \hat{L}_t^\top x - L_t^\top x$. Then $(M_t)$ is a martingale difference sequence. Using Lemma 3, $|M_t| \leq n^2/\gamma + 1 = \sqrt{nT} + 1$. Let $V = \sum_t \mathrm{Var}_t(M_t)$ and let $\sigma = \sqrt{V}$. Using Lemma 2, we have that with probability at least $1 - \delta' \log_2 T$,

$$\sum_t \hat{L}_t^\top x \;\leq\; \sum_t L_t^\top x + 2\max\left\{2\sigma,\ \left(1+\sqrt{nT}\right)\sqrt{\ln(1/\delta')}\right\}\sqrt{\ln(1/\delta')}. \qquad (1)$$

Now note that $\mathrm{Var}_t M_t \leq \mathbb{E}_t[(\hat{L}_t^\top x)^2] \leq x^\top C_t^{-1} x$ by Lemma 3, so that

$$4\sigma\sqrt{\ln(1/\delta')} \;\leq\; 4\sqrt{\left(\sum_t x^\top C_t^{-1} x\right)\ln(1/\delta')} \;\leq\; 2\sum_t x^\top C_t^{-1}x \sqrt{\frac{\ln(1/\delta')}{nT}} + 2\sqrt{nT\ln(1/\delta')}$$

by the arithmetic mean-geometric mean inequality. Substituting this into (1) and rearranging (using the fact that $\max\{a+b, c\} \leq a + \max\{b,c\}$ if $a \geq 0$), we have

$$\sum_t \tilde{L}_t(x) \;=\; \sum_t \hat{L}_t^\top x - 2\sum_t x^\top C_t^{-1}x\sqrt{\frac{\ln(1/\delta')}{nT}} \;\leq\; \sum_t L_t^\top x + 2\left(1+\sqrt{nT}\right)\sqrt{\ln(1/\delta')}$$

with probability at least $1 - \delta'\log_2 T$. Finally, taking a union bound over all $x \in D$ gives the required result. $\blacksquare$

We now prove a bound on the perturbed estimated costs $\tilde{L}_t$, which are used to update the distribution.

Lemma 4. For all $x \in D$, $|\tilde{L}_t(x)| \leq \sqrt{nT} + 2\sqrt{\ln(1/\delta')}$.

Proof: For each $x \in D$,

$$|\tilde{L}_t(x)| \;\leq\; |\hat{L}_t^\top x| + 2\,x^\top C_t^{-1}x\sqrt{\frac{\ln(1/\delta')}{nT}} \;\leq\; \frac{n^2}{\gamma} + 2\,\frac{n^2}{\gamma}\sqrt{\frac{\ln(1/\delta')}{nT}} \;=\; \sqrt{nT} + 2\sqrt{\ln(1/\delta')},$$

using Lemma 3 and the choice $\gamma = n^{3/2}/\sqrt{T}$. $\blacksquare$
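Before turning to the potential-function argument, here is a minimal numerical check (our illustration, not from the paper) of the unbiasedness claim $\mathbb{E}_t[\hat{L}_t] = L_t$ made at the start of this subsection; the particular decision set, distribution, and loss vector are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, N = 3, 6, 200000
D = rng.uniform(-1.0, 1.0, size=(m, n))   # a small decision set in [-1,1]^n
p = rng.dirichlet(np.ones(m))             # an arbitrary full-support distribution
L = rng.uniform(0.0, 1.0, n) / n          # a loss vector with L . x bounded

C = (D * p[:, None]).T @ D                # C = E_p[x x^T]
Cinv = np.linalg.inv(C)

idx = rng.choice(m, size=N, p=p)          # N independent draws x ~ p
X = D[idx]
ell = X @ L                               # observed losses L . x
Lhat_mean = (ell[:, None] * (X @ Cinv)).mean(axis=0)

print("L          :", np.round(L, 4))
print("mean(Lhat) :", np.round(Lhat_mean, 4))  # agrees up to Monte Carlo error
```

The check works because $\mathbb{E}[\ell\, C^{-1}x] = C^{-1}\,\mathbb{E}[x x^\top]\, L = L$.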
5.2 Potential Function Analysis

By Lemma 4 and our choice of $\eta = \frac{1}{\sqrt{nT} + 2\sqrt{\ln(1/\delta')}}$, we have $\eta|\tilde{L}_t(x)| \leq 1$. In the following computation, we will use the fact that $e^{-a} \leq 1 - a + a^2$ whenever $|a| \leq 1$:

$$\frac{W_{t+1}}{W_t} = \sum_{x \in D} \frac{w_t(x)}{W_t}\exp(-\eta\tilde{L}_t(x)) \leq \sum_{x \in D}\frac{w_t(x)}{W_t}\left(1 - \eta\tilde{L}_t(x) + \eta^2\tilde{L}_t(x)^2\right)$$
$$\leq 1 + \frac{1}{1-\gamma}\left[-\eta\sum_{x\in D}p_t(x)\tilde{L}_t(x) + \frac{\gamma\eta}{n}\sum_{x\in\text{spanner}}\tilde{L}_t(x) + \eta^2\sum_{x\in D}p_t(x)\tilde{L}_t(x)^2\right],$$

since, by the definition of $p_t$,

$$\frac{w_t(x)}{W_t} = \frac{p_t(x) - \frac{\gamma}{n}\mathbb{I}\{x\in\text{spanner}\}}{1-\gamma} \leq \frac{p_t(x)}{1-\gamma}.$$

Note that we have

$$-\sum_{x\in D}p_t(x)\tilde{L}_t(x) = -\sum_{x\in D}p_t(x)\hat{L}_t^\top x + 2\sqrt{\frac{\ln(1/\delta')}{nT}}\sum_{x\in D}p_t(x)\,x^\top C_t^{-1}x = -\sum_{x\in D}p_t(x)\hat{L}_t^\top x + 2\sqrt{\frac{n\ln(1/\delta')}{T}},$$

where the last step is by Lemma 3. Further, since $(b+c)^2 \leq 2(b^2+c^2)$ for every $b, c$, applying the definition of $\tilde{L}_t(x)$ we also have

$$\sum_{x\in D}p_t(x)\tilde{L}_t(x)^2 \leq 2\sum_{x\in D}p_t(x)\left[(\hat{L}_t^\top x)^2 + \left(2x^\top C_t^{-1}x\right)^2\frac{\ln(1/\delta')}{nT}\right] \leq 2\sum_{x\in D}p_t(x)(\hat{L}_t^\top x)^2 + \frac{8\ln(1/\delta')}{nT}\cdot\frac{n^2}{\gamma}\sum_{x\in D}p_t(x)\,x^\top C_t^{-1}x$$
$$= 2\sum_{x\in D}p_t(x)(\hat{L}_t^\top x)^2 + 8\sqrt{\frac{n}{T}}\ln(1/\delta'),$$

by successive applications of Lemma 3. Putting these together, taking logs, using the fact that $\ln(1+x) \leq x$, and summing over $t$, we have

$$\ln\frac{W_{T+1}}{W_1} \leq \frac{1}{1-\gamma}\Bigg[-\eta\sum_{t=1}^T\sum_{x\in D}p_t(x)\hat{L}_t^\top x + 2\eta\sqrt{nT\ln(1/\delta')} + \frac{\gamma\eta}{n}\sum_{t=1}^T\sum_{x\in\text{spanner}}\tilde{L}_t(x) + 2\eta^2\sum_{t=1}^T\sum_{x\in D}p_t(x)(\hat{L}_t^\top x)^2 + 8\eta^2\sqrt{nT}\ln(1/\delta')\Bigg]. \qquad (2)$$

The next three lemmas bound the three summations that appear on the right-hand side above.

Lemma 6. With probability at least $1-\delta$,

$$\sum_{t=1}^T L_t^\top x_t - \sum_{t=1}^T\sum_{x\in D}p_t(x)\hat{L}_t^\top x \;\leq\; (\sqrt{n}+1)\sqrt{2T\ln(1/\delta)} + \frac{4}{3}\left(\frac{n^2}{\gamma}+1\right)\ln(1/\delta).$$

Proof: Let us define $\bar{x}_t := \mathbb{E}_{p_t}x = \sum_{x\in D}p_t(x)x$ and $Y_t := \ell_t - \hat{L}_t^\top\bar{x}_t$. Note that $\mathbb{E}_t[\hat{L}_t^\top\bar{x}_t] = L_t^\top\bar{x}_t = \mathbb{E}_t[\ell_t]$, and therefore $(Y_t)$ is a martingale difference sequence. We bound the conditional variance of $Y_t$ as follows:

$$\mathrm{Var}_t\, Y_t = \mathbb{E}_t(Y_t^2) \leq \mathbb{E}_t\left(\hat{L}_t^\top\bar{x}_t\right)^2 + 2\,\mathbb{E}_t\left|\hat{L}_t^\top\bar{x}_t\right| + 1 \qquad \text{(since } |\ell_t|\leq 1\text{)}$$
$$\leq \bar{x}_t^\top C_t^{-1}\bar{x}_t + 2\sqrt{\bar{x}_t^\top C_t^{-1}\bar{x}_t} + 1 \qquad \text{(by Cauchy-Schwarz and Lemma 3)}$$
$$= \left(\sqrt{\bar{x}_t^\top C_t^{-1}\bar{x}_t} + 1\right)^2 \leq (\sqrt{n}+1)^2,$$

where the last step holds because $\bar{x}_t^\top C_t^{-1}\bar{x}_t \leq \mathbb{E}_{x\sim p_t}[x^\top C_t^{-1}x] = n$ by Jensen's inequality and Lemma 3. Moreover, $|Y_t| \leq n^2/\gamma + 1$ by Lemma 3. Applying Bernstein's inequality for martingale differences (see the Appendix) to the sequence $(Y_t)$, we obtain that with probability at least $1-\delta$,

$$\sum_{t=1}^T Y_t \leq (\sqrt{n}+1)\sqrt{2T\ln(1/\delta)} + \frac{4}{3}\left(\frac{n^2}{\gamma}+1\right)\ln(1/\delta),$$

which is the desired bound. $\blacksquare$

Lemma 7. With probability at least $1-\delta$,

$$\frac{\gamma}{n}\sum_{t=1}^T\sum_{x\in\text{spanner}}\tilde{L}_t(x) \;\leq\; \gamma\left(T + 2\left(1+\sqrt{nT}\right)\sqrt{\ln(1/\delta')}\right).$$

Proof: Using Lemma 5, with probability at least $1-\delta$, we have, for all $x$ in the spanner,

$$\frac{\gamma}{n}\sum_t\tilde{L}_t(x) \;\leq\; \frac{\gamma}{n}\sum_t L_t^\top x + \frac{2\gamma}{n}\left(1+\sqrt{nT}\right)\sqrt{\ln(1/\delta')} \;\leq\; \frac{\gamma T}{n} + \frac{2\gamma}{n}\left(1+\sqrt{nT}\right)\sqrt{\ln(1/\delta')},$$

because $L_t^\top x$, being the loss of an element of the spanner, is bounded by 1. Summing over the $n$ elements of the spanner, we get the desired bound. $\blacksquare$

Lemma 8. With probability at least $1-\delta$,

$$\sum_{t=1}^T\sum_{x\in D}p_t(x)(\hat{L}_t^\top x)^2 \;\leq\; nT + T\sqrt{2n\ln(1/\delta)}.$$

Proof: First we observe that for $1 \leq t \leq T$,

$$\sum_{x\in D}p_t(x)(\hat{L}_t^\top x)^2 = \hat{L}_t^\top\left(\sum_{x\in D}p_t(x)\,xx^\top\right)\hat{L}_t = \hat{L}_t^\top C_t\hat{L}_t = \ell_t^2\, x_t^\top C_t^{-1}C_tC_t^{-1}x_t = \ell_t^2\, x_t^\top C_t^{-1}x_t \leq x_t^\top C_t^{-1}x_t.$$

Summing over $t$,

$$\sum_{t=1}^T\sum_{x\in D}p_t(x)(\hat{L}_t^\top x)^2 \;\leq\; \sum_{t=1}^T x_t^\top C_t^{-1}x_t.$$

Lemma 3 tells us that, on the one hand, the summands $x_t^\top C_t^{-1}x_t$ are uniformly bounded by $n^2/\gamma = \sqrt{nT}$ and, on the other hand, that each one has expectation $n$, even conditioned on the previous ones. Applying the Hoeffding-Azuma inequality to the martingale difference sequence $x_t^\top C_t^{-1}x_t - \mathbb{E}_{x\sim p_t}[x^\top C_t^{-1}x]$, it follows that, with probability at least $1-\delta$,

$$\sum_{t=1}^T x_t^\top C_t^{-1}x_t \;\leq\; nT + T\sqrt{2n\ln(1/\delta)}. \qquad \blacksquare$$

Substituting the bounds of Lemmas 6, 7 and 8 into (2), we obtain that with probability at least $1-3\delta$,

$$\ln\frac{W_{T+1}}{W_1} \leq \frac{1}{1-\gamma}\Bigg[-\eta\sum_{t=1}^T L_t^\top x_t + \eta(\sqrt{n}+1)\sqrt{2T\ln(1/\delta)} + \frac{4\eta}{3}\left(\frac{n^2}{\gamma}+1\right)\ln(1/\delta) + 2\eta\sqrt{nT\ln(1/\delta')} + \gamma\eta\left(T + 2(1+\sqrt{nT})\sqrt{\ln(1/\delta')}\right) + 2\eta^2\left(nT + T\sqrt{2n\ln(1/\delta)}\right) + 8\eta^2\sqrt{nT}\ln(1/\delta')\Bigg]. \qquad (3)$$

On the other hand, using Lemma 5, we have with probability at least $1-\delta$, for all $x \in D$,

$$\ln\frac{W_{T+1}}{W_1} \geq \ln\frac{w_{T+1}(x)}{W_1} = -\eta\sum_{t=1}^T\tilde{L}_t(x) - \ln|D| \geq -\eta\sum_{t=1}^T L_t^\top x - 2\eta\left(1+\sqrt{nT}\right)\sqrt{\ln(1/\delta')} - \ln|D|. \qquad (4)$$

Combining (3) with (4), we have that with probability at least $1-4\delta$, for every $x \in D$,

$$\sum_{t=1}^T L_t^\top x_t \leq \sum_{t=1}^T L_t^\top x + 2\left(1+\sqrt{nT}\right)\sqrt{\ln(1/\delta')} + \frac{\ln|D|}{\eta} + (\sqrt{n}+1)\sqrt{2T\ln(1/\delta)} + \frac{4}{3}\left(\frac{n^2}{\gamma}+1\right)\ln(1/\delta) + \gamma\left(T + 2(1+\sqrt{nT})\sqrt{\ln(1/\delta')}\right) + 2\eta\left(nT + T\sqrt{2n\ln(1/\delta)}\right) + 8\eta\sqrt{nT}\ln(1/\delta').$$

Recall that $\eta = \frac{1}{\sqrt{nT}+2\sqrt{\ln(1/\delta')}}$, $\gamma = \frac{n^{3/2}}{\sqrt{T}}$, $\delta' = \frac{\delta}{|D|\log_2 T}$, and $\ln|D| = O(n\ln T)$. Plugging in these values yields

$$\sum_{t=1}^T L_t^\top x_t \;\leq\; L_{\min} + O\!\left(n^{3/2}\sqrt{T\ln(nT/\delta)}\right),$$

completing the proof of Theorem 1.

6 Conclusions and Open Problems

We presented an algorithm that achieves the desired regret bound of $O^*(\sqrt{T})$ with high probability. However, the quest for an efficient algorithm with the same high-probability guarantee, even for the special case of bandit online shortest paths, is still open. Achieving similar results for general convex functions is also an intriguing open question.

A Concentration Inequalities

The following inequalities are well known. Theorem 9 is from [10]. Lemmas 10 and 11 can be found, for instance, in [6], Appendix A.

Theorem 9 (Freedman). Suppose $X_1, \ldots, X_T$ is a martingale difference sequence, and $b$ is a uniform upper bound on the steps $X_i$. Let $V$ denote the sum of conditional variances,

$$V = \sum_{i=1}^{T}\mathrm{Var}(X_i \mid X_1, \ldots, X_{i-1}).$$

Then, for every $a, v > 0$,

$$\mathrm{Prob}\left(\sum_i X_i \geq a \ \text{and}\ V \leq v\right) \leq \exp\left(\frac{-a^2}{2v + 2ab/3}\right).$$

Lemma 10 (Bernstein's inequality for martingales). Let $Y_1, \ldots, Y_T$ be a martingale difference sequence. Suppose that $Y_t \in [a,b]$ and $\mathbb{E}[Y_t^2 \mid X_{t-1}, \ldots, X_1] \leq v$ a.s. for all $t \in \{1, \ldots, T\}$. Then for all $\delta > 0$,

$$\Pr\left(\sum_{t=1}^T Y_t > \sqrt{2Tv\ln(1/\delta)} + \frac{2\ln(1/\delta)(b-a)}{3}\right) \leq \delta.$$

Lemma 11 (Hoeffding-Azuma inequality). Let $Y_1, \ldots, Y_T$ be a martingale difference sequence. Suppose that $|Y_t| \leq c$ almost surely for all $t \in \{1, \ldots, T\}$. Then for all $\delta > 0$,

$$\Pr\left(\sum_{t=1}^T Y_t > \sqrt{2Tc^2\ln(1/\delta)}\right) \leq \delta.$$

References

[1] Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), 2008. To appear.
[2] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2003.
[3] Baruch Awerbuch, David Holmer, Herb Rubens, and Robert Kleinberg. Provably competitive adaptive routing. In Proceedings of the 31st IEEE INFOCOM, volume 1, pages 631-641, 2005.
[4] Baruch Awerbuch and Robert Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the 36th ACM Symposium on Theory of Computing (STOC), 2004.
[5] Nicolò Cesa-Bianchi and Claudio Gentile. Improved risk tail bounds for on-line algorithms. IEEE Transactions on Information Theory, 54(1):386-390, January 2008.
[6] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[7] Varsha Dani and Thomas P. Hayes. Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In Proceedings of the 17th ACM-SIAM Symposium on Discrete Algorithms (SODA), 2006.
[8] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. The price of bandit information for online optimization. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20 (NIPS 2007). MIT Press, 2008.
[9] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), 2008. To appear.
[10] David A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100-118, February 1975.
[11] András György, Tamás Linder, Gábor Lugosi, and György Ottucsák. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369-2403, 2007.
[12] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291-307, 2005.
[13] Robert Kleinberg. Online decision problems with large strategy sets. PhD thesis, MIT, 2005.
[14] H. Brendan McMahan and Avrim Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In Proceedings of the 17th Annual Conference on Learning Theory (COLT), 2004.
[15] Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4:773-818, 2003.