Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs

Elad Hazan
IBM Almaden, 650 Harry Rd, San Jose, CA 95120
hazan@us.ibm.com

Satyen Kale
Microsoft Research, 1 Microsoft Way, Redmond, WA 98052
sakale@microsoft.com

Abstract

Prediction from expert advice is a fundamental problem in machine learning. A major pillar of the field is the existence of learning algorithms whose average loss approaches that of the best expert in hindsight (in other words, whose average regret approaches zero). Traditionally the regret of online algorithms was bounded in terms of the number of prediction rounds. Cesa-Bianchi, Mansour and Stoltz [4] posed the question of whether it is possible to bound the regret of an online algorithm by the variation of the observed costs. In this paper we resolve this question, and prove such bounds in the fully adversarial setting, in two important online learning scenarios: prediction from expert advice, and online linear optimization.

1 Introduction

A cornerstone of modern machine learning is the family of algorithms for prediction from expert advice. The seminal work of Littlestone and Warmuth [12], Vovk [13] and Freund and Schapire [6] gave algorithms which, under fully adversarial cost sequences, attain average cost approaching that of the best expert in hindsight. To be more precise, consider a prediction setting in which an online learner has access to $n$ experts. Iteratively, the learner may choose to follow the advice of any expert, deterministically or randomly. After the learner chooses a course of action, an adversary reveals the cost of following the advice of the different experts, from which the expected cost of the online learner is derived.

The classic results mentioned above give algorithms which sequentially produce randomized decisions, such that the difference between the (expected) cost of the algorithm and that of the best expert in hindsight grows like $O(\sqrt{T \log n})$, where $T$ is the number of prediction iterations. This extra additive cost is known as the regret of the online learning algorithm.

However, a priori it is not clear why online learning algorithms should have high regret (growing with the number of iterations) in an unchanging environment. As an extreme example, consider a setting with only two experts. Suppose that the first expert always incurs cost $1$, whereas the second expert always incurs cost $1/2$. One would expect to "figure out" this pattern quickly, and focus on the second expert, thus incurring a total cost of at most $T/2$ plus at most a constant extra cost (irrespective of the number of rounds $T$), i.e. only constant regret. However, any straightforward application of previously known analyses of expert learning algorithms only gives a regret bound of $O(\sqrt{T})$ in this simple case (or very simple variations of it).

More generally, the natural bound on the regret of a "good" learning algorithm should depend on the variation in the sequence of costs, rather than purely on the number of iterations. If the cost sequence has low variation, we expect our algorithm to perform better. This intuition has a direct analog in the stochastic setting, in which the sequence of experts' costs is sampled independently from a fixed distribution: there, a natural bound on the rate of convergence to the optimal expert is controlled by the variance of the distribution (low variance should imply faster convergence). This was formalized by Cesa-Bianchi, Mansour and Stoltz [4], who assert that "proving such a rate in the fully adversarial setting would be a fundamental result".

In this paper we prove the first such regret bounds on online learning algorithms in two important scenarios: prediction from expert advice, and the more general framework of online linear optimization. Our algorithms have regret bounded by the variation of the cost sequence, in a manner made precise in the following sections. Thus, our bounds are tighter than all previous bounds, and hence yield better bounds for the applications of previous bounds (see, for example, the applications in [4]).

1.1 Online linear optimization

Online linear optimization [10] is a general framework for online learning which has received much attention recently. In this framework the decision set is an arbitrary bounded, closed, convex set $K \subseteq \mathbb{R}^n$ rather than a fixed set of experts, and the costs are determined by adversarially constructed vectors $f_1, f_2, \ldots \in \mathbb{R}^n$, such that the cost of a point $x \in K$ is $f_t \cdot x$. The online learner iteratively chooses a point $x_t \in K$, after which the cost vector $f_t$ is revealed and the cost $f_t \cdot x_t$ is incurred. The performance of online learning algorithms is measured by the regret, defined as the difference between the total cost of the sequence of points chosen by the algorithm, viz. $\sum_{t=1}^{T} f_t \cdot x_t$, and the total cost of the least cost fixed point in hindsight, viz. $\min_{x \in K} \sum_{t=1}^{T} f_t \cdot x$.

Several decision problems fit very naturally into this framework. For example, in the online shortest path problem the online learner has to repeatedly choose a path in a given graph from a source node to a destination node. Her cost is the total length of the path according to weights chosen by an adversary. This problem can be cast as an online linear optimization problem, where the decision space is the set of all distributions over paths in the graph connecting the source to the destination. Even though this set sits in an exponential-dimensional Euclidean space, by thinking of a distribution over paths as a flow in the graph it is possible to efficiently represent the decision space as a polytope in $\mathbb{R}^{|E|}$ ($E$ denotes the set of edges of the given graph), described by $O(|E|)$ constraints, and to translate the cost functions to this new domain as well.

The general online linear optimization framework allows for efficient and natural algorithms based on the gradient descent update rule coupled with Euclidean projections [8, 14]. Specifically, we consider Zinkevich's Lazy Projection algorithm [14]. This algorithm runs online gradient descent on an auxiliary sequence of points, and in every iteration plays the projection of the current auxiliary point onto the convex set. This algorithm was shown to have regret $O(\sqrt{T})$.

The crucial geometric intuition which allows us to prove regret bounds based on the variation of the cost sequence can be summarized by the following intuitive fact: the distance between successive projections for the Lazy Projection algorithm is directly related to the variation of the cost sequence.

We now describe our bounds. Define the variation of the sequence of cost functions to be
$$\mathrm{VAR}_T = \sum_{t=1}^{T} \|f_t - \mu_T\|^2,$$
where $\mu_T = \frac{1}{T}\sum_{t=1}^{T} f_t$ is the mean of the sequence. Our analysis of the Lazy Projection algorithm yields the regret bound
$$\mathrm{Regret} = O\left(\sqrt{\mathrm{VAR}_T}\right).$$
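To make this quantity concrete, here is a minimal sketch (ours, not part of the paper; all names are ours) that computes $\mathrm{VAR}_T$ for a nearly constant cost sequence, contrasting it with the worst case, in which $\mathrm{VAR}_T$ can be as large as $\Theta(T)$ when $\|f_t\| \le 1$:

```python
# A minimal sketch (ours): the variation VAR_T = sum_t ||f_t - mu_T||^2
# of a cost sequence, evaluated on a low-variation (nearly constant) sequence.
import numpy as np

def variation(costs):
    """VAR_T for the rows f_1, ..., f_T of `costs` (a (T, n) array)."""
    mu = costs.mean(axis=0)                  # mu_T, the mean cost vector
    return float(np.sum((costs - mu) ** 2))  # sum of squared distances to mu_T

rng = np.random.default_rng(0)
T, n = 10_000, 5
base = rng.normal(size=n)
base *= 0.5 / np.linalg.norm(base)             # a fixed vector of norm 1/2
costs = base + 0.01 * rng.normal(size=(T, n))  # nearly constant, ||f_t|| <= 1

print(f"T = {T}, VAR_T = {variation(costs):.1f}")
# Here VAR_T is about 0.01^2 * n * T << T, so sqrt(VAR_T) << sqrt(T):
# exactly the regime where a variation bound beats an O(sqrt(T)) bound.
```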
1.2 Prediction from expert advice

Prediction from expert advice can be cast as a special case of online linear optimization: the decision space is the simplex of all distributions over the $n$ experts. The expectation operator provides a linear cost function on the simplex via the costs of the experts. Hence, our result for online linear optimization already implies variation bounds on the regret in the case of prediction from expert advice. However, this bound is suboptimal, as it depends on the variation of all experts rather than, say, the maximum variation of a single expert.

This issue is familiar to learning theorists: "Euclidean algorithms" such as gradient descent attain performance which relates to the Euclidean norm of the cost functions (or variations, in our case). While this Euclidean flavor is optimal in certain cases (e.g. when the underlying convex set is the hypercube), for certain convex bodies such as the simplex better performance can be achieved. Multiplicative update algorithms such as EG [11] and FPL [10] attain regret proportional to $O(R\sqrt{T\log n})$, where $R$ is a bound on the norm of the cost functions.

By analogy with the online linear optimization case, for a sequence of cost vectors $f_1, f_2, \ldots, f_T \in \mathbb{R}^n$, where $f_t(i)$ is the cost of expert $i$ in the $t$-th round, we would expect to be able to bound the regret of online linear optimization over the simplex by something like $O(\sqrt{\mathrm{VAR}_T \log n})$, where
$$\mathrm{VAR}_T = \max_{i \le n} \sum_{t=1}^{T} |f_t(i) - \mu_T(i)|^2$$
is the maximum variation in costs amongst the different experts (as before, $\mu_T(i) = \frac{1}{T}\sum_{t=1}^{T} f_t(i)$ is the mean cost of the $i$-th expert). In fact, our bound is even a bit stronger:
$$\mathrm{Regret}(T) = O\left(\sqrt{\mathrm{VAR}^{\max}_T \log n}\right).$$
Here $\mathrm{VAR}^{\max}_T \le \mathrm{VAR}_T$, and is defined to be
$$\mathrm{VAR}^{\max}_T = \max_{t \le T} \{\mathrm{VAR}_t(i^*_t)\},$$
where $\mathrm{VAR}_t(i)$ is the variation in costs of expert $i$ up to the $t$-th round, and $i^*_t$ is the best expert up to the $t$-th round.

Whereas for general online linear optimization we consider the well-known Lazy Projection algorithm, and our novelty lies in a tighter analysis, for the case of prediction from expert advice we need to consider a new algorithm. We can prove that existing variants of the multiplicative weights algorithm do not attain the performance above, and instead consider a different variant of the update rule, in which the distribution at time $t$, denoted $x_t$, is defined by
$$x_t(i) \propto \exp\left(-\eta \sum_{\tau=1}^{t-1} f_\tau(i) - 4\eta^2 \sum_{\tau=1}^{t-1} (f_\tau(i) - \mu_\tau(i))^2\right),$$
where $\eta$ is a learning rate parameter and $\mu_t = \frac{1}{t}\sum_{\tau=1}^{t-1} f_\tau$ is the (approximate) mean cost vector at iteration $t$. That is, the distribution over experts explicitly takes into account the variation in their costs. As far as we know this is a new variant of the multiplicative update algorithm family, and it is necessary to include this feature in order to prove variation bounds on the regret.

1.3 Discussion of the results

Cesa-Bianchi, Mansour and Stoltz [4] discussed desiderata for fundamental regret bounds for the expert prediction problem: invariance under translation and rescaling of the cost vectors. Invariance under translation implies that the bounds depend only on the effective ranges of the cost vectors in each round, rather than the absolute ranges (by effective range, we mean the maximum difference between the costs in any given round). This is because if, in any round, the cost vectors are all changed by the same amount, the difference between the expected cost of the algorithm in that round and the cost of any given expert remains the same as without the translation. Our regret bounds enjoy this translation invariance property: this is a direct consequence of the variation bound. This implies, for instance, that it doesn't matter what sign the costs are, and in fact our bounds are robust enough to handle mixed signs in costs.

Rescaling invariance implies that the bound continues to hold even if all the cost vectors are scaled by the same factor. Again, our regret bounds enjoy rescaling invariance, since the regret and the square-root variation scale by the same factor. We make crucial use of these invariance properties in our analysis; the invariance allows us to normalize the cost vectors in ways that make them easier to reason about.

1.4 The stationary stochastic setting vs. an adversary

A point made by [4] is that variation bounds on the regret essentially match the performance of a natural algorithm in the stochastic setting, in which the payoffs are generated by a stationary stochastic process. Let us give a rough sketch of why this is true. Consider a setting of online linear optimization over the unit ball. Suppose that the cost functions are generated by a stationary stochastic process, such that in each iteration the cost function is independently sampled from a fixed distribution with some mean vector $\mu$. For a long enough sequence of cost functions drawn from this distribution, the best point in hindsight is essentially the least cost point with respect to the cost vector $\mu$.

Let $\bar\mu$ be the observed mean of the samples. The natural algorithm uses $\bar\mu$ as a proxy for the actual mean and chooses its point with $\bar\mu$ as the cost vector, and this can be shown to be optimal. It is a standard fact that the variance of $\bar\mu$ decreases inversely with the number of samples. Thus, if $\sigma^2$ is the variance of the distribution, then the variance of $\bar\mu$ after $t$ iterations is $\frac{\sigma^2}{t}$. The expected regret in iteration $t$ is proportional to the standard deviation $\frac{\sigma}{\sqrt{t}}$, and thus the total regret of the optimal predictor is on the order of
$$\sum_{t=1}^{T} \frac{\sigma}{\sqrt{t}} = O\left(\sqrt{\sigma^2 T}\right) = O\left(\sqrt{\mathrm{VAR}_T}\right)$$
(using $\sum_{t=1}^{T} 1/\sqrt{t} \le 2\sqrt{T}$, and noting that the total variation of such a sequence is about $\sigma^2 T$). Hence, the optimal achievable regret in this simple setting is proportional to the square root of the total variation. In the sequel we prove that the same regret (up to constant factors) can be achieved in the fully adversarial setting, i.e. in a setting in which the cost functions are chosen completely adversarially.

In the stationary stochastic setting, the average cost converges to the average optimum cost at a speed that depends on the variance of the distribution: lower variance implies faster convergence. Hence, by proving variation bounds on the regret, we give a strong indication that online linear optimization in the adversarial setting is as efficient as in the stationary stochastic setting.

1.5 A brief history of prediction

It is incredible that as early as the late fifties, Hannan [7] devised an efficient algorithm for online decision making. Hannan's algorithm proceeds by adding a perturbation to the costs of actions seen so far, and choosing the action with least cost (taking into account the perturbations). He proved that the regret of an online player using his algorithm grows like $O(\sqrt{T})$, where $T$ is the number of prediction iterations.
Since then much water has flowed under the bridge and many experts have predicted: this includes the aforementioned influential multiplicative update family of algorithms [12, 13, 6], Cover's universal portfolio prediction problem [5], and the extensions of Follow-The-Perturbed-Leader [10] to online optimization and complex decision problems such as online shortest paths. The machine learning community has extended these fundamental results into a beautiful theory of general prediction using Bregman divergences and generalized projections (in order to do justice to the numerous contributors we refer to the credits in the comprehensive book of [3]).

This body of work refined the basic regret bound of $O(\sqrt{T})$. The refinement, however, deals with the constants multiplying the $\sqrt{T}$ term. Freund and Schapire [6] showed that a Multiplicative Weights algorithm based on the Weighted Majority algorithm attains regret bounds of $O(\sqrt{R\sum_{t=1}^{T} f_t(i^*)\log n})$, where it is assumed that all costs are in the range $[0, R]$, and $i^*$ is the best expert in hindsight. In the case when the costs lie in the range $[-R, R]$, Allenberg-Neeman and Neeman [1] showed that there is an expert $i$ such that the regret can be bounded by $O(\sqrt{R\sum_{t=1}^{T}|f_t(i)|\log n})$. Most recently, Cesa-Bianchi, Mansour and Stoltz [4] gave the first second-order regret bounds: they proved a bound of $O(\sqrt{A^{\max}_T\log n})$, where
$$A^{\max}_T = \max_{t \le T}\left\{\sum_{\tau=1}^{t} f_\tau(i^*_t)^2\right\}$$
is the maximum, over all time periods $t$, of the sum of squared losses up to time $t$ of the best expert at time $t$. They suggest, and indeed as we argue in the previous section it makes intuitive sense, that it should be possible to get a bound that scales as $\sqrt{\mathrm{VAR}^{\max}_T}$. In this paper we prove their conjecture to be correct, in effect providing optimal regret bounds up to constant factors.

2 Notation and background

The following definitions and derivations may be familiar to experts in learning theory, who may wish to proceed directly to the next section.

In the online linear optimization problem, the decision space is a closed, bounded, convex set $K \subseteq \mathbb{R}^n$, and we are sequentially given a series of linear cost functions $f_t : K \to \mathbb{R}$ for $t = 1, 2, \ldots$. With some abuse of notation, we also write the functions as $f_t(x) = f_t \cdot x$ for some vector $f_t \in \mathbb{R}^n$. The algorithm iteratively produces a point $x_t \in K$ in every round $t$, without knowledge of $f_t$ (but using the past sequence of cost functions), and incurs the cost $f_t(x_t)$. The regret at time $T$ is defined to be
$$\mathrm{Regret}(f_1, f_2, \ldots, f_T) := \sum_{t=1}^{T} f_t(x_t) - \min_{x \in K} \sum_{t=1}^{T} f_t(x).$$
Usually, we will drop the cost vectors from the regret notation when they are clear from context. For convenience, we define $f_0 = 0$, and let $F_t = \sum_{\tau=0}^{t-1} f_\tau$.

We proceed to describe a widely used algorithmic technique in online learning, on the basis of which we derive our algorithms. Since our goal is to get regret bounded by the variation in the cost sequence, intuitively, a Follow-The-Leader (FTL) type algorithm, which always chooses the best point so far to use in the next round, should perform well if the variation is low. The FTL algorithm by itself doesn't usually guarantee low regret, mainly because it is inherently unstable: it may swing wildly from one point to another from one iteration to the next at very little provocation. For example, consider expert prediction with 2 experts and the cost vector sequence $(1/2, 0), (0, 1), (1, 0), (0, 1), \ldots$: from the second round onward FTL always picks the expert about to incur cost $1$, while the best expert in hindsight pays only about $T/2$ (a small simulation of this instability is sketched below).
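The following minimal simulation (ours, not part of the paper) plays FTL on exactly this cost sequence and exhibits the linear regret:

```python
# A minimal sketch (ours): Follow-The-Leader on the alternating sequence
# (1/2,0), (0,1), (1,0), (0,1), ... pays cost 1 in every round after the
# first, while the best fixed expert pays only about T/2.
import numpy as np

def ftl_demo(T=1000):
    costs = np.zeros((T, 2))
    costs[0] = [0.5, 0.0]
    costs[1::2] = [0.0, 1.0]    # rounds 2, 4, 6, ...
    costs[2::2] = [1.0, 0.0]    # rounds 3, 5, 7, ...
    cum = np.zeros(2)           # cumulative costs of the two experts
    alg = 0.0
    for t in range(T):
        leader = int(np.argmin(cum))   # FTL: pick the best expert so far
        alg += costs[t, leader]
        cum += costs[t]
    print(f"FTL cost: {alg:.1f}, best expert: {cum.min():.1f}, "
          f"regret: {alg - cum.min():.1f}  (grows linearly in T)")

ftl_demo()
```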
To make it stable, we add a strictly convex regularization function $R(x)$ before computing the leader. The generic algorithm which results is shown below, and is called Follow The Regularized Leader (FTRL):

Algorithm 1 FTRL
1: Let $K$ be a convex set
2: Input: parameter $\eta > 0$, regularization function $R(x)$.
3: for $t = 1$ to $T$ do
4:   Use $x_t \leftarrow \arg\min_{x \in K}\left\{F_t \cdot x + \frac{1}{\eta}R(x)\right\}$
5:   Receive $f_t$
6: end for

A crucial observation regarding the FTRL algorithm, which we use in the analysis, is its equivalence to the following algorithm, which we call Follow the Lazy Projected Leader (FLPL). This algorithm maintains an auxiliary sequence of points which are updated using a gradient-descent type rule, and which are then projected onto the convex set using the Bregman divergence $B_R$ defined by $R$:
$$B_R(x, y) = R(x) - R(y) - \nabla R(y)\cdot(x - y).$$
The algorithm as given has an implicit update, whose implementation we ignore for now (in this paper we are only concerned with the Euclidean and relative entropy divergences, in which case the updates are efficient).

Algorithm 2 FLPL
1: Let $K$ be a convex set
2: Input: parameter $\eta > 0$, regularization function $R(x)$.
3: for $t = 1$ to $T$ do
4:   If $t = 1$, choose $y_1$ such that $\nabla R(y_1) = 0$.
5:   If $t > 1$, choose $y_t$ such that $\nabla R(y_t) = \nabla R(y_{t-1}) - \eta f_{t-1}$.
6:   Project according to $B_R$: $x_t = \arg\min_{x \in K} B_R(x, y_t)$
7: end for

In fact, the two algorithms above are identical. This is perhaps not surprising, given what is known about the so-called "mirror descent" algorithm (e.g. [3]). Nevertheless, this fact is crucial for our later derivations, and since we did not find this precise statement elsewhere, we include a short proof.

Lemma 1 The two algorithms above produce identical predictions, i.e.
$$\arg\min_{x \in K}\left\{F_t \cdot x + \frac{1}{\eta}R(x)\right\} = \arg\min_{x \in K} B_R(x, y_t).$$

Proof: First, let us observe that the unconstrained optimum
$$x^* = \arg\min_{x \in \mathbb{R}^n}\left\{F_t \cdot x + \frac{1}{\eta}R(x)\right\}$$
satisfies
$$F_t + \frac{1}{\eta}\nabla R(x^*) = 0.$$
By induction on the update rule of FLPL, the same equation is satisfied by $y_t$. Since $R(x)$ is assumed to be strictly convex, there is only one solution to this equation, and thus $y_t = x^*$. Hence,
$$B_R(x, y_t) = R(x) - R(y_t) - \nabla R(y_t)\cdot(x - y_t) = R(x) - R(y_t) + \eta F_t\cdot(x - y_t).$$
Since $R(y_t)$ and $F_t \cdot y_t$ are constants (i.e. independent of $x$), $B_R(x, y_t)$ is minimized at the point $x$ that minimizes $R(x) + \eta F_t \cdot x$, which implies that
$$\arg\min_{x \in K} B_R(x, y_t) = \arg\min_{x \in K}\left\{F_t \cdot x + \frac{1}{\eta}R(x)\right\}.$$

One important property which follows from the first characterization of $x_t$ is the following standard bound on the regret, due to Kalai and Vempala [10], called the Follow-The-Leader/Be-The-Leader (FTL-BTL) inequality:

Lemma 2 The regret of the FTRL (or equivalently, the FLPL) algorithm is bounded as
$$\mathrm{Regret} \le \sum_{t=1}^{T} f_t\cdot(x_t - x_{t+1}) + \frac{1}{\eta}\left[R(x_T) - R(x_0)\right].$$
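Lemma 1 is also easy to check numerically. The sketch below (ours, not part of the paper) instantiates both algorithms with the Euclidean regularizer $R(x) = \frac{1}{2}\|x\|^2$ and $K$ the unit ball, in which case the Bregman projection is the ordinary Euclidean projection and the FTRL minimizer is the projection of $-\eta F_t$:

```python
# A numeric check of Lemma 1 (ours): with R(x) = ||x||^2/2 and K the unit ball,
# FTRL's argmin of F_t.x + R(x)/eta over K is the Euclidean projection of
# -eta*F_t, which is exactly the FLPL point (grad R(y_t) = y_t = -eta*F_t).
import numpy as np

def project_ball(y):
    return y / max(1.0, np.linalg.norm(y))

rng = np.random.default_rng(1)
T, n, eta = 50, 4, 0.3
fs = rng.uniform(-1, 1, size=(T, n))

F = np.zeros(n)   # F_t = f_0 + ... + f_{t-1}, with f_0 = 0
y = np.zeros(n)   # FLPL auxiliary point; y_1 = 0 since grad R(y_1) = y_1 = 0
for t in range(T):
    x_ftrl = project_ball(-eta * F)   # completing the square shows this is the
                                      # constrained minimizer of F_t.x + R(x)/eta
    x_flpl = project_ball(y)          # Bregman (= Euclidean) projection of y_t
    assert np.allclose(x_ftrl, x_flpl)
    F += fs[t]
    y -= eta * fs[t]

print("FTRL and FLPL produced identical predictions in every round.")
```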
3 Algorithms and main results

In this section we describe the algorithms for which we prove variation bounds, and state their performance guarantees formally.

3.1 Online linear optimization

We start by describing our result for online linear optimization. Following the notation defined in the previous section, we assume that $K \subseteq \mathbb{B}^n$, where $\mathbb{B}^n$ is the unit ball in $\mathbb{R}^n$, and that $0 \in K$. This is without loss of generality, and can be ensured by a suitable scaling and translation of $K$: scaling $K$ down by its diameter $D$ makes the diameter $1$ and scales the regret down by $D$ as well, and changing the coordinate system so that $K$ contains the origin doesn't change the regret bound. Here, we are making use of the translation invariance of our regret bounds. We also assume that for all $t$, $\|f_t\| \le 1$. If we have some other bound $R$ on $\|f_t\|$, then we scale the $f_t$'s down by $R$ to get new cost vectors $f_t' = f_t/R$ such that $\|f_t'\| \le 1$. We can then run the algorithm pretending that $f_t'$ is the sequence of cost vectors.

Define the variation of the sequence of cost vectors $f_1, \ldots, f_T$ to be
$$\mathrm{VAR}_T(f_1, f_2, \ldots, f_T) = \sum_{t=1}^{T} \|f_t - \mu\|^2,$$
where $\mu = \frac{1}{T}\sum_{t=1}^{T} f_t$ is the vector that minimizes this expression. Usually, we will drop the cost vectors from the notation for the variation, and refer to it simply as $\mathrm{VAR}_T$, when the cost vectors are clear from context.

To see that scaling has no effect on the regret bound, note that
$$\mathrm{VAR}_T(f_1', \ldots, f_T') = \frac{1}{R^2}\mathrm{VAR}_T(f_1, \ldots, f_T) \quad\text{and}\quad \mathrm{Regret}(f_1', \ldots, f_T') = \frac{1}{R}\mathrm{Regret}(f_1, \ldots, f_T).$$
Thus, if $\mathrm{Regret}(f_1', \ldots, f_T') = O(\sqrt{\mathrm{VAR}_T(f_1', \ldots, f_T')})$, then $\mathrm{Regret}(f_1, \ldots, f_T) = O(\sqrt{\mathrm{VAR}_T(f_1, \ldots, f_T)})$. This is exactly the rescaling invariance discussed earlier.

For ease of notation, we define $f_0 = 0$, and for any $t > 0$, let $F_t = \sum_{\tau=0}^{t-1} f_\tau$ and $\mu_t = \frac{1}{t}F_t = \frac{1}{t}\sum_{\tau=0}^{t-1} f_\tau$. We instantiate the FTRL/FLPL algorithm with the regularization function $R(x) = \frac{1}{2}\|x\|^2$. This regularization has been considered many times before; the only change here is the choice of a different "learning rate" $\eta$, which enables us to prove the novel regret bounds. Since $\nabla R(x) = x$ for this regularization, the algorithm that results is:

Algorithm 3 Lazy Projection
1: Let $K$ be a convex set
2: Input: parameter $\eta > 0$.
3: for $t = 1$ to $T$ do
4:   If $t = 1$, choose $y_1 = 0$.
5:   If $t > 1$, let $y_t = y_{t-1} - \eta f_{t-1}$.
6:   Use $x_t = \arg\min_{x \in K} \|x - y_t\|$.
7: end for

Our main theorem with respect to online linear optimization is:

Theorem 3 Let $f_t$, for $t = 1, 2, \ldots, T$, be a sequence of cost vectors such that $\|f_t\| \le 1$. Setting $\eta = \min\{2/\sqrt{\mathrm{VAR}_T},\, 1/6\}$, the regret of the Lazy Projection algorithm is bounded by
$$\mathrm{Regret} \le 16\sqrt{\mathrm{VAR}_T}.$$

Of course, an upper bound on the total variation $\mathrm{VAR}_T$ may not be known in advance. Even so, standard $\eta$-halving tricks (start with $\eta = 1/6$, and halve $\eta$ as soon as the variation quadruples, restarting the algorithm) immediately give an $O(\sqrt{\mathrm{VAR}_T})$ regret bound. The formal application of this standard trick is omitted from this extended abstract.
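A sketch of Algorithm 3 with the learning rate of Theorem 3 follows (ours, not part of the paper; we take $K$ to be the unit ball itself and assume $\mathrm{VAR}_T$ is known in advance; the $\eta$-halving trick above removes that assumption):

```python
# A sketch (ours) of Algorithm 3 (Lazy Projection) over K = the unit ball.
import numpy as np

def lazy_projection(costs, var_T):
    """costs: (T, n) array with ||f_t|| <= 1. Returns the realized regret."""
    eta = min(2.0 / np.sqrt(max(var_T, 1e-12)), 1.0 / 6.0)  # Theorem 3's rate
    y = np.zeros(costs.shape[1])                 # auxiliary point, y_1 = 0
    total = 0.0
    for f in costs:
        x = y / max(1.0, np.linalg.norm(y))      # x_t = projection of y_t on K
        total += float(f @ x)
        y -= eta * f                             # lazy gradient step
    best = -np.linalg.norm(costs.sum(axis=0))    # min over the ball of F.x = -||F||
    return total - best

rng = np.random.default_rng(2)
costs = 0.5 / np.sqrt(3) * np.ones((1000, 3)) + 0.02 * rng.normal(size=(1000, 3))
var_T = float(np.sum((costs - costs.mean(axis=0)) ** 2))
print(f"regret = {lazy_projection(costs, var_T):.2f},  "
      f"16*sqrt(VAR_T) = {16 * np.sqrt(var_T):.2f}")
```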
3.2 Prediction from expert advice

In the expert learning problem, we assume that we have access to $n$ experts. In each round $t$, we choose a distribution $x_t$ over the experts and sample an expert from it. Then we obtain a cost vector $f_t$, which specifies a cost $f_t(i)$ for every expert $i$, and we incur the cost of the chosen expert. Our goal is to bound the total expected cost of the algorithm (i.e. $\sum_{t=1}^{T} f_t \cdot x_t$) relative to the total cost of the expert with minimum total cost in hindsight (i.e. $\min_i \sum_{t=1}^{T} f_t(i)$). For simplicity, we assume that all costs $f_t(i) \in [0, 1]$. This can be assumed without loss of generality, by the rescaling and translation invariance of our final regret bounds. In general, all we need is a bound $R$ on the maximum value of $f_t(i) - f_t(j)$ over all rounds $t$ and all pairs of experts $i, j$: in each round, the algorithm can be run by scaling the costs of all experts down by $R$, and then subtracting out the minimum cost in that round. As in the case of online linear optimization, this scaling and translation doesn't affect the square-root variation bound on the regret.

As mentioned before, this setting is a special case of online linear optimization where the domain $K$ is the simplex (denoted $\Delta$) of distributions over the experts. To design an algorithm for this special case, we need a different regularization function, $ne(x) = \sum_i x_i \ln x_i - x_i$. The Bregman divergence which arises from it is the un-normalized relative entropy (c.f. Herbster and Warmuth [9]), defined on $\mathbb{R}^n_+$ as follows:
$$D_{ne}(x, y) := \sum_i x_i \ln\frac{x_i}{y_i} + y_i - x_i.$$
Note that when $x, y \in \Delta$, $D_{ne}(x, y)$ is the relative entropy between $x$ and $y$, and $ne(x)$ is the negative entropy of $x$. The Bregman projection onto the simplex with the un-normalized relative entropy divergence is implemented simply by scaling all the coordinates so that they sum to $1$.

A significant twist on the usual multiplicative weights algorithm is that we modify the cost functions to explicitly take the variation into account: we actually run the FTRL/FLPL algorithm on the sequence of cost vectors $\tilde f_1, \tilde f_2, \ldots$, where
$$\tilde f_t(i) = f_t(i) + 4\eta\,(f_t(i) - \mu_t(i))^2,$$
where $\mu_t = \frac{1}{t}\sum_{\tau=0}^{t-1} f_\tau$. As before, we use the convention that $f_0 = 0$. For ease of notation, for a vector $x$, we define the vector $x^2$ by $x^2(i) = x(i)^2$. Thus, we can write $\tilde f_t$ compactly as $\tilde f_t = f_t + 4\eta(f_t - \mu_t)^2$. The algorithm which results is given below:

Algorithm 4 Variation MW
1: Input: parameter $\eta > 0$.
2: for $t = 1$ to $T$ do
3:   If $t = 1$, choose $y_1 = \mathbf{1}$, the all-1's vector.
4:   If $t > 1$, let $y_t(i) = y_{t-1}(i)\exp(-\eta\tilde f_{t-1}(i))$, where $\tilde f_{t-1}(i) = f_{t-1}(i) + 4\eta(f_{t-1}(i) - \mu_{t-1}(i))^2$.
5:   Use $x_t = y_t/Z_t$, where $Z_t = \sum_i y_t(i)$.
6: end for

Define
$$\mathrm{VAR}^{\max}_T = \max_{t \le T}\{\mathrm{VAR}_t(i^*_t)\},$$
where $i^*_t$ is the best expert up to the $t$-th round, and $\mathrm{VAR}_t(i) = \sum_{\tau=1}^{t}(f_\tau(i) - \mu^t(i))^2$, where $\mu^t(i) = \frac{1}{t}\sum_{\tau=1}^{t} f_\tau(i)$ is the mean cost of the $i$-th expert up to the $t$-th round (note that, unlike $\mu_t$, the mean $\mu^t$ includes round $t$). Our main result concerning prediction from expert advice is:

Theorem 4 Let $f_t$, for $t = 1, 2, \ldots, T$, be a sequence of cost vectors to the experts such that $f_t(i) \in [0, 1]$. Setting $\eta = \min\{\sqrt{\log(n)/4\mathrm{VAR}^{\max}_T},\, 1/10\}$, the regret of the Variation MW algorithm is bounded by
$$\mathrm{Regret} \le 8\sqrt{\mathrm{VAR}^{\max}_T\log(n)} + 10\log(n).$$

Again, $\eta$-halving tricks can be used to obtain this result in the case when $\mathrm{VAR}^{\max}_T$ is not known ahead of time. The additive $\log(n)$ term is inherent in all expert learning algorithms and also appears in all previously known regret bounds.
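The following sketch (ours, not part of the paper) implements Algorithm 4 in log-space for numerical stability; we use a fixed $\eta \le 1/10$ rather than the optimized value of Theorem 4, which would require knowing $\mathrm{VAR}^{\max}_T$ in advance:

```python
# A sketch (ours) of Algorithm 4 (Variation MW). Costs f_t(i) lie in [0, 1];
# the weights use the variation-corrected costs f~_t = f_t + 4*eta*(f_t - mu_t)^2.
import numpy as np

def variation_mw(costs, eta=0.1):
    """costs: (T, n) array with entries in [0, 1]. Returns the expected regret."""
    T, n = costs.shape
    log_y = np.zeros(n)      # log of the weights y_t; y_1 = all-ones vector
    F = np.zeros(n)          # F_t = f_0 + ... + f_{t-1}, with f_0 = 0
    total = 0.0
    for t, f in enumerate(costs, start=1):
        x = np.exp(log_y - log_y.max())
        x /= x.sum()                              # x_t = y_t / Z_t
        total += float(f @ x)                     # expected cost this round
        mu = F / t                                # mu_t = F_t / t
        f_tilde = f + 4 * eta * (f - mu) ** 2     # variation-corrected cost
        log_y -= eta * f_tilde
        F += f
    return total - costs.sum(axis=0).min()        # regret vs. the best expert

rng = np.random.default_rng(3)
costs = rng.uniform(0.4, 0.6, size=(5000, 10))    # a low-variation cost sequence
print(f"regret = {variation_mw(costs):.2f} over T = 5000 rounds")
```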
4 Analysis of the Lazy Projection algorithm

In this section we prove Theorem 3. The proof uses the dual characterization of FTRL-type algorithms introduced previously: on the one hand, we follow the standard methodology of Follow-The-Leader type algorithms, bounding the regret by the distance between consecutive predictions; on the other hand, we use the fact that these predictions are projections of aggregate cost vectors, and analyze the distance between successive projections. This latter analysis is the main crux of the proof: we refine previous approaches by giving a tighter bound on this distance, based on simple geometric intuition.

Proof: (Theorem 3) In order to aid understanding, we present the proof as a series of lemmas, whose proofs we defer to after the present proof. We start by invoking the FTL-BTL inequality (Lemma 2) to obtain the following bound:

Lemma 5
$$\mathrm{Regret} \le \sum_{t=1}^{T}(f_t - \mu_t)\cdot(x_t - x_{t+1}) + \frac{1}{\eta}.$$

We proceed to relate the distance between successive projections to the variation in the cost vectors. This lemma is the main crux of the proof, and is based on the geometric intuition depicted in Figure 1: if the sequence of cost vectors has low variation, then the cumulative cost vector $F_t$ is far away from the convex body, and in such a case the distance between successive projections can be bounded in terms of the length of the component of $f_t$ orthogonal to $F_t$, which can in turn be bounded in terms of $\|f_t - \mu_t\|$, since $\mu_t = \frac{1}{t}F_t$.

Lemma 6 For all $t$, we have
$$\|x_t - x_{t+1}\| \le \frac{3\eta}{2}\|f_t - \mu_t\| + \frac{2}{t}.$$

For ease of notation, we define a parameter of the cost vectors which will be used further in the analysis:
$$\Lambda(T) := \sum_{t=1}^{T}\frac{1}{t}\|f_t - \mu_t\|.$$
This parameter measures the variation of the cost vectors. Using the Cauchy-Schwarz inequality and Lemma 6, we get
$$(f_t - \mu_t)\cdot(x_t - x_{t+1}) \le \|f_t - \mu_t\|\,\|x_t - x_{t+1}\| \le \frac{3\eta}{2}\|f_t - \mu_t\|^2 + \frac{2}{t}\|f_t - \mu_t\|.$$
Plugging this into the regret bound of Lemma 5 gives us
$$\mathrm{Regret} \le \frac{3\eta}{2}\sum_{t=1}^{T}\|f_t - \mu_t\|^2 + 2\Lambda(T) + \frac{1}{\eta}. \quad (1)$$
To proceed from here, we use the following lemma (which, curiously enough, is proved using the analysis of an online learning algorithm that has nothing to do with the present setting!):

Lemma 7 For any vector $\mu$, we have
$$\sum_{t=1}^{T}\|f_t - \mu_t\|^2 \le \sum_{t=1}^{T}\|f_t - \mu\|^2 + 4\Lambda(T).$$

Plugging into equation (1), we get that for any vector $\mu$ (and in particular, for $\mu = \mu_T := \frac{1}{T}\sum_{t=1}^{T} f_t$),
$$\mathrm{Regret} \le \frac{3\eta}{2}\sum_{t=1}^{T}\|f_t - \mu\|^2 + (2 + 6\eta)\Lambda(T) + \frac{1}{\eta}.$$
We can bound $\Lambda(T)$ as follows:

Lemma 8 For any vector $\mu$, we have
$$\Lambda(T) \le 3\sqrt{\sum_{t=1}^{T}\|f_t - \mu\|^2}.$$

Finally, setting $\eta = \min\{2/\sqrt{\mathrm{VAR}_T},\, 1/6\}$, the proof is complete.

We now give the omitted proofs of the lemmas used in the above proof.

Proof: (Lemma 5) By the definition of $x_t$, we know that
$$F_t\cdot x_t + \frac{1}{2\eta}\|x_t\|^2 \le F_t\cdot x_{t+1} + \frac{1}{2\eta}\|x_{t+1}\|^2.$$
Recall that $\mu_t = F_t/t$. Hence,
$$\sum_{t=1}^{T}\mu_t\cdot(x_t - x_{t+1}) = \sum_{t=1}^{T}\frac{1}{t}F_t\cdot(x_t - x_{t+1}) \le \sum_{t=1}^{T}\frac{1}{t}\cdot\frac{1}{2\eta}\left(\|x_{t+1}\|^2 - \|x_t\|^2\right)$$
$$= \frac{1}{2\eta}\left[\frac{1}{T}\|x_{T+1}\|^2 - \|x_1\|^2 + \sum_{t=2}^{T}\left(\frac{1}{t-1} - \frac{1}{t}\right)\|x_t\|^2\right] \le \frac{1}{2\eta}.$$
Here, we use the fact that $\|x_t\| \le 1$. The stated bound then follows from Lemma 2, by writing $f_t = (f_t - \mu_t) + \mu_t$ and noting that $\frac{1}{\eta}[R(x_T) - R(x_0)] \le \frac{1}{2\eta}$.

[Figure 1 (placeholder; original image not recoverable): the distance between successive projections, viz. $\|x_t - x_{t+1}\|$, is bounded by the length of the component of $-\eta f_t$ orthogonal to $y_t - x_t$.]

Proof: (Lemma 6) We split the analysis into two cases:

1. $\|F_t\| \le 2/\eta$: Assume that $\|F_t\| > 0$. Since $x_t$ and $x_{t+1}$ are the projections of $y_t$ and $y_{t+1}$ respectively onto $K$, by the Projection Lemma 9 we have
$$\|x_t - x_{t+1}\| \le \|y_t - y_{t+1}\| = \eta\|f_t\| \le \eta\|f_t - \mu_t\| + \eta\|\mu_t\| = \eta\|f_t - \mu_t\| + \frac{\eta\|F_t\|}{t} \le \eta\|f_t - \mu_t\| + \frac{2}{t}.$$
If $\|F_t\| = 0$, then $\mu_t = 0$ and the Projection Lemma 9 implies $\|x_t - x_{t+1}\| \le \eta\|f_t\| = \eta\|f_t - \mu_t\|$, so the stated bound still holds.

2. $\|F_t\| \ge 2/\eta$: we first show the following bound:
$$\|x_t - x_{t+1}\| \le \eta\|f_t - \mu_t\| + \frac{\|f_t\|}{\|F_t\|}. \quad (2)$$
Consider two unit vectors: $u$ in the direction $y_t - x_t$, and $v$ in the direction $y_t$. We claim that the sine of the angle $\theta$ between these vectors is at most $1/(\eta\|F_t\|)$. To see this, consider the triangle formed by the points $0, x_t, y_t$. We are interested in the angle $\theta$ at the vertex $y_t$ (see Figure 1). Let $\phi$ be the angle at $x_t$. By the law of sines, we have
$$\sin(\theta) = \frac{\|x_t\|\sin(\phi)}{\|y_t\|} \le \frac{1}{\|y_t\|} = \frac{1}{\eta\|F_t\|},$$
where the inequality follows because $\|x_t\| \le 1$ and $\sin(\phi) \le 1$.

Now, we consider the components of $f_t$ along $u$ and $v$: define $f_t^u = (f_t \cdot u)u$ and $f_t^v = (f_t \cdot v)v$. Consider the point $y_t - \eta f_t^u$. Since it lies on the line joining $y_t$ to $x_t$, its projection onto $K$ is also $x_t$. Here, we use the fact that $y_t - \eta f_t^u$ is outside $K$: this is because $\|y_t - \eta f_t^u\| \ge \|y_t\| - \eta\|f_t^u\| \ge \eta\|F_t\| - \eta \ge 1$. By the Projection Lemma 9, we have
$$\|x_{t+1} - x_t\| \le \|y_{t+1} - (y_t - \eta f_t^u)\| = \eta\|f_t - f_t^u\|. \quad (3)$$
Let $x = (f_t^v\cdot u)u$ be the projection of $f_t^v$ onto the subspace spanned by $u$. Then, since $f_t^u$ is the projection of $f_t$ onto the subspace spanned by $u$, it is the closest point to $f_t$ in that subspace, and since $x$ is also in that subspace, we have
$$\|f_t - f_t^u\| \le \|f_t - x\| \le \|f_t - f_t^v\| + \|f_t^v - x\| \le \|f_t - f_t^v\| + \|f_t^v\|\sin(\theta) \le \|f_t - \mu_t\| + \frac{\|f_t\|}{\eta\|F_t\|}.$$
The last inequality follows because $f_t^v$ is the closest point to $f_t$ in the subspace spanned by $v$, and $\mu_t$ is a point in this subspace. Plugging this bound into (3), we get (2). Now, we have the following bound on $\|f_t\|/\|F_t\|$:
$$\frac{\|f_t\|}{\|F_t\|} \le \frac{\|f_t - \mu_t\| + \|\mu_t\|}{\|F_t\|} \le \frac{\eta}{2}\|f_t - \mu_t\| + \frac{1}{t}. \quad (4)$$
Plugging (4) into (2), we get the required bound.

Proof: (Lemma 7) We may assume that $\|\mu\| \le 1$, since the right-hand side is minimized at $\mu = \frac{1}{T}\sum_{t=1}^{T} f_t$, which has norm at most $1$. The statement of the lemma essentially bounds the regret of the FTL algorithm played on the sequence of cost functions $c_t(x) = \|x - f_t\|^2$, for $t = 0, 1, 2, \ldots, T$, with convex domain the unit ball $\mathbb{B}^n$. This is because the leader in round $t$ is
$$\arg\min_{x \in \mathbb{B}^n}\left\{\sum_{\tau=0}^{t-1}\|x - f_\tau\|^2\right\} = \frac{1}{t}\sum_{\tau=0}^{t-1} f_\tau = \mu_t.$$
We assume here that the first point played by the algorithm is $0$. Then by the FTL-BTL inequality (Lemma 2), the regret of the FTL algorithm can be bounded as (here, the regularization function $R(x)$ is null):
$$\mathrm{Regret} \le c_0(0) - c_0(\mu_1) + \sum_{t=1}^{T} c_t(\mu_t) - c_t(\mu_{t+1}) \le \sum_{t=1}^{T}\nabla c_t(\mu_t)\cdot(\mu_t - \mu_{t+1}) \quad (\text{$c_t$ is convex})$$
$$\le \sum_{t=1}^{T}\|\nabla c_t(\mu_t)\|\,\|\mu_t - \mu_{t+1}\| = \sum_{t=1}^{T} 2\|f_t - \mu_t\|\,\|\mu_t - \mu_{t+1}\|.$$
Now, we have
$$\|\mu_t - \mu_{t+1}\| = \left\|\mu_t - \frac{t\mu_t + f_t}{t+1}\right\| = \frac{1}{t+1}\|\mu_t - f_t\| \le \frac{1}{t+1}\left(\|\mu_t\| + \|f_t\|\right) \le \frac{2}{t}.$$
Thus, the regret is bounded by $4\Lambda(T)$. Since the total cost of FTL here is $\sum_t\|\mu_t - f_t\|^2$ and the cost of the fixed point $\mu$ is $\sum_t\|\mu - f_t\|^2$, the lemma follows.

Proof: (Lemma 8) We may assume without loss of generality that $\mu = 0$: using the vectors $f_t - \mu$ instead of $f_t$ doesn't change the value of $\Lambda(T)$. We have
$$\Lambda(T) = \sum_{t=1}^{T}\frac{1}{t}\left\|f_t - \frac{F_t}{t}\right\| \le \sum_{t=1}^{T}\frac{1}{t}\|f_t\| + \frac{1}{t^2}\|F_t\| \le \sum_{t=1}^{T}\frac{1}{t}\|f_t\| + \sum_{t=1}^{T}\frac{1}{t^2}\sum_{\tau=1}^{t-1}\|f_\tau\|$$
$$\le \sum_{t=1}^{T}\frac{1}{t}\|f_t\| + \sum_{\tau=1}^{T}\|f_\tau\|\sum_{t=\tau+1}^{T}\frac{1}{t^2} \le 2\sum_{t=1}^{T}\frac{1}{t}\|f_t\| \le \sqrt{4\sum_{t=1}^{T}\frac{1}{t^2}}\cdot\sqrt{\sum_{t=1}^{T}\|f_t\|^2} \quad (\text{Cauchy-Schwarz})$$
$$\le 3\sqrt{\sum_{t=1}^{T}\|f_t\|^2},$$
as required.

The projection lemma which follows is a well-known fact from convex optimization theory. We include the proof for completeness.

Lemma 9 (Projection lemma) Let $K$ be a convex set, and let $x$ and $y$ be any two points. Let $x'$ and $y'$ be their respective projections onto $K$. Then $\|x' - y'\| \le \|x - y\|$.

Proof: Assume that $x' \ne y'$; otherwise the inequality is trivial. By the properties of projections onto convex sets, we have
$$(x - x')\cdot(y' - x') \le 0 \quad\text{and}\quad (y - y')\cdot(x' - y') \le 0. \quad (5)$$
Consider the line $\ell$ passing through $x'$ and $y'$, and consider the projections $x''$ and $y''$ of $x$ and $y$ respectively onto $\ell$. The inequalities (5) imply that along $\ell$, the order of the points is $(x'', x', y', y'')$. Thus, we have
$$\|x' - y'\| \le \|x'' - y''\| \le \|x - y\|,$$
where the last inequality follows because the projection of any line segment onto any line is no longer than the segment itself.

5 Analysis of the Variation MW algorithm

The analysis of Variation MW is straightforward, though complicated somewhat by heavy algebraic manipulations. We outline the main ideas now. Our starting point is Lemma 10, a well-known bound which relates the regret of the Multiplicative Weights algorithm to the expected squared losses of the experts (the expectation being taken under the distributions generated by the algorithm). Next, we make crucial use of the fact that the Multiplicative Weights algorithm puts exponentially higher weight on experts with lower cost than on those with higher cost. Since we explicitly factor the variation in the costs of each expert into their exponential weights, eventually the algorithm starts to concentrate all the weight on experts with lower cost and lower variation. This yields the desired regret bound.

We now describe a regret bound on the performance of the Multiplicative Weights algorithm. This bound is well known (see, e.g., [4, 2]); we include the short proof for completeness.

Lemma 10 Suppose in round $t$ of the expert prediction problem, expert $i$ incurs cost $g_t(i)$, where $|g_t(i)| \le M$. Consider the Multiplicative Weights algorithm, which in round $t$ chooses expert $i$ with probability $x_t(i) \propto \exp(-\eta\sum_{\tau=1}^{t-1} g_\tau(i))$. Then, if $\eta \le 1/M$,
$$\mathrm{Regret} \le \eta\sum_{t=1}^{T} g_t^2\cdot x_t + \frac{\log n}{\eta}.$$

Proof: Let $w_t(i) = \exp(-\eta\sum_{\tau=1}^{t-1} g_\tau(i))$, and let $Z_t = \sum_i w_t(i)$. Then the distribution over the experts at time $t$ is exactly $w_t/Z_t$. We think of $Z_t$ as a potential function, and track how it changes over time. Initially, $Z_1 = n$. We have
$$Z_{t+1} = \sum_i w_t(i)\exp(-\eta g_t(i)) \le \sum_i w_t(i)\left(1 - \eta g_t(i) + \eta^2 g_t(i)^2\right) \quad (6)$$
$$= Z_t\left(1 - \eta(g_t\cdot x_t) + \eta^2(g_t^2\cdot x_t)\right) \le Z_t\exp\left(-\eta(g_t\cdot x_t) + \eta^2(g_t^2\cdot x_t)\right).$$
In (6), we used the fact that for $|x| \le 1$ we have $\exp(x) \le 1 + x + x^2$, together with $|\eta g_t(i)| \le 1$. Thus, by induction, we have
$$Z_{T+1} \le n\exp\left(-\eta\sum_{t=1}^{T}(g_t\cdot x_t) + \eta^2\sum_{t=1}^{T}(g_t^2\cdot x_t)\right).$$
Also, for any expert $i$ we have the bound
$$Z_{T+1} \ge w_{T+1}(i) = \exp\left(-\eta\sum_{t=1}^{T} g_t(i)\right).$$
Putting these two inequalities together, taking logarithms and simplifying, we get the desired bound on the regret.
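As a quick sanity check of Lemma 10 (ours, not part of the paper), one can run standard multiplicative weights on random signed costs and compare the realized regret to the bound $\eta\sum_t g_t^2\cdot x_t + \log(n)/\eta$:

```python
# A sanity check (ours) of Lemma 10: standard multiplicative weights with costs
# g_t(i) in [-M, M] and eta <= 1/M; the realized regret never exceeds the bound.
import numpy as np

rng = np.random.default_rng(4)
T, n, M = 2000, 8, 1.0
eta = 0.1                     # satisfies eta <= 1/M
g = rng.uniform(-M, M, size=(T, n))

log_w = np.zeros(n)
alg, bound = 0.0, np.log(n) / eta
for t in range(T):
    x = np.exp(log_w - log_w.max())
    x /= x.sum()                           # x_t(i) ~ exp(-eta * past costs of i)
    alg += float(g[t] @ x)
    bound += eta * float((g[t] ** 2) @ x)  # the second-order term of Lemma 10
    log_w -= eta * g[t]

regret = alg - g.sum(axis=0).min()
print(f"regret = {regret:.1f}  <=  bound = {bound:.1f}")
```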
For our analysis, we use a slightly different notion of variation of the experts' costs: for any round $t$ and any expert $i$, define
$$Q_t(i) = \sum_{\tau=1}^{t-1}(f_\tau(i) - \mu_\tau(i))^2.$$
Recall that the usual definition of the variation of an expert's cost up to the $t$-th round is simply
$$\mathrm{VAR}_t(i) = \sum_{\tau=1}^{t}(f_\tau(i) - \mu^t(i))^2,$$
where $\mu^t(i) = \frac{1}{t}\sum_{\tau=1}^{t} f_\tau(i)$. But it is easily seen from (the one-dimensional version of) Lemmas 7 and 8 that
$$Q_t(i) \le \mathrm{VAR}_t(i) + 12\sqrt{\mathrm{VAR}_t(i)}, \quad (7)$$
and thus $Q_t(i)$ can serve as a proxy for the true variation (up to constant factors).

Recall that $i^*_t$ is the best expert up to time $t$, and $\mathrm{VAR}^{\max}_T = \max_{t \le T}\{\mathrm{VAR}_t(i^*_t)\}$. Define $Q^{\max}_T = \max_{t \le T} Q_t(i^*_t)$. Then we have
$$Q^{\max}_T \le 4\,\mathrm{VAR}^{\max}_T,$$
assuming that $\mathrm{VAR}^{\max}_T \ge 16$. The following lemma, combined with inequality (7), then implies Theorem 4.

Lemma 11 Let $f_t$, for $t = 1, 2, \ldots, T$, be a sequence of cost vectors to the experts such that $f_t(i) \in [0, 1]$. Let $i^*_t$ be the best expert at time $t$, and let $Q$ be an upper bound on $Q^{\max}_T = \max_t\{Q_t(i^*_t)\}$. Then setting $\eta = \min\{\sqrt{\log(n)/4Q},\, 1/10\}$, the regret of the Variation MW algorithm is bounded by
$$\mathrm{Regret} \le 4\sqrt{Q\log(n)} + 10\log(n).$$

Proof: Define $g_t = \tilde f_t - \lambda_t\mathbf{1}$, where $\lambda_t = \mu_t(i^*_t) + \frac{4\eta}{t}Q_t(i^*_t)$ and $\mathbf{1}$ is the all-1's vector. Note that for any $i$,
$$\exp\left(-\eta\sum_{\tau=1}^{t-1} g_\tau(i)\right) = \frac{1}{Z}\exp\left(-\eta\sum_{\tau=1}^{t-1}\tilde f_\tau(i)\right),$$
where $Z$ is a scaling constant independent of $i$. Hence, scaling either the weights $\exp(-\eta\sum_{\tau=1}^{t-1} g_\tau(i))$ or the weights $\exp(-\eta\sum_{\tau=1}^{t-1}\tilde f_\tau(i))$ to sum up to $1$ yields the same distribution, viz. $x_t$. Since we assumed that $f_t(i) \in [0, 1]$, we conclude that $g_t(i) \in [-2, 2]$ (since $4\eta \le 1$). Applying Lemma 10 to the sequence of cost vectors $g_t$ (and comparing to expert $i^*_T$, the final best expert), we get the following regret bound:
$$\sum_{t=1}^{T}\tilde f_t\cdot x_t - \sum_{t=1}^{T}\tilde f_t(i^*_T) \le \eta\sum_{t=1}^{T} g_t^2\cdot x_t + \frac{\log n}{\eta}.$$
Here, we used the fact that $\sum_{t=1}^{T}\lambda_t\mathbf{1}\cdot x_t = \sum_{t=1}^{T}\lambda_t$. Simplifying using the definition of $\tilde f_t$, we get
$$\sum_{t=1}^{T} f_t\cdot x_t - \sum_{t=1}^{T} f_t(i^*_T) \le \eta\sum_{t=1}^{T} g_t^2\cdot x_t - 4\eta\sum_{t=1}^{T}(f_t - \mu_t)^2\cdot x_t + 4\eta\sum_{t=1}^{T}(f_t(i^*_T) - \mu_t(i^*_T))^2 + \frac{\log n}{\eta}$$
$$\le \eta\sum_{t=1}^{T}\left[g_t^2 - 4(f_t - \mu_t)^2\right]\cdot x_t + 4\eta(Q + 1) + \frac{\log n}{\eta}, \quad (8)$$
since $\sum_{t=1}^{T}(f_t(i^*_T) - \mu_t(i^*_T))^2 \le Q_T(i^*_T) + 1 \le Q + 1$. The following lemma bounds the first term in (8). The proof is a straightforward calculation, and we defer it to after the present proof.

Lemma 12 If $\eta \le 1/10$, then for any $i$, we have
$$g_t(i)^2 - 4(f_t(i) - \mu_t(i))^2 \le 2(\mu_t(i) - \lambda_t)^2.$$

Plugging this bound into (8), we get
$$\mathrm{Regret} \le 2\eta\sum_{t=1}^{T}(\mu_t - \lambda_t\mathbf{1})^2\cdot x_t + \frac{\log n}{\eta} + 4\eta(Q + 1). \quad (9)$$

We now proceed to bound $\sum_{t=1}^{T}(\mu_t - \lambda_t\mathbf{1})^2\cdot x_t$, bounding each term in the summation separately. For any $t \le \frac{\log n}{\eta}$, we simply bound $(\mu_t(i) - \lambda_t)^2 \le 2$, and hence $(\mu_t - \lambda_t\mathbf{1})^2\cdot x_t \le 2$.

Now let $t > \frac{\log n}{\eta}$. For convenience of notation, we drop the subscript $t$ from $x_t(i)$ and write $x(i)$. We have
$$(\mu_t - \lambda_t\mathbf{1})^2\cdot x = \sum_{i:\,\mu_t(i)\le\lambda_t}(\mu_t(i) - \lambda_t)^2 x(i) + \sum_{i:\,\mu_t(i)>\lambda_t}(\mu_t(i) - \lambda_t)^2 x(i)$$
$$\le \left(\frac{4\eta}{t}Q_t(i^*_t)\right)^2\sum_{i:\,\mu_t(i)\le\lambda_t} x(i) + \sum_{i:\,\mu_t(i)>\lambda_t}(\mu_t(i) - \lambda_t)^2 x(i) \quad (10)$$
$$\le \left(\frac{4\eta}{t}Q_t(i^*_t)\right)^2 + \sum_{i:\,\mu_t(i)>\lambda_t}(\mu_t(i) - \lambda_t)^2 x(i). \quad (11)$$
Here, (10) follows because when $\mu_t(i) \le \lambda_t = \mu_t(i^*_t) + \frac{4\eta}{t}Q_t(i^*_t)$, we have $|\mu_t(i) - \lambda_t| \le \frac{4\eta}{t}Q_t(i^*_t)$, since $\mu_t(i) \ge \mu_t(i^*_t)$.
We now bound each term of (11) separately. The proof of the following lemma is a straightforward calculation, and we defer it to after the present proof.

Lemma 13 The first term of (11), summed over all $t$, can be bounded as
$$\sum_{t=1}^{T}\left(\frac{4\eta}{t}Q_t(i^*_t)\right)^2 \le 32\eta^2 Q.$$

The hard part is to bound the second term of (11). We now proceed to do so. The intuition in the following analysis is that the Variation MW algorithm tends to concentrate exponentially high weight on the experts that have low cost. Let $I$ be the index set of all $i$ such that $\mu_t(i) > \lambda_t$. Note that $i^*_t \notin I$. Now, we have $x(i) \propto \exp(-\eta t\mu_t(i) - 4\eta^2 Q_t(i))$, and thus $x(i^*_t) \propto \exp(-\eta t\lambda_t)$. Thus, $x(i)$ can be written as
$$x(i) = \frac{\exp(-\eta t\mu_t(i) - 4\eta^2 Q_t(i))}{\exp(-\eta t\lambda_t) + \sum_{j\ne i^*_t}\exp(-\eta t\mu_t(j) - 4\eta^2 Q_t(j))} = \frac{\beta(i)\exp(-\eta t(\mu_t(i) - \lambda_t))}{1 + \sum_{j\ne i^*_t}\beta(j)\exp(-\eta t(\mu_t(j) - \lambda_t))},$$
where $\beta(i) = \exp(-4\eta^2 Q_t(i))$. Note that all $\beta(i) \in (0, 1]$. Define, for all $i$, $d(i) = \mu_t(i) - \lambda_t$. Note that for $i \in I$, $d(i) \in [0, 1]$. Thus, we have
$$\sum_{i\in I} d(i)^2 x(i) = \sum_{i\in I}\frac{\beta(i)d(i)^2\exp(-\eta t d(i))}{1 + \sum_{j\ne i^*_t}\beta(j)\exp(-\eta t d(j))}.$$
To upper bound $\sum_{i\in I} d(i)^2 x(i)$, we can neglect the factors in the denominator which depend on $i \notin I\cup\{i^*_t\}$; this only increases the value. Let $d_I$ and $\beta_I$ be the vectors $d$ and $\beta$ restricted to the index set $I$. Define the function $h : (0, 1]^{|I|}\times[0, 1]^{|I|}\to\mathbb{R}$ as
$$h(\beta_I, d_I) = \sum_{i\in I}\frac{\beta(i)d(i)^2\exp(-\eta t d(i))}{1 + \sum_{j\in I}\beta(j)\exp(-\eta t d(j))}.$$
The maximum value of this function over its domain gives an upper bound on the expression above.

Lemma 14 For $t > \frac{\log n}{\eta}$, and for any $(\beta_I, d_I) \in (0, 1]^{|I|}\times[0, 1]^{|I|}$, we have
$$h(\beta_I, d_I) \le \frac{2\log^2 n}{\eta^2 t^2}.$$

Putting Lemmas 13 and 14 together, we have
$$\sum_{t=1}^{T}(\mu_t - \lambda_t\mathbf{1})^2\cdot x_t \le \sum_{t\le\frac{\log n}{\eta}} 2 + \sum_{t=1}^{T}\left(\frac{4\eta}{t}Q_t(i^*_t)\right)^2 + \sum_{t>\frac{\log n}{\eta}}\frac{2\log^2 n}{\eta^2 t^2} \le \frac{2\log n}{\eta} + 32\eta^2 Q + \frac{2\log n}{\eta} = \frac{4\log n}{\eta} + 32\eta^2 Q.$$
Plugging this bound into (9), we get
$$\mathrm{Regret} \le \frac{\log n}{\eta} + 64\eta^3 Q + 8\log(n) + 4\eta(Q + 1).$$
Now, if we set $\eta = \min\{\sqrt{\log n/4Q},\, 1/10\}$, we get that the regret is bounded by
$$\mathrm{Regret} \le 4\sqrt{Q\log n} + 10\log(n).$$

Again, it may not be possible to obtain an upper bound on $Q^{\max}_T$ a priori, but we can use the same $\eta$-halving idea (start with $\eta = 1/10$, and halve $\eta$ as soon as this maximum quadruples, restarting the algorithm) to get regret bounded by
$$\mathrm{Regret} \le O\left(\sqrt{Q^{\max}_T\log(n)} + \log(Q^{\max}_T)\log(n)\right).$$
The details of this bound are standard and are hence omitted from this extended abstract.

We now give the omitted proofs of Lemmas 12, 13, and 14.
Proof: (Lemma 12) We have:
$$g_t(i)^2 = \left(f_t(i) - \lambda_t + 4\eta(f_t(i) - \mu_t(i))^2\right)^2 = (f_t(i) - \lambda_t)^2 + 8\eta(f_t(i) - \lambda_t)(f_t(i) - \mu_t(i))^2 + 16\eta^2(f_t(i) - \mu_t(i))^4$$
$$\le (f_t(i) - \lambda_t)^2 + (16\eta + 16\eta^2)(f_t(i) - \mu_t(i))^2 \quad (12)$$
$$\le 2(\mu_t(i) - \lambda_t)^2 + (2 + 16\eta + 16\eta^2)(f_t(i) - \mu_t(i))^2 \quad (13)$$
$$\le 2(\mu_t(i) - \lambda_t)^2 + 4(f_t(i) - \mu_t(i))^2. \quad (14)$$
Here, inequality (12) follows because $|f_t(i) - \mu_t(i)| \le 1$ for any $i$ and $|f_t(i) - \lambda_t| \le 2$; inequality (13) follows from the fact that $(a + b)^2 \le 2a^2 + 2b^2$ for any real numbers $a, b$; and inequality (14) follows since $16\eta + 16\eta^2 \le 2$ if $\eta \le 1/10$. The lemma follows.

Proof: (Lemma 13) Note that for $t \le Q$, $Q_t(i^*_t) = \sum_{\tau=1}^{t-1}(f_\tau(i^*_t) - \mu_\tau(i^*_t))^2 \le t$, and for $t > Q$, $Q_t(i^*_t) \le Q$. Thus we have
$$\sum_{t=1}^{T}\left(\frac{4\eta}{t}Q_t(i^*_t)\right)^2 \le \sum_{t\le Q} 16\eta^2\cdot 1 + \sum_{t>Q} 16\eta^2\frac{Q^2}{t^2} \le 32\eta^2 Q.$$

Proof: (Lemma 14) Let $S = \{i : d(i) \le \frac{\log n}{\eta t}\}$, and let $\bar S = I\setminus S$. We upper bound $h(\beta_I, d_I)$ as follows:
$$h(\beta_I, d_I) \le \frac{\sum_{i\in S}\beta(i)d(i)^2\exp(-\eta t d(i))}{\sum_{j\in S}\beta(j)\exp(-\eta t d(j))} + \sum_{i\in\bar S}\beta(i)d(i)^2\exp(-\eta t d(i))$$
$$\le \max_{i\in S}\frac{\beta(i)d(i)^2\exp(-\eta t d(i))}{\beta(i)\exp(-\eta t d(i))} + \sum_{i\in\bar S}\frac{\log^2 n}{\eta^2 t^2}\exp(-\log n) \quad (15),(16)$$
$$\le \frac{\log^2 n}{\eta^2 t^2} + \frac{\log^2 n}{\eta^2 t^2} = \frac{2\log^2 n}{\eta^2 t^2}.$$
In (15) we used the inequality $\frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i} \le \max_{i\le n}\frac{a_i}{b_i}$ for positive reals $a_i$ and $b_i$. In (16), we used the following facts: (a) $\beta(i) \le 1$, and (b) the function $x^2\exp(-\eta tx)$ has a negative derivative (and is thus decreasing) when $x > \frac{2}{\eta t}$, and thus its maximum over the range $[\frac{\log n}{\eta t}, 1]$ is attained at $\frac{\log n}{\eta t}$. The final inequality uses $|\bar S| \le n$ and $e^{-\log n} = 1/n$.

6 Conclusions and Future Work

In this paper, we investigated the possibility of bounding the regret of online learning algorithms by terms which depend on the variation of the cost sequence, rather than the number of prediction rounds. We analyzed two algorithms, Lazy Projection and Variation MW, and showed that these algorithms attain variation-bounded regret. Such bounds are significant not only because they show that it is possible to suffer much less regret than previously believed when the cost sequence is particularly benign, but also because they match the regret bounds of natural regret minimizing algorithms in the stochastic setting of independent cost functions drawn from a fixed distribution.

We believe that this work opens up many new directions for future research, all related to bounding the regret in terms of the variation of the cost sequence in the various scenarios in which regret minimizing algorithms have been devised: bandit settings, strictly convex cost functions, online convex optimization, and so on. We conjecture that in all such scenarios it is possible to get variation-bounded regret. Specifically, we conjecture that any dependence on $T$, the number of prediction rounds, in the regret bound can be replaced by the same dependence on the variation of the cost sequence. In some scenarios, such as online convex optimization, the variation itself needs to be defined carefully, since there is no single natural or obvious definition.

Acknowledgements

We thank Martin Zinkevich for initial discussions on the possibility of variation bounds on the regret.

References

[1] Chamy Allenberg-Neeman and Benny Neeman. Full information game with gains and losses. In 15th International Conference on Algorithmic Learning Theory, 2004.
[2] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48-77, 2003.
[3] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[4] Nicolò Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. Mach. Learn., 66(2-3):321-352, 2007.
[5] T. Cover. Universal portfolios. Math. Finance, 1:1-19, 1991.
[6] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119-139, 1997.
[7] James Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. W. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume III, pages 97-139, 1957.
[8] D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Relative loss bounds for single neurons. IEEE Transactions on Neural Networks, 10(6):1291-1304, November 1999.
[9] Mark Herbster and Manfred K. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281-309, 2001.
[10] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291-307, 2005.
[11] Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput., 132(1):1-63, 1997.
[12] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994.
[13] V. Vovk. A game of prediction with expert advice. J. Comput. Syst. Sci., 56(2):153-173, 1998.
[14] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928-936, 2003.