Mind the Duality Gap: Logarithmic regret algorithms for online optimization

Sham M. Kakade
Toyota Technological Institute at Chicago
sham@tti-c.org

Shai Shalev-Shwartz
Toyota Technological Institute at Chicago
shai@tti-c.org

Abstract

We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework for generalized strongly convex functions.

1 Introduction

In recent years, online regret minimizing algorithms have become widely used and empirically successful algorithms for many machine learning problems. Notable examples include efficient learning algorithms for structured prediction and ranking problems [Collins, 2002, Crammer et al., 2006]. Most of these empirically successful algorithms are based on algorithms which are tailored to general convex functions, whose regret is $O(\sqrt{T})$. Rather recently, there is a growing body of work providing online algorithms for strongly convex loss functions, with regret guarantees that are only $O(\log T)$. These algorithms have the potential to be highly applicable since many machine learning optimization problems are in fact strongly convex -- either with strongly convex loss functions (e.g. log loss, square loss) or, indirectly, via strongly convex regularizers (e.g. $L_2$ or KL-based regularization). Note that in this latter case, the loss function itself may only be just convex, but a strongly convex regularizer effectively makes this a strongly convex optimization problem (e.g. the SVM optimization problem uses the hinge loss with $L_2$ regularization). The aim of this paper is to provide a template for deriving a wider class of regret-minimizing algorithms for online strongly convex programming.

Online convex optimization takes place in a sequence of consecutive rounds. At each round, the learner predicts a vector $\mathbf{w}_t \in S \subseteq \mathbb{R}^n$, and the environment responds with a convex loss function, $\ell_t : S \to \mathbb{R}$. The goal of the learner is to minimize the difference between his cumulative loss and the cumulative loss of the optimal fixed vector, $\sum_{t=1}^T \ell_t(\mathbf{w}_t) - \min_{\mathbf{w} \in S} \sum_{t=1}^T \ell_t(\mathbf{w})$. This is termed `regret' since it measures how `sorry' the learner is, in retrospect, not to have predicted the optimal vector. A minimal simulation of this protocol is sketched below.
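To make the round structure concrete, the following Python sketch (our own illustration, not part of the paper's development) runs the online protocol and computes the regret against a fixed comparator; the lazy learner, the squared losses, and the comparator are placeholder choices a user would replace.

```python
import numpy as np

def online_protocol(predict, update, loss_fns, comparator):
    """Run the online convex optimization protocol and return the regret
    against a fixed comparator vector (assumed computed offline)."""
    cumulative = 0.0
    for loss in loss_fns:
        w_t = predict()           # learner commits to w_t
        cumulative += loss(w_t)   # environment reveals the convex loss
        update(loss)              # learner may now exploit the loss function
    return cumulative - sum(loss(comparator) for loss in loss_fns)

# Tiny demonstration: a learner that never moves from 0, squared losses
# l_t(w) = ||w - mu_t||^2, and the hindsight-optimal comparator (the mean).
mus = np.random.default_rng(0).normal(size=(100, 2))
state = {"w": np.zeros(2)}
regret = online_protocol(
    predict=lambda: state["w"],
    update=lambda loss: None,     # this (deliberately bad) learner ignores feedback
    loss_fns=[lambda w, mu=mu: float(np.sum((w - mu) ** 2)) for mu in mus],
    comparator=mus.mean(axis=0),
)
```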
Roughly speaking, the family of regret minimizing algorithms (for general convex functions) can be seen as varying on two axes, the `style' and the `aggressiveness' of the update. In addition to online algorithms' relative simplicity, their empirical successes are also due to having these two knobs to tune for the problem at hand (which determine the nature of the regret bound). By style, we mean updates which favor either rotational invariance (such as gradient descent like update rules) or sparsity (like the multiplicative updates). Of course there is a much richer family here, including the $L_p$ updates. By the aggressiveness of the update, we mean how much the algorithm moves its decision to be consistent with the most recent loss functions. For example, the perceptron algorithm makes no update when there is no error. In contrast, there is a family of algorithms which update more aggressively when there is a margin mistake. These algorithms are shown to have improved performance (see for example the experimental study in Shalev-Shwartz and Singer [2007b]).

While historically much of the analysis of these algorithms has been done on a case by case basis, in retrospect, the proof techniques have become somewhat boilerplate, which has led to a growing body of work to unify these analyses (see Cesa-Bianchi and Lugosi [2006] for a review). Perhaps the most unified view of these algorithms is the `primal-dual' framework of Shalev-Shwartz and Singer [2006], Shalev-Shwartz [2007], of which the gamut of these algorithms can be largely viewed as special cases. Two aspects are central in providing this unification. First, the framework works with a complexity function, which determines the style of algorithm and the nature of the regret guarantee (if this function is the $L_2$ norm, then one obtains gradient like updates, and if this function is the KL distance, then one obtains multiplicative updates). Second, the algorithm maintains both "primal" and "dual" variables. Here, the primal objective function is $\sum_{t=1}^T \ell_t(\mathbf{w})$ (where $\ell_t$ is the loss function provided at round $t$), and one can construct a dual objective function $\mathcal{D}_t(\cdot)$, which only depends on the loss functions $\ell_1, \ell_2, \ldots, \ell_{t-1}$. The algorithm works by incrementally increasing the dual objective value (in an online manner), which can be done since each $\mathcal{D}_t$ is only a function of the previous loss functions. By weak duality, this can be seen as decreasing the duality gap. The level of aggressiveness is seen to be how fast the algorithm is attempting to increase the dual objective value.

This paper focuses on extending the duality framework for online convex programming to the case of strongly convex functions. This analysis provides a more unified and intuitive view of the extant algorithms for online strongly convex programming. An important observation we make is that any $\sigma$-strongly convex loss function can be rewritten as $\ell_i(\mathbf{w}) = \sigma f(\mathbf{w}) + g_i(\mathbf{w})$, where $f$ is a fixed strongly convex function (i.e. $f$ does not depend on $i$) and $g_i$ is a convex function (a concrete decomposition is worked out at the end of this section). Therefore, after $t$ online rounds, the amount of intrinsic strong convexity we have in the primal objective $\sum_{i=1}^t \ell_i(\mathbf{w})$ is at least $\sigma t$. In particular, this explains the learning rate of $\frac{1}{\sigma t}$ proposed in the gradient descent algorithm of Hazan et al. [2006]. Indeed, we show that our framework includes the gradient descent algorithm of Hazan et al. [2006] as an important special case, in which the aggressiveness level is minimal. At the most aggressive end, our framework yields the Follow-The-Leader algorithm. Furthermore, the template algorithm serves as a vehicle for deriving new algorithms (which enjoy logarithmic regret guarantees).

The remainder of the paper is outlined as follows. We first provide background on convex duality. As a warmup, in Section 3, we present an intuitive primal-dual analysis of Follow-The-Leader (FTL), when $f$ is the Euclidean norm. This naturally leads to a more general primal-dual algorithm (for which FTL is a special case), which we present in Section 4. Next, we further generalize our algorithmic framework to include strongly convex complexity functions $f$ with respect to arbitrary norms $\|\cdot\|$.
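As a worked instance of the decomposition above (our own illustration), the squared loss splits into a fixed Euclidean complexity term and a residual linear (hence convex) part:
\[
\ell_i(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{w} - \boldsymbol{\mu}_i\|_2^2
= \underbrace{\tfrac{1}{2}\|\mathbf{w}\|_2^2}_{\sigma f(\mathbf{w}),\ \sigma = 1}
+ \underbrace{\big(-\langle \mathbf{w}, \boldsymbol{\mu}_i \rangle + \tfrac{1}{2}\|\boldsymbol{\mu}_i\|_2^2\big)}_{g_i(\mathbf{w})} ,
\]
so after $t$ rounds the primal objective contains the term $\tfrac{t}{2}\|\mathbf{w}\|_2^2$, i.e. total strong convexity $t\sigma$ with $\sigma = 1$, matching the $\frac{1}{\sigma t}$ learning rate. Similarly, the regularized hinge loss $\frac{\sigma}{2}\|\mathbf{w}\|_2^2 + \max(0, 1 - y_i\langle \mathbf{x}_i, \mathbf{w}\rangle)$ decomposes with the same $f$ and with $g_i$ equal to the hinge term.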
We note that the introduction of a complexity function was already provided in Shalev-Shwartz and Singer [2007a], but the analysis there is rather specialized and does not have a knob which can tune the aggressiveness of the algorithm. Finally, in Sec. 6 we conclude with a side-by-side comparison of our algorithmic framework for strongly convex functions and the framework for (non-strongly) convex functions given in Shalev-Shwartz [2007].

2 Mathematical Background

We denote scalars with lower case letters (e.g. $w$ and $\lambda$), and vectors with bold face letters (e.g. $\mathbf{w}$ and $\boldsymbol{\lambda}$). The inner product between vectors $\mathbf{x}$ and $\mathbf{w}$ is denoted by $\langle \mathbf{x}, \mathbf{w} \rangle$. To simplify our notation, given a sequence of vectors $\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_t$ or a sequence of scalars $\sigma_1, \ldots, \sigma_t$, we use the shorthand $\boldsymbol{\lambda}_{1:t} = \sum_{i=1}^t \boldsymbol{\lambda}_i$ and $\sigma_{1:t} = \sum_{i=1}^t \sigma_i$. Sets are designated by upper case letters (e.g. $S$). The set of non-negative real numbers is denoted by $\mathbb{R}_+$. For any $k \geq 1$, the set of integers $\{1, \ldots, k\}$ is denoted by $[k]$. A norm of a vector $\mathbf{x}$ is denoted by $\|\mathbf{x}\|$. The dual norm is defined as $\|\boldsymbol{\lambda}\|_* = \sup\{\langle \mathbf{x}, \boldsymbol{\lambda} \rangle : \|\mathbf{x}\| \leq 1\}$. For example, the Euclidean norm, $\|\mathbf{x}\|_2 = \langle \mathbf{x}, \mathbf{x} \rangle^{1/2}$, is dual to itself, and the $L_1$ norm, $\|\mathbf{x}\|_1 = \sum_i |x_i|$, is dual to the $L_\infty$ norm, $\|\mathbf{x}\|_\infty = \max_i |x_i|$.

FOR $t = 1, 2, \ldots, T$:
  Define $\mathbf{w}_t = -\frac{1}{\sigma_{1:(t-1)}}\,\boldsymbol{\lambda}^t_{1:(t-1)}$
  Receive a function $\ell_t(\mathbf{w}) = \frac{\sigma_t}{2}\|\mathbf{w}\|^2 + g_t(\mathbf{w})$ and suffer loss $\ell_t(\mathbf{w}_t)$
  Update $\boldsymbol{\lambda}^{t+1}_1, \ldots, \boldsymbol{\lambda}^{t+1}_t$ s.t. the following holds:
    $(\boldsymbol{\lambda}^{t+1}_1, \ldots, \boldsymbol{\lambda}^{t+1}_t) \in \operatorname{argmax}_{\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_t} \mathcal{D}_{t+1}(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_t)$

Figure 1: A primal-dual view of Follow-the-Leader. Here the algorithm's decision $\mathbf{w}_t$ is the best decision with respect to the previous losses. This presentation exposes the implicit role of the dual variables. Slightly abusing notation, $\boldsymbol{\lambda}_{1:0} = \mathbf{0}$, so that $\mathbf{w}_1 = \mathbf{0}$. See text.

We next recall a few definitions from convex analysis. A function $f$ is $\sigma$-strongly convex if
\[
f(\alpha \mathbf{u} + (1-\alpha)\mathbf{v}) \;\leq\; \alpha f(\mathbf{u}) + (1-\alpha) f(\mathbf{v}) - \frac{\sigma}{2}\,\alpha(1-\alpha)\,\|\mathbf{u} - \mathbf{v}\|_2^2 .
\]
In Sec. 5 we generalize the above definition to arbitrary norms. If a function $f$ is $\sigma$-strongly convex then the function $g(\mathbf{w}) = f(\mathbf{w}) - \frac{\sigma}{2}\|\mathbf{w}\|_2^2$ is convex.

The Fenchel conjugate of a function $f : S \to \mathbb{R}$ is defined as
\[
f^*(\boldsymbol{\lambda}) = \sup_{\mathbf{w} \in S} \; \langle \mathbf{w}, \boldsymbol{\lambda} \rangle - f(\mathbf{w}) .
\]
If $f$ is closed and convex, then the Fenchel conjugate of $f^*$ is $f$ itself (a function is closed if for all $\alpha > 0$ the level set $\{\mathbf{w} : f(\mathbf{w}) \leq \alpha\}$ is a closed set). It is straightforward to verify that the function $\frac{1}{2}\|\mathbf{w}\|_2^2$ is conjugate to itself. The definition of $f^*$ also implies that for $c > 0$ we have $(c f)^*(\boldsymbol{\lambda}) = c\, f^*(\boldsymbol{\lambda}/c)$.

A vector $\boldsymbol{\lambda}$ is a sub-gradient of a function $f$ at $\mathbf{w}$ if for all $\mathbf{w}' \in S$, we have that $f(\mathbf{w}') - f(\mathbf{w}) \geq \langle \mathbf{w}' - \mathbf{w}, \boldsymbol{\lambda} \rangle$. The differential set of $f$ at $\mathbf{w}$, denoted $\partial f(\mathbf{w})$, is the set of all sub-gradients of $f$ at $\mathbf{w}$. If $f$ is differentiable at $\mathbf{w}$, then $\partial f(\mathbf{w})$ consists of a single vector which amounts to the gradient of $f$ at $\mathbf{w}$ and is denoted by $\nabla f(\mathbf{w})$. The Fenchel-Young inequality states that for any $\mathbf{w}$ and $\boldsymbol{\lambda}$ we have that $f(\mathbf{w}) + f^*(\boldsymbol{\lambda}) \geq \langle \mathbf{w}, \boldsymbol{\lambda} \rangle$. Sub-gradients play an important role in the definition of the Fenchel conjugate. In particular, the following lemma, whose proof can be found in Borwein and Lewis [2006], states that if $\boldsymbol{\lambda} \in \partial f(\mathbf{w})$ then the Fenchel-Young inequality holds with equality.

Lemma 1 Let $f$ be a closed and convex function and let $\partial f(\mathbf{w}')$ be its differential set at $\mathbf{w}'$. Then, for all $\boldsymbol{\lambda}' \in \partial f(\mathbf{w}')$, we have $f(\mathbf{w}') + f^*(\boldsymbol{\lambda}') = \langle \boldsymbol{\lambda}', \mathbf{w}' \rangle$.

We make use of the following variant of Fenchel duality (see the appendix for more details):
\[
\max_{\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T} \; -f^*\Big(-\sum_{t=1}^T \boldsymbol{\lambda}_t\Big) - \sum_{t=1}^T g_t^*(\boldsymbol{\lambda}_t) \;\leq\; \min_{\mathbf{w}} \; f(\mathbf{w}) + \sum_{t=1}^T g_t(\mathbf{w}) . \qquad (1)
\]
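The following small Python check (ours, not from the paper) verifies two facts used above for $f(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$: that $f$ is its own conjugate, and that the Fenchel-Young inequality holds with equality at $\boldsymbol{\lambda} = \nabla f(\mathbf{w}) = \mathbf{w}$ (Lemma 1). The conjugate is evaluated by numerical maximization, so equality holds up to solver tolerance.

```python
import numpy as np
from scipy.optimize import minimize

def f(w):
    # The complexity function f(w) = 0.5 * ||w||_2^2.
    return 0.5 * np.dot(w, w)

def conjugate(f, lam, dim):
    # Numerically evaluate f*(lam) = sup_w <w, lam> - f(w)
    # (an unconstrained concave maximization, done here via BFGS).
    res = minimize(lambda w: f(w) - np.dot(w, lam), np.zeros(dim))
    return -res.fun

rng = np.random.default_rng(0)
lam, w = rng.normal(size=5), rng.normal(size=5)

# f is conjugate to itself: f*(lam) == f(lam).
assert abs(conjugate(f, lam, 5) - f(lam)) < 1e-6

# Fenchel-Young: f(w) + f*(lam) >= <w, lam>, with equality at lam = grad f(w) = w.
assert abs(f(w) + conjugate(f, w, 5) - np.dot(w, w)) < 1e-6
```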
3 Warmup: A Primal-Dual View of Follow-The-Leader

In this section, we provide a dual analysis for the FTL algorithm. The dual view of FTL will help us to derive a family of logarithmic regret algorithms for online convex optimization with strongly convex functions. Recall that the FTL algorithm is defined as follows:
\[
\mathbf{w}_t = \operatorname{argmin}_{\mathbf{w}} \sum_{i=1}^{t-1} \ell_i(\mathbf{w}) . \qquad (2)
\]
For each $i \in [t-1]$ define $g_i(\mathbf{w}) = \ell_i(\mathbf{w}) - \frac{\sigma_i}{2}\|\mathbf{w}\|^2$, where $\sigma_i$ is the largest scalar such that $g_i$ is still a convex function. The assumption that $\ell_i$ is $\sigma$-strongly convex guarantees that $\sigma_i \geq \sigma$. We can therefore rewrite the objective function on the right-hand side of Eq. (2) as
\[
\mathcal{P}_t(\mathbf{w}) = \frac{\sigma_{1:(t-1)}}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{t-1} g_i(\mathbf{w}) \qquad (3)
\]
(recall that $\sigma_{1:(t-1)} = \sum_{i=1}^{t-1} \sigma_i$). The Fenchel dual optimization problem (see Sec. 2) is to maximize the following dual objective function:
\[
\mathcal{D}_t(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_{t-1}) = -\frac{1}{2\,\sigma_{1:(t-1)}}\,\big\|\boldsymbol{\lambda}_{1:(t-1)}\big\|^2 - \sum_{i=1}^{t-1} g_i^*(\boldsymbol{\lambda}_i) . \qquad (4)
\]
Let $(\boldsymbol{\lambda}^t_1, \ldots, \boldsymbol{\lambda}^t_{t-1})$ be the maximizer of $\mathcal{D}_t$. The relation between the optimal dual variables and the optimal primal vector is given by (see again Sec. 2)
\[
\mathbf{w}_t = -\frac{1}{\sigma_{1:(t-1)}}\,\boldsymbol{\lambda}^t_{1:(t-1)} . \qquad (5)
\]
Throughout this section we assume that strong duality holds (i.e. Eq. (1) holds with equality). See the appendix for sufficient conditions. In particular, under this assumption, we have that the above setting for $\mathbf{w}_t$ is in fact a minimizer of the primal objective, since $(\boldsymbol{\lambda}^t_1, \ldots, \boldsymbol{\lambda}^t_{t-1})$ maximizes the dual objective (see the appendix). The primal-dual view of Follow-the-Leader is presented in Figure 1. Denote
\[
\Delta_t = \mathcal{D}_{t+1}(\boldsymbol{\lambda}^{t+1}_1, \ldots, \boldsymbol{\lambda}^{t+1}_t) - \mathcal{D}_t(\boldsymbol{\lambda}^t_1, \ldots, \boldsymbol{\lambda}^t_{t-1}) . \qquad (6)
\]
To analyze the FTL algorithm, we first note that (by strong duality)
\[
\sum_{t=1}^T \Delta_t = \mathcal{D}_{T+1}(\boldsymbol{\lambda}^{T+1}_1, \ldots, \boldsymbol{\lambda}^{T+1}_T) = \min_{\mathbf{w}} \mathcal{P}_{T+1}(\mathbf{w}) = \min_{\mathbf{w}} \sum_{t=1}^T \ell_t(\mathbf{w}) . \qquad (7)
\]
Second, the fact that $(\boldsymbol{\lambda}^{t+1}_1, \ldots, \boldsymbol{\lambda}^{t+1}_t)$ is the maximizer of $\mathcal{D}_{t+1}$ implies that for any $\boldsymbol{\lambda}$ we have
\[
\Delta_t \;\geq\; \mathcal{D}_{t+1}(\boldsymbol{\lambda}^t_1, \ldots, \boldsymbol{\lambda}^t_{t-1}, \boldsymbol{\lambda}) - \mathcal{D}_t(\boldsymbol{\lambda}^t_1, \ldots, \boldsymbol{\lambda}^t_{t-1}) . \qquad (8)
\]
The following central lemma shows that there exists $\boldsymbol{\lambda}$ such that the right-hand side of the above is sufficiently large.

Lemma 2 Let $(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_{t-1})$ be an arbitrary sequence of vectors. Denote $\mathbf{w} = -\frac{1}{\sigma_{1:(t-1)}}\boldsymbol{\lambda}_{1:(t-1)}$, let $\mathbf{v} \in \partial \ell_t(\mathbf{w})$, and let $\boldsymbol{\lambda} = \mathbf{v} - \sigma_t \mathbf{w}$. Then, $\boldsymbol{\lambda} \in \partial g_t(\mathbf{w})$ and
\[
\mathcal{D}_{t+1}(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_{t-1}, \boldsymbol{\lambda}) - \mathcal{D}_t(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_{t-1}) = \ell_t(\mathbf{w}) - \frac{\|\mathbf{v}\|^2}{2\,\sigma_{1:t}} .
\]

Proof We prove the lemma for the case $t > 1$. The case $t = 1$ can be proved similarly. Since $\ell_t(\mathbf{w}) = \frac{\sigma_t}{2}\|\mathbf{w}\|^2 + g_t(\mathbf{w})$ and $\mathbf{v} \in \partial \ell_t(\mathbf{w})$, we have that $\boldsymbol{\lambda} \in \partial g_t(\mathbf{w})$. Denote $\bar{\Delta}_t = \mathcal{D}_{t+1}(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_{t-1}, \boldsymbol{\lambda}) - \mathcal{D}_t(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_{t-1})$. Simple algebraic manipulations, using $\boldsymbol{\lambda}_{1:(t-1)} = -\sigma_{1:(t-1)}\mathbf{w}$, yield
\[
\begin{aligned}
\bar{\Delta}_t &= -\frac{1}{2\,\sigma_{1:t}}\big\|\boldsymbol{\lambda}_{1:(t-1)} + \boldsymbol{\lambda}\big\|^2 + \frac{1}{2\,\sigma_{1:(t-1)}}\big\|\boldsymbol{\lambda}_{1:(t-1)}\big\|^2 - g_t^*(\boldsymbol{\lambda}) \\
&= \frac{\sigma_{1:(t-1)}}{2}\Big(1 - \frac{\sigma_{1:(t-1)}}{\sigma_{1:t}}\Big)\|\mathbf{w}\|^2 + \frac{\sigma_{1:(t-1)}}{\sigma_{1:t}}\,\langle \mathbf{w}, \boldsymbol{\lambda} \rangle - \frac{\|\boldsymbol{\lambda}\|^2}{2\,\sigma_{1:t}} - g_t^*(\boldsymbol{\lambda}) \\
&= \underbrace{\frac{\sigma_t}{2}\|\mathbf{w}\|^2 + \langle \mathbf{w}, \boldsymbol{\lambda} \rangle - g_t^*(\boldsymbol{\lambda})}_{A} \;-\; \underbrace{\frac{\sigma_t^2 \|\mathbf{w}\|^2 + 2\,\sigma_t \langle \mathbf{w}, \boldsymbol{\lambda} \rangle + \|\boldsymbol{\lambda}\|^2}{2\,\sigma_{1:t}}}_{B} ,
\end{aligned}
\]
where the last step uses $\sigma_{1:(t-1)} = \sigma_{1:t} - \sigma_t$. Since $\boldsymbol{\lambda} \in \partial g_t(\mathbf{w})$, Lemma 1 implies that $\langle \mathbf{w}, \boldsymbol{\lambda} \rangle - g_t^*(\boldsymbol{\lambda}) = g_t(\mathbf{w})$. Therefore, $A = \ell_t(\mathbf{w})$. Next, we note that $B = \frac{\|\boldsymbol{\lambda} + \sigma_t \mathbf{w}\|^2}{2\,\sigma_{1:t}}$. We have thus shown that $\bar{\Delta}_t = \ell_t(\mathbf{w}) - \frac{\|\boldsymbol{\lambda} + \sigma_t \mathbf{w}\|^2}{2\,\sigma_{1:t}}$. Plugging the definition of $\boldsymbol{\lambda}$ into the above, we conclude our proof.

Combining Lemma 2 with Eq. (7) and Eq. (8) we obtain the following corollary.

Corollary 1 Let $\ell_1, \ldots, \ell_T$ be a sequence of functions such that for all $t \in [T]$, $\ell_t$ is $\sigma_t$-strongly convex. Assume that the FTL algorithm runs on this sequence and for each $t \in [T]$, let $\mathbf{v}_t$ be in $\partial \ell_t(\mathbf{w}_t)$. Then,
\[
\sum_{t=1}^T \ell_t(\mathbf{w}_t) - \min_{\mathbf{w}} \sum_{t=1}^T \ell_t(\mathbf{w}) \;\leq\; \frac{1}{2}\sum_{t=1}^T \frac{\|\mathbf{v}_t\|^2}{\sigma_{1:t}} . \qquad (9)
\]
Furthermore, let $L = \max_t \|\mathbf{v}_t\|$ and assume that for all $t \in [T]$, $\sigma_t \geq \sigma$. Then, the regret is bounded by $\frac{L^2}{2\sigma}(\log(T) + 1)$.

If we are dealing with the square loss $\ell_t(\mathbf{w}) = \|\mathbf{w} - \boldsymbol{\mu}_t\|_2^2$ (where nature chooses $\boldsymbol{\mu}_t$), then it is straightforward to see that Eq. (8) holds with equality, and this leads to the previous regret bound holding with equality. This equality is the underlying reason that the FTL strategy is a minimax strategy (see Abernethy et al. [2008] for a proof of this claim). A numerical illustration of this equality is given below.
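The following short Python simulation (our own sanity check, not from the paper) runs FTL on squared losses $\ell_t(\mathbf{w}) = \|\mathbf{w} - \boldsymbol{\mu}_t\|_2^2$, so that $\sigma_t = 2$ and $\mathbf{v}_t = 2(\mathbf{w}_t - \boldsymbol{\mu}_t)$, and compares the realized regret with the bound of Eq. (9); for this loss the two quantities should agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 200, 3
mus = rng.normal(size=(T, n))          # nature's choices mu_t

w = np.zeros(n)                        # FTL prediction, w_1 = 0
cum_mu, bound, losses = np.zeros(n), 0.0, []
for t in range(1, T + 1):
    mu = mus[t - 1]
    losses.append(np.sum((w - mu) ** 2))
    v = 2.0 * (w - mu)                           # v_t in dl_t(w_t)
    bound += np.sum(v ** 2) / (2.0 * (2.0 * t))  # ||v_t||^2 / (2 sigma_{1:t})
    cum_mu += mu
    w = cum_mu / t                     # FTL: w_{t+1} minimizes the past losses (mean)

best_w = mus.mean(axis=0)              # offline minimizer of the cumulative loss
regret = sum(losses) - np.sum((best_w - mus) ** 2)
print(regret, bound)                   # equal up to numerical error
```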
4 A Primal-Dual Algorithm for Online Strongly Convex Optimization

In the previous section, we provided a dual analysis for FTL. Here, we extend the analysis and derive a more general algorithmic framework for online optimization. We start by examining the analysis of the FTL algorithm. We first make the important observation that Lemma 2 is not specific to the FTL algorithm and in fact holds for any configuration of dual variables. Consider an arbitrary sequence of dual variables,
\[
(\boldsymbol{\lambda}^2_1),\; (\boldsymbol{\lambda}^3_1, \boldsymbol{\lambda}^3_2),\; \ldots,\; (\boldsymbol{\lambda}^{T+1}_1, \ldots, \boldsymbol{\lambda}^{T+1}_T) ,
\]
and denote $\Delta_t$ as in Eq. (6). Using weak duality, we can replace the equality in Eq. (7) with the following inequality that holds for any sequence of dual variables:
\[
\sum_{t=1}^T \Delta_t = \mathcal{D}_{T+1}(\boldsymbol{\lambda}^{T+1}_1, \ldots, \boldsymbol{\lambda}^{T+1}_T) \;\leq\; \min_{\mathbf{w}} \mathcal{P}_{T+1}(\mathbf{w}) = \min_{\mathbf{w}} \sum_{t=1}^T \ell_t(\mathbf{w}) . \qquad (10)
\]
A summary of the algorithmic framework is given in Fig. 2.

FOR $t = 1, 2, \ldots, T$:
  Define $\mathbf{w}_t = -\frac{1}{\sigma_{1:(t-1)}}\,\boldsymbol{\lambda}^t_{1:(t-1)}$
  Receive a function $\ell_t(\mathbf{w}) = \frac{\sigma_t}{2}\|\mathbf{w}\|^2 + g_t(\mathbf{w})$ and suffer loss $\ell_t(\mathbf{w}_t)$
  Update $\boldsymbol{\lambda}^{t+1}_1, \ldots, \boldsymbol{\lambda}^{t+1}_t$ s.t. the following holds for some $\boldsymbol{\lambda}_t \in \partial g_t(\mathbf{w}_t)$:
    $\mathcal{D}_{t+1}(\boldsymbol{\lambda}^{t+1}_1, \ldots, \boldsymbol{\lambda}^{t+1}_t) \geq \mathcal{D}_{t+1}(\boldsymbol{\lambda}^t_1, \ldots, \boldsymbol{\lambda}^t_{t-1}, \boldsymbol{\lambda}_t)$

Figure 2: A primal-dual algorithmic framework for online convex optimization. Again, $\mathbf{w}_1 = \mathbf{0}$.

The following theorem, a direct corollary of the previous equation and Lemma 2, shows that all instances of the framework achieve logarithmic regret.

Theorem 1 Let $\ell_1, \ldots, \ell_T$ be a sequence of functions such that for all $t \in [T]$, $\ell_t$ is $\sigma_t$-strongly convex. Then, any algorithm that can be derived from Fig. 2 satisfies
\[
\sum_{t=1}^T \ell_t(\mathbf{w}_t) - \min_{\mathbf{w}} \sum_{t=1}^T \ell_t(\mathbf{w}) \;\leq\; \frac{1}{2}\sum_{t=1}^T \frac{\|\mathbf{v}_t\|^2}{\sigma_{1:t}} , \qquad (11)
\]
where $\mathbf{v}_t \in \partial \ell_t(\mathbf{w}_t)$.

Proof Let $\Delta_t$ be as defined in Eq. (6). The last condition in the algorithm implies that
\[
\Delta_t \;\geq\; \mathcal{D}_{t+1}(\boldsymbol{\lambda}^t_1, \ldots, \boldsymbol{\lambda}^t_{t-1}, \mathbf{v}_t - \sigma_t \mathbf{w}_t) - \mathcal{D}_t(\boldsymbol{\lambda}^t_1, \ldots, \boldsymbol{\lambda}^t_{t-1}) .
\]
The proof follows directly by combining the above with Eq. (10) and Lemma 2.

We conclude this section by deriving several algorithms from the framework.

Example 1 (Follow-The-Leader) As we have shown in Sec. 3, the FTL algorithm (Fig. 1) is equivalent to optimizing the dual variables at each online round. This update clearly satisfies the condition in Fig. 2 and is therefore a special case.

Example 2 (Gradient-Descent) Following Hazan et al. [2006], Bartlett et al. [2007] suggested the following update rule for differentiable strongly convex functions:
\[
\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{1}{\sigma_{1:t}}\,\nabla \ell_t(\mathbf{w}_t) . \qquad (12)
\]
Using a simple inductive argument, it is possible to show that the above update rule is equivalent to the following update rule of the dual variables:
\[
(\boldsymbol{\lambda}^{t+1}_1, \ldots, \boldsymbol{\lambda}^{t+1}_t) = (\boldsymbol{\lambda}^t_1, \ldots, \boldsymbol{\lambda}^t_{t-1}, \nabla \ell_t(\mathbf{w}_t) - \sigma_t \mathbf{w}_t) . \qquad (13)
\]
Clearly, this update rule satisfies the condition in Fig. 2. Therefore our framework encompasses this algorithm as a special case; a short simulation of the update appears below.
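As an illustration (ours, on a synthetic strongly convex stream), the update in Eq. (12) is a gradient step with the decaying rate $1/\sigma_{1:t}$. The sketch below runs it on regularized hinge losses $\ell_t(\mathbf{w}) = \frac{\sigma}{2}\|\mathbf{w}\|^2 + \max(0, 1 - y_t\langle \mathbf{x}_t, \mathbf{w}\rangle)$, using a subgradient in place of the gradient at the hinge's kink.

```python
import numpy as np

def gd_with_decaying_rate(xs, ys, sigma=1.0):
    """Online (sub)gradient descent with rate 1/sigma_{1:t}, as in Eq. (12)."""
    T, n = xs.shape
    w = np.zeros(n)
    total_loss = 0.0
    for t in range(1, T + 1):
        x, y = xs[t - 1], ys[t - 1]
        margin = y * np.dot(w, x)
        total_loss += 0.5 * sigma * np.dot(w, w) + max(0.0, 1.0 - margin)
        # v_t = sigma * w_t + (subgradient of the hinge part at w_t)
        v = sigma * w - (y * x if margin < 1.0 else 0.0)
        w = w - v / (sigma * t)          # sigma_{1:t} = sigma * t
    return w, total_loss

rng = np.random.default_rng(1)
xs = rng.normal(size=(500, 5))
ys = np.sign(xs @ rng.normal(size=5) + 0.1 * rng.normal(size=500))
w, loss = gd_with_decaying_rate(xs, ys)
```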
Example 3 (Online Coordinate-Dual-Ascent) The FTL and the Gradient-Descent updates are two extreme cases of our algorithmic framework. The former makes the largest possible increase of the dual while the latter makes the smallest possible increase that still satisfies the sufficient dual increase requirement. Intuitively, the FTL method should have smaller regret as it consumes more of its potential earlier in the optimization process. However, its computational complexity is large as it requires a full blown optimization procedure at each online round. A possible compromise is to fully optimize the dual objective but only with respect to a small number of dual variables. In the extreme case, we optimize only with respect to the last dual variable. Formally, we let
\[
\boldsymbol{\lambda}^{t+1}_i =
\begin{cases}
\boldsymbol{\lambda}^t_i & \text{if } i < t , \\
\operatorname{argmax}_{\boldsymbol{\lambda}_t} \mathcal{D}_{t+1}(\boldsymbol{\lambda}^t_1, \ldots, \boldsymbol{\lambda}^t_{t-1}, \boldsymbol{\lambda}_t) & \text{if } i = t .
\end{cases}
\]
Clearly, the above update satisfies the requirement in Fig. 2 and therefore enjoys a logarithmic regret bound. The computational complexity of performing this update is often small as we optimize over a single dual vector. Occasionally, it is possible to obtain a closed-form solution of the optimization problem, and in these cases the computational complexity of the coordinate-dual-ascent update is identical to that of the gradient-descent method.

5 Generalized strongly convex functions

In this section, we extend our algorithmic framework to deal with generalized strongly convex functions. We first need the following generalized definition of strong convexity.

Definition 1 A continuous function $f$ is $\sigma$-strongly convex over a convex set $S$ with respect to a norm $\|\cdot\|$ if $S$ is contained in the domain of $f$ and for all $\mathbf{v}, \mathbf{u} \in S$ and $\alpha \in [0, 1]$ we have
\[
f(\alpha \mathbf{v} + (1-\alpha)\mathbf{u}) \;\leq\; \alpha f(\mathbf{v}) + (1-\alpha) f(\mathbf{u}) - \frac{\sigma}{2}\,\alpha(1-\alpha)\,\|\mathbf{v} - \mathbf{u}\|^2 . \qquad (14)
\]
It is straightforward to show that the function $f(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$ is strongly convex with respect to the Euclidean norm. Less trivial examples are given below.

Example 4 The function $f(\mathbf{w}) = \sum_{i=1}^n w_i \log(w_i / \frac{1}{n})$ is strongly convex over the probability simplex, $S = \{\mathbf{w} \in \mathbb{R}^n_+ : \|\mathbf{w}\|_1 = 1\}$, with respect to the $L_1$ norm. Its conjugate function is $f^*(\boldsymbol{\theta}) = \log\big(\frac{1}{n}\sum_{i=1}^n \exp(\theta_i)\big)$.

Example 5 For $q \in (1, 2)$, the function $f(\mathbf{w}) = \frac{1}{2(q-1)}\|\mathbf{w}\|_q^2$ is strongly convex over $S = \mathbb{R}^n$ with respect to the $L_q$ norm. Its conjugate function is $f^*(\boldsymbol{\theta}) = \frac{1}{2(p-1)}\|\boldsymbol{\theta}\|_p^2$, where $p = (1 - 1/q)^{-1}$.

For proofs, see for example Shalev-Shwartz [2007]. In the appendix, we list several important properties of strongly convex functions. In particular, the Fenchel conjugate of a strongly convex function is differentiable.

INPUT: A strongly convex function $f$
FOR $t = 1, 2, \ldots, T$:
  1) Define $\mathbf{w}_t = \nabla f^*\Big(-\frac{\boldsymbol{\lambda}^t_{1:(t-1)}}{\sqrt{t}}\Big)$
  2) Receive a function $\ell_t$
  3) Suffer loss $\ell_t(\mathbf{w}_t)$
  4) Update $\boldsymbol{\lambda}^{t+1}_1, \ldots, \boldsymbol{\lambda}^{t+1}_t$ s.t. there exists $\boldsymbol{\lambda}_t \in \partial \ell_t(\mathbf{w}_t)$ with
     $\mathcal{D}_{t+1}(\boldsymbol{\lambda}^{t+1}_1, \ldots, \boldsymbol{\lambda}^{t+1}_t) \geq \mathcal{D}_{t+1}(\boldsymbol{\lambda}^t_1, \ldots, \boldsymbol{\lambda}^t_{t-1}, \boldsymbol{\lambda}_t)$

INPUT: A 1-strongly convex function $f$
FOR $t = 1, 2, \ldots, T$:
  1) Define $\mathbf{w}_t = \nabla f^*\Big(-\frac{\boldsymbol{\lambda}^t_{1:(t-1)}}{\sigma_{1:t}}\Big)$
  2) Receive a function $\ell_t = \sigma_t f + g_t$
  3) Suffer loss $\ell_t(\mathbf{w}_t)$
  4) Update $\boldsymbol{\lambda}^{t+1}_1, \ldots, \boldsymbol{\lambda}^{t+1}_t$ s.t. there exists $\boldsymbol{\lambda}_t \in \partial g_t(\mathbf{w}_t)$ with
     $\mathcal{D}_{t+1}(\boldsymbol{\lambda}^{t+1}_1, \ldots, \boldsymbol{\lambda}^{t+1}_t) \geq \mathcal{D}_{t+1}(\boldsymbol{\lambda}^t_1, \ldots, \boldsymbol{\lambda}^t_{t-1}, \boldsymbol{\lambda}_t)$

Figure 3: Primal-dual template algorithms for general online convex optimization (left) and online strongly convex optimization (right). Here $\mathbf{a}_{1:t} = \sum_{i=1}^t \mathbf{a}_i$, and for notational convenience, we implicitly assume that $\mathbf{a}_{1:0} = \mathbf{0}$. See text for description.

Consider the case where for all $t$, $\ell_t$ can be written as $\sigma_t f + g_t$, where $f$ is 1-strongly convex with respect to some norm $\|\cdot\|$ and $g_t$ is a convex function. We also make the simplifying assumption that $\sigma_t$ is known to the forecaster before he defines $\mathbf{w}_t$. For each round $t$, we now define the primal objective to be
\[
\mathcal{P}_t(\mathbf{w}) = \sigma_{1:(t-1)}\, f(\mathbf{w}) + \sum_{i=1}^{t-1} g_i(\mathbf{w}) . \qquad (15)
\]
The dual objective is (see again Sec. 2)
\[
\mathcal{D}_t(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_{t-1}) = -\sigma_{1:(t-1)}\, f^*\Big(-\frac{\boldsymbol{\lambda}_{1:(t-1)}}{\sigma_{1:(t-1)}}\Big) - \sum_{i=1}^{t-1} g_i^*(\boldsymbol{\lambda}_i) . \qquad (16)
\]
An algorithmic framework for online optimization in the presence of general strongly convex functions is given on the right-hand side of Fig. 3.
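To make the generalized setting concrete, the following Python snippet (our own check) works with the entropic complexity function of Example 4: it evaluates $f$, the claimed conjugate $f^*(\boldsymbol{\theta}) = \log(\frac{1}{n}\sum_i e^{\theta_i})$, and the link $\nabla f^*$, which is the softmax map onto the probability simplex; Lemma 1 and a random search over the simplex confirm the conjugate formula.

```python
import numpy as np

def f(w, n):
    # Entropic complexity function of Example 4: sum_i w_i * log(n * w_i).
    return np.sum(w * np.log(n * w))

def f_star(theta, n):
    # Claimed conjugate: log( (1/n) * sum_i exp(theta_i) ).
    return np.log(np.mean(np.exp(theta)))

def grad_f_star(theta):
    # Link function: the softmax map, landing in the probability simplex.
    e = np.exp(theta - theta.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n = 6
theta = rng.normal(size=n)

w_star = grad_f_star(theta)
# At the maximizer, <w, theta> - f(w) attains f*(theta) (Lemma 1).
assert abs(np.dot(w_star, theta) - f(w_star, n) - f_star(theta, n)) < 1e-8

# No point of the simplex exceeds the conjugate value.
for w in rng.dirichlet(np.ones(n), size=1000):
    assert np.dot(w, theta) - f(w, n) <= f_star(theta, n) + 1e-8
```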
The following theorem provides a logarithmic regret bound for the algorithmic framework given on the right-hand side of Fig. 3.

Theorem 2 Let $\ell_1, \ldots, \ell_T$ be a sequence of functions such that for all $t \in [T]$, $\ell_t = \sigma_t f + g_t$, where $f$ is 1-strongly convex w.r.t. a norm $\|\cdot\|$ and $g_t$ is a convex function. Then, any algorithm that can be derived from Fig. 3 (right) satisfies
\[
\sum_{t=1}^T \ell_t(\mathbf{w}_t) - \min_{\mathbf{w}} \sum_{t=1}^T \ell_t(\mathbf{w}) \;\leq\; \frac{1}{2}\sum_{t=1}^T \frac{\|\mathbf{v}_t\|_*^2}{\sigma_{1:t}} , \qquad (17)
\]
where $\mathbf{v}_t \in \partial g_t(\mathbf{w}_t)$ and $\|\cdot\|_*$ is the norm dual to $\|\cdot\|$. The proof of the theorem is given in Sec. B.

6 Summary

In this paper, we extended the primal-dual algorithmic framework for general convex functions from Shalev-Shwartz and Singer [2006], Shalev-Shwartz [2007] to strongly convex functions. The template algorithms are outlined in Fig. 3. The left algorithm is the primal-dual algorithm for general convex functions from Shalev-Shwartz and Singer [2006], Shalev-Shwartz [2007]. Here, $f$ is the complexity function, $(\boldsymbol{\lambda}^t_1, \ldots, \boldsymbol{\lambda}^t_{t-1})$ are the dual variables at time $t$, and $\mathcal{D}_t(\cdot)$ is the dual objective function at time $t$ (which is a lower bound on the primal value). The function $\nabla f^*$ is the gradient of the conjugate function of $f$, which can be viewed as a projection of the dual variables back into the primal space. At the least aggressive extreme, in order to obtain $\sqrt{T}$ regret, it is sufficient to set the new dual variable at each round to be a subgradient of the loss $\ell_t(\mathbf{w}_t)$. We then recover an online `mirror descent' algorithm [Beck and Teboulle, 2003, Grove et al., 2001, Kivinen and Warmuth, 1997], which specializes to gradient descent when $f$ is the squared 2-norm, or to the exponentiated gradient descent algorithm when $f$ is the relative entropy. At the most aggressive extreme, where $\mathcal{D}_t$ is maximized at each round, we have `Follow the Regularized Leader', which is $\mathbf{w}_t = \operatorname{argmin}_{\mathbf{w}} \sum_{i=1}^{t-1} \ell_i(\mathbf{w}) + \sqrt{t}\, f(\mathbf{w})$. Intermediate algorithms can also be devised, such as the `passive-aggressive' algorithms [Crammer et al., 2006, Shalev-Shwartz, 2007].

The right algorithm in Figure 3 is our new contribution for strongly convex functions. Any strongly convex loss function can be decomposed into $\ell_t = \sigma_t f + g_t$, where $g_t$ is convex. The algorithm for strongly convex functions is different in two ways. First, the effective learning rate is now $\frac{1}{\sigma_{1:t}}$ rather than $\frac{1}{\sqrt{t}}$ (see Step 1 in both algorithms). Second, more subtly, the condition on the dual variables (in Step 4) is only determined by the subgradient of $g_t$, rather than the subgradient of $\ell_t$. At the most aggressive end of the spectrum, where $\mathcal{D}_t$ is maximized at each round, we have the `Follow the Leader' (FTL) algorithm: $\mathbf{w}_t = \operatorname{argmin}_{\mathbf{w}} \sum_{i=1}^{t-1} \ell_i(\mathbf{w})$. At the least aggressive end, we have the gradient descent algorithm of Hazan et al. [2006] (which uses learning rate $\frac{1}{\sigma_{1:t}}$). Furthermore, we provide algorithms which lie in between these two extremes -- it is these algorithms which have the potential for most practical impact. Empirical observations suggest that algorithms which most aggressively close the duality gap tend to perform most favorably [Crammer et al., 2006, Shalev-Shwartz and Singer, 2007b]. However, at the FTL extreme, this is often computationally prohibitive to implement (as one must solve a full blown optimization problem at each round). Our template algorithm suggests a natural compromise, which is to optimize the dual objective but only with respect to a small number of dual variables (say the most current dual variable) -- we coin this algorithm online coordinate-dual-ascent. In fact, it is sometimes possible to obtain a closed-form solution of this optimization problem, so that the computational complexity of the coordinate-dual-ascent update is identical to that of a vanilla gradient-descent method; one such closed form is sketched below. This variant update still enjoys a logarithmic regret bound.
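To illustrate such a closed form (our own derivation for one concrete case, not an update stated in the paper), take $f(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$ and hinge parts $g_t(\mathbf{w}) = \max(0, 1 - y_t\langle \mathbf{x}_t, \mathbf{w}\rangle)$. Then $g_t^*$ is finite only on $\{-\alpha y_t \mathbf{x}_t : \alpha \in [0, 1]\}$, where it equals $-\alpha$, so maximizing $\mathcal{D}_{t+1}$ over the last dual variable is a one-dimensional concave quadratic in $\alpha$ with a clipped closed-form solution:

```python
import numpy as np

def coordinate_dual_ascent_step(theta, sigma_1t, x, y):
    """One online coordinate-dual-ascent step for a hinge part g_t.

    theta    -- lambda_{1:(t-1)}, the running sum of past dual variables
    sigma_1t -- sigma_{1:t}
    Maximizes D_{t+1}(.., lambda_t) over lambda_t = -alpha * y * x, alpha in [0,1]:
        h(alpha) = -||theta - alpha*y*x||^2 / (2*sigma_1t) + alpha + const,
    a concave quadratic whose unconstrained maximizer is clipped to [0,1].
    """
    alpha = (sigma_1t + y * np.dot(theta, x)) / np.dot(x, x)
    alpha = min(max(alpha, 0.0), 1.0)
    return theta - alpha * y * x          # new lambda_{1:t}

# Two rounds: predict via the primal-dual link, then update the dual in closed form.
theta, sigma = np.zeros(3), 1.0           # constant sigma_t, so sigma_{1:t} = sigma*t
for t, (x, y) in enumerate([(np.array([1., 0., 2.]), 1.0),
                            (np.array([0., 1., -1.]), -1.0)], start=1):
    w = -theta / (sigma * t)              # w_t = grad f*(-theta / sigma_{1:t})
    theta = coordinate_dual_ascent_step(theta, sigma * t, x, y)
```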
References

J. Abernethy, P. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2008.

P. L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. In Advances in Neural Information Processing Systems 21, 2007.

A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.

J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Conference on Empirical Methods in Natural Language Processing, 2002.

K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. Journal of Machine Learning Research, 7:551–585, Mar 2006.

A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3):173–210, 2001.

E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.

J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–64, January 1997.

S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.

S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In Advances in Neural Information Processing Systems 20, 2006.

S. Shalev-Shwartz and Y. Singer. Logarithmic regret algorithms for strictly convex repeated games. Technical report, The Hebrew University, 2007a. Available at http://www.cs.huji.ac.il/shais.

S. Shalev-Shwartz and Y. Singer. A unified algorithmic approach for efficient online label ranking. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2007b.

A Fenchel duality

Consider the optimization problem
\[
\inf_{\mathbf{w} \in S} \; f(\mathbf{w}) + \sum_{t=1}^T g_t(\mathbf{w}) .
\]
An equivalent problem is
\[
\inf_{\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_T} \; f(\mathbf{w}_0) + \sum_{t=1}^T g_t(\mathbf{w}_t) \quad \text{s.t.} \quad \mathbf{w}_0 \in S \ \text{ and } \ \forall t \in [T],\ \mathbf{w}_t = \mathbf{w}_0 . \qquad (18)
\]
Introducing $T$ vectors $\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T$, where each $\boldsymbol{\lambda}_t \in \mathbb{R}^n$ is a vector of Lagrange multipliers for the equality constraint $\mathbf{w}_t = \mathbf{w}_0$, we obtain the following Lagrangian:
\[
\mathcal{L}(\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_T, \boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T) = f(\mathbf{w}_0) + \sum_{t=1}^T g_t(\mathbf{w}_t) + \sum_{t=1}^T \langle \boldsymbol{\lambda}_t, \mathbf{w}_0 - \mathbf{w}_t \rangle .
\]
The dual problem is the task of maximizing the following dual objective value:
\[
\begin{aligned}
\mathcal{D}(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T)
&= \inf_{\mathbf{w}_0 \in S,\, \mathbf{w}_1, \ldots, \mathbf{w}_T} \mathcal{L}(\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_T, \boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T) \\
&= -\sup_{\mathbf{w}_0 \in S}\Big(\Big\langle \mathbf{w}_0, -\sum_{t=1}^T \boldsymbol{\lambda}_t \Big\rangle - f(\mathbf{w}_0)\Big) - \sum_{t=1}^T \sup_{\mathbf{w}_t}\big(\langle \mathbf{w}_t, \boldsymbol{\lambda}_t \rangle - g_t(\mathbf{w}_t)\big) \\
&= -f^*\Big(-\sum_{t=1}^T \boldsymbol{\lambda}_t\Big) - \sum_{t=1}^T g_t^*(\boldsymbol{\lambda}_t) ,
\end{aligned}
\]
where $f^*, g_1^*, \ldots, g_T^*$ are the Fenchel conjugate functions of $f, g_1, \ldots, g_T$. Therefore, the generalized Fenchel dual problem is
\[
\sup_{\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T} \; -f^*\Big(-\sum_{t=1}^T \boldsymbol{\lambda}_t\Big) - \sum_{t=1}^T g_t^*(\boldsymbol{\lambda}_t) . \qquad (19)
\]
Note that when $T = 1$ the above duality is the so-called Fenchel duality [Borwein and Lewis, 2006].
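As a quick numerical illustration (ours), for $f(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$ over $S = \mathbb{R}^n$ and linear $g_t(\mathbf{w}) = \langle \mathbf{x}_t, \mathbf{w} \rangle + b_t$ (whose conjugate is $-b_t$ at $\boldsymbol{\lambda}_t = \mathbf{x}_t$ and $+\infty$ elsewhere), both sides of the duality can be evaluated in closed form and coincide, consistent with the strong duality conditions discussed next.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 4, 3
xs = rng.normal(size=(T, n))       # g_t(w) = <x_t, w> + b_t
bs = rng.normal(size=T)

# Primal: min_w 0.5*||w||^2 + sum_t (<x_t, w> + b_t), minimized at w = -sum_t x_t.
s = xs.sum(axis=0)
primal = 0.5 * np.dot(s, s) + np.dot(s, -s) + bs.sum()

# Dual (Eq. 19): each lambda_t is forced to x_t (else g_t* is infinite), giving
#   -f*(-sum_t x_t) - sum_t (-b_t) = -0.5*||sum_t x_t||^2 + sum_t b_t.
dual = -0.5 * np.dot(s, s) + bs.sum()

assert abs(primal - dual) < 1e-12  # strong duality holds in this smooth case
```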
The weak duality theorem tells us that the primal objective upper bounds the dual objective:
\[
\sup_{\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T} \; -f^*\Big(-\sum_{t=1}^T \boldsymbol{\lambda}_t\Big) - \sum_{t=1}^T g_t^*(\boldsymbol{\lambda}_t) \;\leq\; \inf_{\mathbf{w} \in S} \; f(\mathbf{w}) + \sum_{t=1}^T g_t(\mathbf{w}) .
\]
A sufficient condition for equality to hold (i.e. strong duality) is that $f$ is a strongly convex function, $g_1, \ldots, g_T$ are convex functions, and the intersection of the domains of $g_1, \ldots, g_T$ is polyhedral. Assume that strong duality holds, let $(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T)$ be a maximizer of the dual objective function, and let $(\mathbf{w}_0, \ldots, \mathbf{w}_T)$ be a minimizer of the problem given in Eq. (18). Then, optimality conditions imply that
\[
(\mathbf{w}_0, \ldots, \mathbf{w}_T) = \operatorname{argmin}_{\mathbf{w}_0, \ldots, \mathbf{w}_T} \mathcal{L}(\mathbf{w}_0, \ldots, \mathbf{w}_T, \boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_T) .
\]
See for example Boyd and Vandenberghe [2004], Section 5.5.2. In particular, the above implies that
\[
\mathbf{w}_0 = \operatorname{argmax}_{\mathbf{w}_0 \in S} \Big(\Big\langle \mathbf{w}_0, -\sum_{t=1}^T \boldsymbol{\lambda}_t \Big\rangle - f(\mathbf{w}_0)\Big) = \nabla f^*(-\boldsymbol{\lambda}_{1:T}) ,
\]
where in the last equality we used Lemma 1. Since $\mathbf{w}_0$ is also a minimizer of our original problem, we obtain the primal-dual link function
\[
\mathbf{w} = \nabla f^*(-\boldsymbol{\lambda}_{1:T}) .
\]

B Proof of Thm. 2

We first use the following properties of the Fenchel conjugate of strongly convex functions. The proof of this lemma follows from Lemma 18 in Shalev-Shwartz [2007].

Lemma 3 Let $f$ be a $\sigma$-strongly convex function over $S$ with respect to a norm $\|\cdot\|$. Let $f^*$ be the Fenchel conjugate of $f$. Then, $f^*$ is differentiable and for all $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \in \mathbb{R}^n$, we have
\[
f^*(\boldsymbol{\theta}_1 + \boldsymbol{\theta}_2) - f^*(\boldsymbol{\theta}_1) \;\leq\; \langle \nabla f^*(\boldsymbol{\theta}_1), \boldsymbol{\theta}_2 \rangle + \frac{1}{2\sigma}\|\boldsymbol{\theta}_2\|_*^2 .
\]

We also need the following technical lemma.

Lemma 4 Assume $f$ is strongly convex, let $a, b > 0$, and let $\mathbf{w}_b = \nabla f^*(\boldsymbol{\theta}/b)$. Then,
\[
a\, f^*(\boldsymbol{\theta}/a) - b\, f^*(\boldsymbol{\theta}/b) \;\geq\; (b - a)\, f(\mathbf{w}_b) .
\]

Proof Since $f$ is strongly convex we know that $f^*$ is differentiable. Using Lemma 1 we have
\[
f^*(\boldsymbol{\theta}/b) = \langle \mathbf{w}_b, \boldsymbol{\theta}/b \rangle - f(\mathbf{w}_b) .
\]
The definition of $f^*$ now implies that
\[
f^*(\boldsymbol{\theta}/a) = \max_{\mathbf{w}} \; \langle \mathbf{w}, \boldsymbol{\theta}/a \rangle - f(\mathbf{w}) \;\geq\; \langle \mathbf{w}_b, \boldsymbol{\theta}/a \rangle - f(\mathbf{w}_b) .
\]
Therefore,
\[
a\, f^*(\boldsymbol{\theta}/a) - b\, f^*(\boldsymbol{\theta}/b) \;\geq\; \big(\langle \mathbf{w}_b, \boldsymbol{\theta} \rangle - a\, f(\mathbf{w}_b)\big) - \big(\langle \mathbf{w}_b, \boldsymbol{\theta} \rangle - b\, f(\mathbf{w}_b)\big) = -(a - b)\, f(\mathbf{w}_b) ,
\]
which concludes our proof.

Next, we show that the gradient descent update rule yields a sufficient increase of the dual objective.

Lemma 5 Let $(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_{t-1})$ be an arbitrary sequence of vectors. Denote $\mathbf{w} = \nabla f^*\Big(-\frac{\boldsymbol{\lambda}_{1:(t-1)}}{\sigma_{1:t}}\Big)$ and let $\boldsymbol{\lambda} \in \partial g_t(\mathbf{w})$. Then,
\[
\mathcal{D}_{t+1}(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_{t-1}, \boldsymbol{\lambda}) - \mathcal{D}_t(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_{t-1}) \;\geq\; \ell_t(\mathbf{w}) - \frac{\|\boldsymbol{\lambda}\|_*^2}{2\,\sigma_{1:t}} .
\]

Proof Denote $\bar{\Delta}_t = \mathcal{D}_{t+1}(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_{t-1}, \boldsymbol{\lambda}) - \mathcal{D}_t(\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_{t-1})$. Since $f$ is 1-strongly convex, we can apply Lemma 3 (with $\boldsymbol{\theta}_1 = -\frac{\boldsymbol{\lambda}_{1:(t-1)}}{\sigma_{1:t}}$ and $\boldsymbol{\theta}_2 = -\frac{\boldsymbol{\lambda}}{\sigma_{1:t}}$) to get that
\[
\begin{aligned}
\bar{\Delta}_t &= -\sigma_{1:t}\, f^*\Big(-\frac{\boldsymbol{\lambda}_{1:(t-1)} + \boldsymbol{\lambda}}{\sigma_{1:t}}\Big) + \sigma_{1:(t-1)}\, f^*\Big(-\frac{\boldsymbol{\lambda}_{1:(t-1)}}{\sigma_{1:(t-1)}}\Big) - g_t^*(\boldsymbol{\lambda}) \\
&\geq \underbrace{\sigma_{1:(t-1)}\, f^*\Big(-\frac{\boldsymbol{\lambda}_{1:(t-1)}}{\sigma_{1:(t-1)}}\Big) - \sigma_{1:t}\, f^*\Big(-\frac{\boldsymbol{\lambda}_{1:(t-1)}}{\sigma_{1:t}}\Big)}_{A} + \underbrace{\langle \mathbf{w}, \boldsymbol{\lambda} \rangle - g_t^*(\boldsymbol{\lambda})}_{B} - \frac{\|\boldsymbol{\lambda}\|_*^2}{2\,\sigma_{1:t}} .
\end{aligned}
\]
Since $\boldsymbol{\lambda} \in \partial g_t(\mathbf{w})$, we get from Lemma 1 that $B = g_t(\mathbf{w})$. Next, we use Lemma 4 (with $\boldsymbol{\theta} = -\boldsymbol{\lambda}_{1:(t-1)}$, $a = \sigma_{1:(t-1)}$, and $b = \sigma_{1:t}$) and the definition of $\mathbf{w}$ to get that $A \geq (\sigma_{1:t} - \sigma_{1:(t-1)})\, f(\mathbf{w}) = \sigma_t\, f(\mathbf{w})$. Thus, $A + B \geq \sigma_t f(\mathbf{w}) + g_t(\mathbf{w}) = \ell_t(\mathbf{w})$, and this concludes our proof.

The proof of Thm. 2 now easily follows.

Proof [of Thm. 2] Denote $\Delta_t = \mathcal{D}_{t+1}(\boldsymbol{\lambda}^{t+1}_1, \ldots, \boldsymbol{\lambda}^{t+1}_t) - \mathcal{D}_t(\boldsymbol{\lambda}^t_1, \ldots, \boldsymbol{\lambda}^t_{t-1})$ and note that Eq. (10) still holds. The definition of the update in Fig. 3 and Lemma 5 imply that there exists $\mathbf{v}_t \in \partial g_t(\mathbf{w}_t)$ such that $\Delta_t \geq \ell_t(\mathbf{w}_t) - \frac{\|\mathbf{v}_t\|_*^2}{2\,\sigma_{1:t}}$. Summing over $t$ and combining with Eq. (10), we conclude our proof.
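As a final numerical sanity check (ours), the smoothness inequality of Lemma 3 can be tested for the entropic $f$ of Example 4, which is 1-strongly convex w.r.t. the $L_1$ norm: its conjugate is $f^*(\boldsymbol{\theta}) = \log(\frac{1}{n}\sum_i e^{\theta_i})$, $\nabla f^*$ is the softmax map, and the dual norm is $L_\infty$.

```python
import numpy as np

def f_star(theta):
    # Conjugate of the entropic f of Example 4: log-mean-exp.
    return np.log(np.mean(np.exp(theta)))

def grad_f_star(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()                   # softmax

rng = np.random.default_rng(3)
for _ in range(1000):
    t1, t2 = rng.normal(size=6), rng.normal(size=6)
    lhs = f_star(t1 + t2) - f_star(t1)
    rhs = np.dot(grad_f_star(t1), t2) + 0.5 * np.max(np.abs(t2)) ** 2
    assert lhs <= rhs + 1e-12            # Lemma 3 with sigma = 1, dual norm L_inf
```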