Implicit Online Learning

Brian Kulis (kulis@eecs.berkeley.edu)
ICSI and Department of EECS, University of California, Berkeley, CA 94720, USA

Peter L. Bartlett (bartlett@eecs.berkeley.edu)
Department of EECS and Department of Statistics, University of California, Berkeley, CA 94720, USA

Abstract

Online learning algorithms have recently risen to prominence due to their strong theoretical guarantees and an increasing number of practical applications for large-scale data analysis problems. In this paper, we analyze a class of online learning algorithms based on fixed potentials and non-linearized losses, which yields algorithms with implicit update rules. We show how to efficiently compute these updates, and we prove regret bounds for the algorithms. We apply our formulation to several special cases where our approach has benefits over existing online learning methods. In particular, we provide improved algorithms and bounds for the online metric learning problem, and show improved robustness for online linear prediction problems. Results over a variety of data sets demonstrate the advantages of our framework.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

1. Introduction

In its most general form, online convex programming (Cesa-Bianchi & Lugosi, 2006; Zinkevich, 2003) can be described as follows: a player chooses some point w_t from a convex set at each iteration. A convex loss function ℓ_t is revealed, and the player pays loss ℓ_t(w_t). This is repeated for T timesteps, resulting in a total loss of Σ_t ℓ_t(w_t). Examples of problems that fall within this framework include online portfolio management, online linear regression, online classification, and many others (see Cesa-Bianchi & Lugosi (2006) for an overview).

Much of the recent success in online learning has been in developing algorithms that are provably competitive with offline counterparts. Typically, we measure the quality of an online learning method in terms of the regret: the difference between the sum of the losses Σ_t ℓ_t(w_t) and the sum of the losses of the best fixed predictor w*. Regret of O(√T) has been shown for general online convex programming and O(log T) for some special cases.

Online learning updates are often motivated as balancing a tradeoff between "conservativeness" and "correctiveness": when updating from step to step, we do not want to change our model (i.e., our vector w_t) very much, and yet we want to minimize the loss. Such a balance can naturally be modeled using a weighted sum of two terms, one for the conservativeness and one for the correctiveness; one can then compute an update as the minimizer of such a function.

Despite this motivation, such updates have proven difficult to analyze in general, so alternative approaches have been proposed and analyzed. One approach linearizes the losses via a first-order Taylor expansion, which makes the analysis simpler but looser (e.g. Kivinen & Warmuth (1997) and others). A second approach employs an evolving notion of conservativeness, leading to typically expensive updates (e.g. the method of Azoury & Warmuth (2001)).

Our goal in this paper is to analyze updates obtained by the non-linearized loss with a fixed notion of conservativeness (i.e. a fixed potential, or regularizer), leading to algorithms with implicit update rules. We motivate the study of these updates from practical problems, including non-negative linear regression and online metric learning. For example, a popular regularizer for metric learning is the LogDet divergence, and implicit methods involving this regularizer avoid restrictive assumptions and outperform other methods in practice. Implicit updates are further desirable in special cases where closed-form solutions are available, or where the updates can be computed easily; we discuss several such examples in the paper.

To our knowledge, implicit updates, while used ubiquitously to motivate online learning, have not been analyzed in general. As we will see, such updates are slightly more expensive to compute than fixed-potential methods with linearized losses (since the resulting implicit updates cannot always be computed in closed form), but less expensive than time-varying potential methods. To summarize the main contributions of our work:

· We show O(√T) regret for arbitrary Bregman divergence regularizers for the linear prediction problem, comparable to existing bounds.
· We show O(log T) regret when the loss functions are strongly convex and the potential is φ(x) = ½‖x‖₂², matching existing bounds up to terms independent of T.
· For LogDet online metric learning, our analysis yields O(√T) regret under a variety of constraints; previous bounds for the LogDet regularizer were not O(√T).
· We show results indicating that the implicit updates are desirable for linear regression and non-negative linear regression problems.
2. Online Convex Programming: Background and Related Work

We begin with a discussion of the necessary mathematical background and related work for online convex programming (OCP).

2.1. Mathematical Background

Recall that a function f is convex if, for any two points x and y in its domain and any λ ∈ (0, 1),
    f(λx + (1 - λ)y) ≤ λf(x) + (1 - λ)f(y).
The function is strictly convex if the inequality holds strictly. The convex conjugate f* of a convex function f on a convex domain S is defined as
    f*(x*) = sup_{x ∈ S} (⟨x*, x⟩ - f(x)).
We consider functions of Legendre type (see Rockafellar (1997)), which implies that the gradient map ∇φ is defined on int dom φ and is an isomorphism between int dom φ and int dom φ*. For a function φ of Legendre type, denote the Bregman divergence with respect to φ as
    D_φ(x, y) = φ(x) - φ(y) - (x - y)ᵀ∇φ(y).
Examples of Bregman divergences include the squared Euclidean distance ½‖x - y‖₂², arising from the function φ(x) = ½‖x‖₂². Other examples include the relative entropy (or KL-divergence), arising from φ(x) = Σ_i (x(i) log x(i) - x(i)), and the LogDet divergence, arising in the matrix domain from φ(X) = -log det X.
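To make these running examples concrete, the following NumPy sketch (our own illustration; the function names are not from the paper) evaluates the three divergences just described.

```python
import numpy as np

def bregman_euclidean(x, y):
    # phi(x) = 0.5 * ||x||_2^2  ->  D_phi(x, y) = 0.5 * ||x - y||_2^2
    return 0.5 * np.sum((x - y) ** 2)

def bregman_relative_entropy(x, y):
    # phi(x) = sum_i x(i) log x(i) - x(i)  ->  generalized KL divergence
    return np.sum(x * np.log(x / y) - x + y)

def bregman_logdet(X, Y):
    # phi(X) = -log det X  ->  D_phi(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - d
    d = X.shape[0]
    M = X @ np.linalg.inv(Y)
    return np.trace(M) - np.linalg.slogdet(M)[1] - d
```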
2.2. OCP Formulation

In online convex programming, the player (algorithm) chooses a point from a fixed convex set F, which we will denote as w_t, during each timestep t. After making this choice, the convex loss function ℓ_t is revealed and the player pays loss ℓ_t(w_t). The goal is to minimize the total loss Σ_t ℓ_t(w_t). We compare against the loss of the best fixed strategy. We denote the regret as
    R_T = Σ_{t=1}^T ℓ_t(w_t) - min_{w ∈ F} Σ_{t=1}^T ℓ_t(w).
Typically, we denote the optimal fixed strategy from F as w*, and so R_T = Σ_t (ℓ_t(w_t) - ℓ_t(w*)). Here, T is the total number of time steps.

To illustrate with a simple example, consider online linear prediction. The loss function at each step is given by ℓ_t(w_t) = ½(w_tᵀx_t - y_t)². That is, we may view the algorithm as receiving a data point x_t and response y_t at every iteration, with the goal of finding a good vector of weights to minimize the squared loss between a linear combination of the weights over the data points and the response variables. After paying the loss ℓ_t(w_t) at each iteration, the algorithm uses information about w_t, x_t, and y_t (and potentially information from previous timesteps) to update the vector of weights to w_{t+1}.

2.3. Explicit Updates for OCP

Algorithmically, a standard approach to designing online convex programming algorithms has been to update weights as a tradeoff between conservativeness and correctiveness. Loosely speaking, we do not want to change our vector w_t "too much" at each timestep, while at the same time we want the loss over the updated vector to be small. More formally, we can define the following regularized loss function:
    f_t(w) = D_φ(w, w_t) + η_t ℓ_t(w).   (1)
The first term (the "conservativeness" term) measures the divergence between w and the current vector w_t via a Bregman divergence, while the second term gives the loss incurred by w (and measures "correctiveness"). η_t is called the learning rate, and governs the tradeoff between these two terms. The update for computing w_{t+1} is then defined by the minimizer of (1) projected onto F, that is,
    w̃_{t+1} = argmin_{w ∈ dom φ} f_t(w),
    w_{t+1} = argmin_{w ∈ F} D_φ(w, w̃_{t+1}).   (2)

It has proven difficult to analyze such updates in general (see, e.g., the discussion in Kivinen & Warmuth (1997)); as a result, two related approaches have been proposed. The first assumes that the loss functions ℓ_t are linear, and if they are not, the loss functions are linearized via the first-order Taylor approximation about w_t. In particular, we can approximate ℓ_t(w) as
    ℓ_t(w) ≈ ℓ_t(w_t) + ∇ℓ_t(w_t)ᵀ(w - w_t),
and since the regret with the original loss functions is less than or equal to the regret with the approximated loss functions (due to convexity of the loss), we can bound the regret with the approximate functions to obtain a bound on the regret with the original loss functions. See Kivinen & Warmuth (1997); Zinkevich (2003); Hazan et al. (2007) for examples of this approach.

Assuming linear loss functions in this manner greatly simplifies both the mechanics of the algorithm and its analysis. The updates can generally be computed in closed form by computing the gradient of f_t with the linearized loss functions and solving for w. Noting that
    ∇_w D_φ(w, w_t) = ∇_w (φ(w) - φ(w_t) - (w - w_t)ᵀ∇φ(w_t)) = ∇φ(w) - ∇φ(w_t),
and ∇ℓ_t(w) = ∇ℓ_t(w_t) for linearized losses, we can solve for w̃_{t+1} by setting the gradient to zero as
    w̃_{t+1} = ∇φ*(∇φ(w_t) - η_t∇ℓ_t(w_t))
when φ is of Legendre type. Such updates have been analyzed for arbitrary Bregman divergence regularizers (Cesa-Bianchi & Lugosi, 2006). Assuming only that the loss functions are convex, it has been shown that the regret is bounded by O(√T). If the loss functions are strongly convex, the regret for φ(x) = ½‖x‖₂² has been shown to be bounded by O(log T) (Hazan et al., 2007).
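To illustrate the fixed-potential, linearized-loss update w̃_{t+1} = ∇φ*(∇φ(w_t) - η_t∇ℓ_t(w_t)), here is a minimal sketch (our own code and function names). Instantiating the potential as ½‖x‖₂² recovers plain gradient descent, while the negative entropy recovers the (unnormalized) exponentiated gradient update.

```python
import numpy as np

def explicit_update(w_t, grad_loss_at_w_t, eta, grad_phi, grad_phi_conj):
    """Linearized-loss (explicit) update: a gradient step in the dual, mapped back by (grad phi)^{-1}."""
    return grad_phi_conj(grad_phi(w_t) - eta * grad_loss_at_w_t)

def gd_step(w, g, eta):
    # phi(x) = 0.5*||x||^2: grad phi and its inverse are the identity -> gradient descent.
    return explicit_update(w, g, eta, lambda u: u, lambda u: u)

def eg_step(w, g, eta):
    # phi(x) = sum_i x(i) log x(i) - x(i): grad phi = log, inverse = exp
    # -> the (unnormalized) exponentiated gradient update.
    return explicit_update(w, g, eta, np.log, np.exp)
```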
Another approach, which we will not consider in detail in this paper, uses a time-varying potential. We define the function f_t as
    f_t(w) = D_{φ_t}(w, w_t) + η_t ℓ_t(w),
and so the Bregman divergence, our notion of conservativeness, changes from iteration to iteration (typically φ_t = φ_0 + Σ_{s ≤ t} η_s ℓ_s). Algorithms in this class include the Follow the Regularized Leader technique (e.g., Kakade & Shalev-Shwartz (2008)) and the elliptic potential method (Azoury & Warmuth, 2001). Such methods have shown logarithmic regret for a wider class of loss functions, including exp-concave loss functions. However, the updates for these methods are more expensive (in the vector case, typically O(d²) as opposed to O(d), where d is the dimensionality of the data), and the regret bounds have a linear dependence on the dimensionality.

3. Implicit Updates for OCP

Our goal in this work is to study the updates arising from solving f_t directly as in (2). As we will see in the next section, because we do not linearize the loss functions, the resulting updates cannot always be computed in closed form; hence we call them implicit updates.

3.1. Motivation

Given that there is a wide body of existing work on online learning, why bother with studying implicit updates? Below we sketch out a few reasons.

Non-negative Linear Regression. Consider the case of linear prediction (ℓ_t(w_t) = ½(w_tᵀx_t - y_t)²) using the Burg entropy function φ(x) = -Σ_i log x(i). A related scenario arises in the matrix case for online metric learning (Jain et al., 2009), where φ(X) = -log det X and the loss is ½(tr(W_tᵀX_t) - y_t)²; such a regularizer has been shown to be useful for learning positive semi-definite matrices due to LogDet's various intrinsic properties. In the vector case, the Bregman divergence corresponding to the Burg entropy is called the Itakura-Saito divergence:
    D_φ(x, y) = Σ_i (x(i)/y(i) - log(x(i)/y(i)) - 1).
It is a useful regularizer for non-negative linear regression, as its domain is limited to positive-valued vectors but, unlike the relative entropy, those vectors are not typically restricted to lie over a unit simplex. It is straightforward to show that the updates using the linearized loss in the case of Burg entropy regularization are given by
    w̃_{t+1}(i) = w_t(i) / (1 + η_t(w_tᵀx_t - y_t)w_t(i)x_t(i)).
As noted in Jain et al. (2009), this update can cause elements of w̃_{t+1} to be negative (when η_t(w_tᵀx_t - y_t)w_t(i)x_t(i) < -1), which is outside of the domain of φ. Hence, the analysis breaks down, and restrictive assumptions must be placed over the domain of the inputs and/or the step sizes. In contrast, Jain et al. (2009) derived an implicit update in closed form in the matrix domain (i.e. φ(X) = -log det X) for a class of losses that arise in online metric learning. Unlike the linearized case, updates using non-linearized losses always remain in the domain of φ, making analysis and computation simpler. Jain et al. (2009) proved a simple regret bound for this case, but it is not O(√T); we will refine and generalize their approach to a larger class of online LogDet metric learning algorithms with O(√T) regret.

Closed-Form Special Cases. Applying an update with a non-linearized loss is clearly desirable in cases where a closed-form update is possible, since the inherent approximation introduced by the linearized loss is removed without any additional cost for the update. In addition to the case noted above, one can also perform implicit gradient descent updates in closed form. Take the example with φ(x) = ½‖x‖₂², ℓ_t(w_t) = ½(w_tᵀx_t - y_t)², and F = Rᵈ. The update using the linearized loss is the standard gradient descent update:
    w_{t+1} = w_t - η_t(w_tᵀx_t - y_t)x_t.
Updating using the original loss ℓ_t and solving (2) results in the closed-form update
    w_{t+1} = w_t - η_t(w_tᵀx_t - y_t)x_t / (1 + η_t‖x_t‖₂²).
The updates are nearly identical, except that the implicit update can be viewed as using a modified learning rate η_t/(1 + η_t‖x_t‖₂²). Terms of this form will appear in our analysis. Other cases with φ(x) = ½‖x‖₂² also yield closed-form updates.

Robustness. Empirically, implicit updates outperform or nearly match the performance of explicit updates in general. Furthermore, the implicit methods appear to be more robust to scaling of the data. Using the gradient descent example from above, the implicit update naturally factors in the scale of the input vector x_t when computing the update, which may help explain such robustness. We will show some examples of this phenomenon in our experiments.
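The two closed-form updates above differ only in their effective step size; the following sketch (ours, assuming the squared loss and F = Rᵈ) makes the comparison explicit.

```python
import numpy as np

def explicit_gd(w, x, y, eta):
    # Standard (linearized-loss) gradient descent for l_t(w) = 0.5*(w'x - y)^2.
    return w - eta * (w @ x - y) * x

def implicit_gd(w, x, y, eta):
    # Implicit update: same direction, but the learning rate is scaled by
    # 1 / (1 + eta * ||x||^2), so the step automatically shrinks for large inputs.
    eta_eff = eta / (1.0 + eta * (x @ x))
    return w - eta_eff * (w @ x - y) * x
```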
3.2. Computing the Updates

The updates for the implicit online algorithm are obtained by solving (2) directly. Using the definition of the Bregman divergence, the gradient of f_t with respect to w simplifies to
    ∇φ(w) - ∇φ(w_t) + η_t∇ℓ_t(w),
and setting this to zero to compute w̃_{t+1} yields the expression ∇φ(w̃_{t+1}) = ∇φ(w_t) - η_t∇ℓ_t(w̃_{t+1}). Since φ is of Legendre type,
    w̃_{t+1} = ∇φ*(∇φ(w_t) - η_t∇ℓ_t(w̃_{t+1}))
(Rockafellar, 1997). This is an implicit update since w̃_{t+1} appears on both sides. For cases when the update cannot be computed in closed form, a root-finding method must be employed, but in many practical cases, fast root finding can be performed with only a few function evaluations. For example, in the case of linear prediction (ℓ_t(w) = ½(wᵀx_t - y_t)²), we can use a root finder to solve for the inner product ȳ_t = w̃_{t+1}ᵀx_t, resulting in the following equation:
    ∇φ*(∇φ(w_t) - η_t(ȳ_t - y_t)x_t)ᵀx_t - ȳ_t = 0.
The resulting computation is similar to computing a Bregman projection (Censor & Zenios, 1997). Typically, good implementations require only a few choices for ȳ_t (using a bisection method or a more sophisticated approach); for example, for the relative entropy and von Neumann divergence, one can adapt the root-finding method discussed in Kulis et al. (2009) for computing Bregman projections. In these cases, the typical running time of the algorithm is maintained at O(d), making the updates competitive with the explicit updates.
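A simple way to carry out this root-finding step is bisection on ȳ_t. The sketch below (our own code) assumes the solution is bracketed between the current prediction w_tᵀx_t and the target y_t, which is consistent with the behavior discussed in Section 4 (there, |ȳ_t - y_t| ≤ |ŷ_t - y_t|).

```python
import numpy as np

def implicit_update_rootfind(w_t, x_t, y_t, eta, grad_phi, grad_phi_conj, tol=1e-10):
    # Implicit update for the squared loss via 1-D bisection on ybar = w~_{t+1}' x_t.
    # Solves g(ybar) = grad_phi_conj(grad_phi(w_t) - eta*(ybar - y_t)*x_t)' x_t - ybar = 0.
    def g(ybar):
        return grad_phi_conj(grad_phi(w_t) - eta * (ybar - y_t) * x_t) @ x_t - ybar

    yhat = w_t @ x_t
    lo, hi = min(yhat, y_t), max(yhat, y_t)   # bracket: g(lo) >= 0 and g(hi) <= 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    ybar = 0.5 * (lo + hi)
    return grad_phi_conj(grad_phi(w_t) - eta * (ybar - y_t) * x_t)

# Example instantiation for the relative-entropy potential (grad phi = log, inverse = exp):
# w_next = implicit_update_rootfind(w, x, y, eta, np.log, np.exp)
```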
3.3. A General Stepwise Lemma and Regret Bound

Let us bound the regret of our implicit online algorithm by analyzing how the online solution compares with the optimal solution from step to step.

Lemma 3.1. [Stepwise Lemma] Using updates defined by (1) and (2), at each step t,
    α_tη_tℓ_t(w_t) - η_tℓ_t(w*) ≤ D_φ(w*, w_t) - D_φ(w*, w_{t+1}),
if α_t satisfies α_t f_t(w_t) ≤ f_t(w̃_{t+1}) and w* is the optimal offline solution.

Proof. We will prove that the stepwise lemma holds with w̃_{t+1} in place of w_{t+1}. Then, noting that
    D_φ(w*, w_t) - D_φ(w*, w_{t+1}) = [D_φ(w*, w_t) - D_φ(w*, w̃_{t+1})] + [D_φ(w*, w̃_{t+1}) - D_φ(w*, w_{t+1})]
        ≥ D_φ(w*, w_t) - D_φ(w*, w̃_{t+1}) + D_φ(w_{t+1}, w̃_{t+1})
        ≥ D_φ(w*, w_t) - D_φ(w*, w̃_{t+1}),
where the first inequality follows by the generalized Pythagorean inequality (see Censor & Zenios (1997), Thm. 2.4.1), the stepwise lemma will hold for w_{t+1} as well. Using the definition of Bregman divergences and the fact that ∇φ(w̃_{t+1}) = ∇φ(w_t) - η_t∇ℓ_t(w̃_{t+1}), straightforward algebra verifies that
    D_φ(w*, w_t) - D_φ(w*, w̃_{t+1}) = D_φ(w̃_{t+1}, w_t) + η_t∇ℓ_t(w̃_{t+1})ᵀ(w̃_{t+1} - w*).
Therefore, we will prove the lemma by showing that, for appropriate α_t, the following holds:
    α_tη_tℓ_t(w_t) - η_tℓ_t(w*) - D_φ(w̃_{t+1}, w_t) - η_t∇ℓ_t(w̃_{t+1})ᵀ(w̃_{t+1} - w*) ≤ 0.
The convexity of ℓ_t implies the following:
    ℓ_t(w*) ≥ ℓ_t(w̃_{t+1}) + (w* - w̃_{t+1})ᵀ∇ℓ_t(w̃_{t+1}).   (3)
We rearrange (3) and plug into the inequality above (after multiplying by η_t), and from this we know that the lemma is true if
    α_tη_tℓ_t(w_t) - η_tℓ_t(w̃_{t+1}) - D_φ(w̃_{t+1}, w_t) ≤ 0,   (4)
or equivalently, α_t f_t(w_t) - f_t(w̃_{t+1}) ≤ 0. The lemma follows.

We can use this lemma to prove the following regret bound. It has two pieces. The second, the almost-telescoping sum of divergences, is standard. The first is more unusual: it is small whenever the relative drop in the value of the criterion f_t after the implicit update is not too large. In the next section, we shall see that the following theorem implies O(√T) and O(log T) regret for several specific cases.

Theorem 3.2. Let R_T = Σ_t ℓ_t(w_t) - Σ_t ℓ_t(w*) be the regret of the implicit online learning algorithm. Then, for α_t ≤ f_t(w̃_{t+1})/f_t(w_t), we have
    R_T ≤ Σ_t (1/η_t)[(1 - α_t)η_tℓ_t(w_t) + D_φ(w*, w_t) - D_φ(w*, w_{t+1})].

Proof. We add (1 - α_t)η_tℓ_t(w_t) to both sides of the stepwise lemma, and then divide the inequality by η_t. Finally, sum over all T timesteps.

4. Special Cases

In the following, we will look in depth at some special cases of interest: the case of linear prediction, and the case when the loss functions are strongly convex. For linear prediction, we give a general result over all Bregman divergence regularizers that is comparable to known regret bounds, and discuss specific examples in detail. For strongly convex loss functions, we show how to achieve logarithmic regret, nearly matching existing bounds for this case.

4.1. Linear Prediction

For linear prediction, our loss function is given by ℓ_t(w_t) = ½(w_tᵀx_t - y_t)². Note that ∇²ℓ_t(w_t) = x_tx_tᵀ, which is not strongly convex. First we simplify the expression for α_t:

Lemma 4.1. For ℓ_t(w_t) = ½(w_tᵀx_t - y_t)² and regularization with an arbitrary strictly convex function φ of Legendre type, choosing
    α_t = 1/(1 + λ_tη_t)
satisfies the conditions of Lemma 3.1, where λ_t = x_tᵀ[∇²φ(w̃_{t+1})]⁻¹x_t.

The proof appears in the appendix. To prove a regret bound, we sum the stepwise lemma over all T timesteps using the above choice of α_t, and note the telescoping sum. The proof is similar to Zinkevich (2003) but is included here to highlight differences in the bounds. For notational simplicity, let ‖x‖²_{S_t} = xᵀS_tx, where S_t = [∇²φ(w̃_{t+1})]⁻¹.

Theorem 4.2. Let R_T = Σ_{t=1}^T ℓ_t(w_t) - Σ_{t=1}^T ℓ_t(w*) be the regret of the implicit online learning algorithm with φ a strictly convex function of Legendre type. Suppose there are constants G and D such that, for all w_t and S_t, ‖∇ℓ_t(w_t)‖²_{S_t} ≤ G and D_φ(w*, w_t) ≤ D. Then, choosing η_t = O(t^{-1/2}), we have R_T = O(√T).

Proof. We take the stepwise lemma result, add (1 - α_t)η_tℓ_t(w_t) to both sides, and then divide by η_t. This yields
    ℓ_t(w_t) - ℓ_t(w*) ≤ (λ_tη_t/(1 + λ_tη_t))ℓ_t(w_t) + (1/η_t)(D_φ(w*, w_t) - D_φ(w*, w_{t+1})).
Noting that λ_tη_t/(1 + λ_tη_t) ≤ λ_tη_t, we have
    ℓ_t(w_t) - ℓ_t(w*) ≤ λ_tη_tℓ_t(w_t) + (1/η_t)(D_φ(w*, w_t) - D_φ(w*, w_{t+1}))
        = (η_t/2)‖∇ℓ_t(w_t)‖²_{S_t} + (1/η_t)(D_φ(w*, w_t) - D_φ(w*, w_{t+1})),
and summing over all timesteps yields
    R_T ≤ Σ_t (η_t/2)‖∇ℓ_t(w_t)‖²_{S_t} + Σ_t (1/η_t)(D_φ(w*, w_t) - D_φ(w*, w_{t+1}))
        ≤ (G/2)Σ_t η_t + Σ_t (1/η_t)(D_φ(w*, w_t) - D_φ(w*, w_{t+1})).
Setting η_t = √(D/(tG)), we can simplify the second summation to √(GDT), since the sum telescopes. The first sum simplifies using Σ_{t=1}^T 1/√t ≤ 2√T - 1 to obtain the result R_T ≤ 2√(DGT).

Note that the simpler η_t = t^{-1/2} (as in Zinkevich (2003)) also yields O(√T) regret, but the result above is tighter. Moreover, we can improve the analysis by not simplifying via λ_tη_t/(1 + λ_tη_t) ≤ λ_tη_t as is done in the proof. In this case, we achieve regret of the form √T - log T when choosing η_t = O(t^{-1/2}).
4.2. Linear Prediction Examples

We discuss three particular special cases in this section: the squared Euclidean distance, the relative entropy, and the Itakura-Saito divergence. Since λ_t is indirectly a function of η_t, and therefore of t, we also show bounds on λ_t for each of these cases (alternatively, we can say that λ_t is the maximum over all vectors x and w of xᵀ[∇²φ(w)]⁻¹x in the domain of φ, but this may be a weak estimate of λ_t).

Squared Euclidean Distance: The squared Euclidean distance is generated by the function φ(x) = ½‖x‖₂²; this divergence is particularly desirable because the exact gradient update can be computed in closed form for linear prediction. Furthermore, we have ∇φ(x) = x and ∇²φ = I. Therefore, λ_t = ‖x_t‖₂². Given that λ_t = ‖x_t‖₂², we have λ_tℓ_t(w_t) = ½‖∇ℓ_t(w_t)‖₂².

Relative Entropy: Now consider the Bregman divergence obtained by the generating function φ(x) = Σ_i x(i) log x(i). In this case, λ_t = Σ_i w̃_{t+1}(i)x_t(i)² ≤ 3‖x_t‖₂². The equality follows by noting that ∇²φ(w̃_{t+1}) = diag(1/w̃_{t+1}), where the division is elementwise. The proof of the inequality is omitted due to lack of space.

Itakura-Saito and LogDet: A third special case arises in the case of non-negative linear regression. In the vector case, the potential is called the Burg entropy and is defined by φ(x) = -Σ_i log x(i); in the matrix case, the potential is defined as φ(X) = -log det X. It is straightforward to show that λ_t = Σ_i w̃_{t+1}(i)²x_t(i)² for the vector case, and λ_t = tr((X_tW̃_{t+1})²) for the matrix case.

To bound λ_t for the vector case, we let F be the set of vectors whose elements are less than or equal to 1. Let R be the maximum value of ‖x_t‖₁; we will also assume that y_t ∈ [-R, R] (since wᵀx_t ∈ [-R, R] by Hölder's inequality). The projection step onto F guarantees that ŷ_t = w_tᵀx_t ∈ [-R, R] for all t. Since |ȳ_t - y_t| ≤ |ŷ_t - y_t| (as w̃_{t+1} minimizes (2)), we can further conclude that ȳ_t ∈ [-3R, 3R]. Finally, let v(i) = w̃_{t+1}(i)x_t(i). Then the following holds:
    λ_t = ‖v‖₂² ≤ ‖v‖₁² ≤ 9R².
Thus, λ_t ≤ 9R². An analogous result holds for the matrix case.
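For reference, the λ_t quantities used in Lemma 4.1 are easy to compute for the three potentials above; the sketch below (our own code, where w_next stands for w̃_{t+1}) spells them out.

```python
import numpy as np

def lambda_euclidean(x_t, w_next):
    # Hessian of 0.5*||x||^2 is the identity, so lambda_t = ||x_t||_2^2.
    return x_t @ x_t

def lambda_relative_entropy(x_t, w_next):
    # Hessian of sum_i x(i) log x(i) is diag(1/w), so its inverse is diag(w).
    return np.sum(w_next * x_t ** 2)

def lambda_burg(x_t, w_next):
    # Hessian of -sum_i log x(i) is diag(1/w^2), so its inverse is diag(w^2).
    return np.sum((w_next * x_t) ** 2)
```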
4.3. Logarithmic Regret with Strongly Convex Losses

A second class of loss functions that has been widely studied assumes strong convexity of the loss; that is, ∇²ℓ_t ⪰ HI for some scalar H. An example would be the loss function ℓ_t(w_t) = ½‖w_t - x_t‖₂², whose second derivative is simply the identity (i.e., H = 1). Strongly convex losses are useful and widely studied because they have been shown to yield online learning algorithms with logarithmic regret bounds (e.g. Hazan et al. (2007); Kakade & Shalev-Shwartz (2008)). Here we will show that when choosing φ(x) = ½‖x‖₂², implicit updates achieve logarithmic regret for any strongly convex loss function. Our resulting bounds will be comparable to existing bounds; the first step is to prove a stronger version of the stepwise lemma, stated below.

Lemma 4.3. [Stepwise Lemma for H-Strongly Convex Losses] At each step t,
    α_tη_tℓ_t(w_t) - η_tℓ_t(w*) ≤ D_φ(w*, w_t) - D_φ(w*, w_{t+1}) - (η_tH/2)‖w* - w_{t+1}‖₂²,
where α_t ≤ f_t(w̃_{t+1})/f_t(w_t) and w* is the optimal offline solution.

Proof. The proof follows the stepwise lemma proof, except in place of (3) we use the following inequality arising from strong convexity (see Boyd & Vandenberghe (2004), Section 9.1.2):
    ℓ_t(w*) ≥ ℓ_t(w̃_{t+1}) + (w* - w̃_{t+1})ᵀ∇ℓ_t(w̃_{t+1}) + (H/2)‖w* - w̃_{t+1}‖₂².
We also use the generalized Pythagorean theorem for the additional (η_tH/2)‖w* - w_{t+1}‖₂² term, analogous to the stepwise lemma. The remainder of the proof is the same.

Next, we bound the difference in f_t obtained from updating w_t to w_{t+1}, using an upper bound on the Hessian of ℓ_t:

Lemma 4.4. For φ(x) = ½‖x‖₂² and ℓ_t such that hI ⪰ ∇²ℓ_t is an upper bound on the Hessian of ℓ_t, the following holds:
    f_t(w_t) - f_t(w̃_{t+1}) ≤ ((1 + η_th)/2)‖w̃_{t+1} - w_t‖₂².

See the proof in the appendix. The stepwise lemma ensures that the second term in the regret cancels when choosing η_t = 1/(Ht), and the lemma above ensures that the first term of the regret is bounded logarithmically.

Theorem 4.5. Let R_T = Σ_t ℓ_t(w_t) - Σ_t ℓ_t(w*) be the regret of the implicit online learning algorithm. Given that φ(x) = ½‖x‖₂² and ℓ_t is H-strongly convex, choosing η_t = 1/(Ht) yields R_T = O(log T) when running for a total of T timesteps.

The proof appears in the appendix. Comparing with the existing proof for explicit updates with strongly convex functions of Hazan et al. (2007), we see that the existing bound is R_T ≤ (G²/(2H))(1 + log T), where ‖∇ℓ_t‖₂ ≤ G. Our bound is identical for the terms dependent on T, but has an additional additive term dependent on the maximum condition number of the Hessian of ℓ_t.

5. Application: Online Metric Learning

So far, we have seen that implicit online learning updates can be applied in a variety of settings, yielding competitive regret bounds as compared to state-of-the-art explicit methods. To highlight an application of implicit updates for a practical problem, we now focus on a domain where implicit updates are particularly useful: online metric learning. As we discuss below, and as the empirical results indicate, the LogDet divergence as a regularizer has shown superior performance in this domain; further, as discussed in Section 3.1, an implicit update scheme is necessary. Our analysis yields better bounds and a more general algorithm.

The goal in online Mahalanobis metric learning is to learn a positive semi-definite matrix W that parameterizes the so-called Mahalanobis distance: d_W(x, y) = (x - y)ᵀW(x - y). We will update W in an online manner; at each iteration, we receive some information about the desired Mahalanobis distance, which is encoded in the loss function ℓ_t(W_t) = ½(tr(W_tX_t) - y_t)². For example, when X_t = (x_t - y_t)(x_t - y_t)ᵀ, then tr(W_tX_t) is simply the Mahalanobis distance between x_t and y_t using W_t, and so the loss encodes how close the Mahalanobis distance is to the target distance y_t. Note that in this case, X_t is a rank-one matrix. Another possibility, used frequently in the offline setting, is that of relative distance constraints.

Various regularizers are possible in the matrix domain. Examples include the squared Frobenius norm D_φ(X, Y) = ½‖X - Y‖_F², the von Neumann divergence D_φ(X, Y) = tr(X log X - X log Y - X + Y), and the LogDet divergence, discussed earlier. Because the constraints for online metric learning are typically low-rank (i.e., the matrices X_t are low-rank), the implicit LogDet updates can be computed in closed form (see Jain et al. (2009) for the case of similarity and dissimilarity constraints) in O(d²) time, where d is the number of rows/columns of W_t. The von Neumann divergence updates require an eigendecomposition in practice, resulting in updates that cost O(d³), and because the W_t matrices are restricted to the positive semi-definite cone, the squared Frobenius norm updates must project onto the PSD cone, which may be expensive. Combined with the fact that LogDet has produced good empirical results for online metric learning, the LogDet divergence is a natural regularizer for the online metric learning problem.

The previous work of Jain et al. (2009) considered only the case of similarity and dissimilarity constraints, and the analysis depended on properties of rank-one matrices. In contrast, the analysis in this paper applies in general to other constraints and regularizers for the online metric learning problem. Furthermore, our bounds for the case considered in Jain et al. (2009) are stronger; their bounds are not of the form O(√T), even if the stepsize η_t is defined to be O(1/√t).
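As an illustration of the O(d²) closed form for rank-one constraints, the following sketch (our own code, derived from the implicit optimality condition W̃⁻¹ = W⁻¹ + η(ŷ - y)zzᵀ with ŷ = zᵀW̃z; constants and conventions may differ from those in Jain et al. (2009)) performs one implicit LogDet update.

```python
import numpy as np

def logdet_implicit_update(W, z, y, eta):
    """One implicit LogDet update for X_t = z z' and loss 0.5*(tr(W X_t) - y)^2 (sketch).

    Assumes W is positive definite and a = z' W z > 0. Sherman-Morrison reduces the
    implicit optimality condition to a scalar quadratic in yhat = z' W~ z."""
    a = z @ W @ z
    # yhat solves eta*a*yhat^2 + (1 - eta*y*a)*yhat - a = 0; take the nonnegative root.
    yhat = (eta * y * a - 1 + np.sqrt((eta * y * a - 1) ** 2 + 4 * eta * a * a)) / (2 * eta * a)
    beta = eta * (yhat - y) / (1 + eta * (yhat - y) * a)
    Wz = W @ z
    return W - beta * np.outer(Wz, Wz)   # rank-one correction: O(d^2) per step
```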
6. Experimental Results

Synthetic Results. We begin with some simple synthetic experiments to demonstrate advantages of the implicit approach. First, we compare standard gradient descent with the implicit gradient descent algorithm for linear prediction. An optimal 20-dimensional vector w* was chosen randomly (with uniform weights in [0, 1]). For each timestep, we chose random x_t vectors with weights drawn uniformly from [-.5, .5] and computed a y_t value as the inner product between the optimal w* and x_t, with added Gaussian noise (variance .1). We compared the methods (using η_t = 1/√t), and then repeated the experiment by scaling the x_t vectors (and the noise) by 2 and 2.5 (but keeping the same learning rate). The results in the top row of Figure 1 show that as the scale increases, standard gradient descent makes larger mistakes during early timesteps, whereas the implicit method appears more robust during these timesteps. As a result, the difference in the total accumulated loss between the methods grows with the scale of the data.

Next, we compared exponentiated gradient descent, exponentiated gradient descent with implicit updates, and the Itakura-Saito based descent method. As before, we construct x_t vectors with weights chosen uniformly at random from [-.5, .5]. When the optimal weight vector is normalized to sum to 1, we found that the exponentiated gradient descent methods and the Itakura-Saito method all gave nearly identical results (plots not shown). We also tested cases where the optimal weight vector was not normalized to sum to 1. For this case, the loss for the EGD-based methods was always significantly higher due to the restriction that the weights lie on the unit simplex. See the final plot of Figure 1 for an example; note that both EGD-based methods performed similarly in this experiment and it is difficult to distinguish them on the plot.

[Figure 1. Synthetic results: accumulated loss versus number of time steps for gradient descent and implicit gradient descent at scales 1, 2, and 2.5, and for exponentiated gradient descent (explicit and implicit) versus the Itakura-Saito method with no noise and an unnormalized optimal weight vector. The first three plots show how the implicit gradient descent algorithm is more robust as the scale of the data is increased; the last plot shows that, for non-negative linear regression, the Itakura-Saito methods (which require implicit analysis) are more suited than exponentiated gradient descent for problems where the optimal weight vector is not normalized. See text for details.]
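For reproducibility, here is a rough sketch (our own code; initialization and other minor details are assumptions) of the scaling comparison described above.

```python
import numpy as np

def synthetic_comparison(scale=2.0, T=6000, d=20, noise_std=0.1 ** 0.5, seed=0):
    """Accumulated squared loss of explicit vs. implicit gradient descent, eta_t = 1/sqrt(t)."""
    rng = np.random.default_rng(seed)
    w_star = rng.uniform(0.0, 1.0, d)
    w_exp, w_imp = np.zeros(d), np.zeros(d)
    loss_exp = loss_imp = 0.0
    for t in range(1, T + 1):
        x = scale * rng.uniform(-0.5, 0.5, d)
        y = w_star @ x + scale * noise_std * rng.normal()
        eta = 1.0 / np.sqrt(t)
        loss_exp += 0.5 * (w_exp @ x - y) ** 2
        loss_imp += 0.5 * (w_imp @ x - y) ** 2
        w_exp = w_exp - eta * (w_exp @ x - y) * x
        w_imp = w_imp - eta / (1 + eta * (x @ x)) * (w_imp @ x - y) * x
    return loss_exp, loss_imp
```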
Online Metric Learning. We applied our method to the online metric learning problem over a set of UCI data sets. We compared various regularizers for metric learning, using similarity and dissimilarity constraints (see Section 5 for a description of the problem). For this experiment, we ran the following methods: a von Neumann regularized explicit method, a von Neumann regularized implicit method, a squared Frobenius regularized explicit method, a squared Frobenius regularized implicit method, and the LogDet regularized implicit method. For each algorithm, we ran 10,000 online timesteps, and constraints were generated randomly over a 70 percent training set. Each constraint is a random pair of points from the training set, and target distances are set to the 5th and 95th percentile of the training set distances for same-class and different-class constraints, respectively. Results are averaged over 10 runs.

We see from the results in Figure 2 that the LogDet method clearly outperforms the other online regularizers and algorithms on these data sets. Furthermore, given the additional cost of the von Neumann-based methods (each iteration is O(d³), whereas the cost of an iteration for the LogDet-based method is O(d²)), the requirement that the Frobenius-based methods project onto the positive semi-definite cone, and the fact that LogDet cannot be effectively analyzed in the explicit case, these results further validate our analysis of implicit methods.

[Figure 2. Comparison of metric learning methods on standard UCI data sets (Iris, Balance Scale, Soybean, Car, Audiology). k-NN error is shown for the von Neumann, von Neumann Imp., Frobenius, Frobenius Imp., and LogDet methods.]

References

Azoury, K. S. and Warmuth, M. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211-246, 2001.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
Censor, Y. and Zenios, S. Parallel Optimization. Oxford University Press, 1997.
Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.
Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169-192, 2007.
Jain, P., Kulis, B., Dhillon, I., and Grauman, K. Online metric learning and fast similarity search. In Advances in Neural Information Processing Systems (NIPS), 2009.
Kakade, S. and Shalev-Shwartz, S. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems (NIPS), 2008.
Kivinen, J. and Warmuth, M. K. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-64, 1997.
Kulis, B., Sustik, M., and Dhillon, I. Low-rank kernel learning with Bregman matrix divergences. Journal of Machine Learning Research (JMLR), 10, 2009.
Rockafellar, R. T. Convex Analysis. Princeton University Press, 1997.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML), 2003.
Appendix: Proofs

Proof: [Lemma 4.1] Recall from the stepwise lemma that α_t must satisfy α_t f_t(w_t) ≤ f_t(w̃_{t+1}) or, equivalently, α_t f_t(w_t) - f_t(w̃_{t+1}) ≤ 0. Define the left-hand side using (4) as a function of y_t; that is,
    h(y_t) = (α_tη_t/2)(w_tᵀx_t - y_t)² - (η_t/2)(w̃_{t+1}ᵀx_t - y_t)² - D_φ(w̃_{t+1}, w_t).
We will show that the derivative of h is zero somewhere (and h = 0 at that point), and that the second derivative is always negative for a specific choice of α_t, implying that h ≤ 0 everywhere. Since ∂f_t/∂w̃_{t+1} = 0, we know from the multivariate chain rule that when taking the derivative of h with respect to y_t, we can treat w̃_{t+1} as a constant. That leads to
    dh/dy_t = -α_tη_t(ŷ_t - y_t) + η_t(ȳ_t - y_t),
where ŷ_t = w_tᵀx_t and ȳ_t = w̃_{t+1}ᵀx_t. When y_t = ŷ_t, the loss is zero and so w̃_{t+1} = w_t, h = 0, and dh/dy_t = 0. Now we must compute the second derivative of h with respect to y_t and show that it is always negative. The second derivative is computed as (noting that ȳ_t is a function of y_t)
    d²h/dy_t² = α_tη_t + η_t(dȳ_t/dy_t - 1),
which results in the condition α_t ≤ 1 - dȳ_t/dy_t. From Section 3.2, we know that w̃_{t+1} = ∇φ*(∇φ(w_t) - η_t(w̃_{t+1}ᵀx_t - y_t)x_t), and therefore, since ȳ_t = w̃_{t+1}ᵀx_t, we have ȳ_t = ∇φ*(∇φ(w_t) - η_t(ȳ_t - y_t)x_t)ᵀx_t. Simple application of the chain rule yields
    dȳ_t/dy_t = -η_t x_tᵀ[∇²φ*(∇φ(w_t) - η_t(ȳ_t - y_t)x_t)]x_t (dȳ_t/dy_t - 1).
Using ∇²φ*(∇φ(x)) = [∇²φ(x)]⁻¹ and λ_t = x_tᵀ[∇²φ(w̃_{t+1})]⁻¹x_t, we simplify to
    dȳ_t/dy_t = η_tλ_t/(1 + η_tλ_t),
and simplification of α_t yields the lemma.

Proof: [Lemma 4.4] An upper bound of hI on ∇²ℓ_t (Boyd & Vandenberghe (2004), Section 9.1.2) yields
    ℓ_t(w_t) ≤ ℓ_t(w̃_{t+1}) + (w_t - w̃_{t+1})ᵀ∇ℓ_t(w̃_{t+1}) + (h/2)‖w_t - w̃_{t+1}‖₂².
Since w̃_{t+1} = w_t - η_t∇ℓ_t(w̃_{t+1}), the term (w_t - w̃_{t+1})ᵀ∇ℓ_t(w̃_{t+1}) simplifies to (1/η_t)‖w̃_{t+1} - w_t‖₂². Multiplying the whole equation by η_t and simplifying via f_t(w̃_{t+1}) = ½‖w̃_{t+1} - w_t‖₂² + η_tℓ_t(w̃_{t+1}), the result follows.

Proof: [Theorem 4.5] By summing over all T timesteps as in Theorem 3.2 and using Lemma 4.3, we obtain
    R_T ≤ Σ_t [ (1/η_t)(f_t(w_t) - f_t(w̃_{t+1})) + (1/(2η_t))(‖w* - w_t‖₂² - ‖w* - w_{t+1}‖₂²) - (H/2)‖w* - w_{t+1}‖₂² ].
By choosing η_t = 1/(Ht), the second sum telescopes to (1/(2η_1))‖w* - w_1‖₂² = (H/2)‖w* - w_1‖₂² ≤ HD/2, where D bounds ‖w* - w_1‖₂². For the first sum, we use Lemma 4.4 and the fact that ‖w̃_{t+1} - w_t‖₂² = ‖η_t∇ℓ_t(w̃_{t+1})‖₂² ≤ η_t²G², where ‖∇ℓ_t‖₂ ≤ G. This gives us, using Σ_{t=1}^T 1/t ≤ 1 + log T and Σ_{t=1}^T 1/t² ≤ 2,
    R_T ≤ (G²/2)Σ_t (η_t + η_t²h) + HD/2 ≤ G²(1 + log T)/(2H) + G²h/H² + HD/2.