Efficient learning algorithms for changing environments

Elad Hazan  ehazan@cs.princeton.edu
C. Seshadhri  csesha@cs.princeton.edu
IBM Almaden Research Center, 650 Harry Rd., San Jose, CA 95120 USA

Abstract

We study online learning in an oblivious changing environment. The standard measure of regret bounds the difference between the cost of the online learner and the best decision in hindsight. Hence, regret-minimizing algorithms tend to converge to the static optimum, which is clearly suboptimal behavior in changing environments. On the other hand, the various metrics proposed to strengthen regret and allow for more dynamic algorithms produce inefficient algorithms. We propose a different performance metric which strengthens the standard notion of regret and measures performance with respect to a changing comparator. We then describe a series of data-streaming-based reductions which transform algorithms for minimizing (standard) regret into adaptive algorithms, while incurring only poly-logarithmic computational overhead. Using this reduction, we obtain efficient low adaptive-regret algorithms for the problem of online convex optimization. This can be applied to various learning scenarios, e.g. online portfolio selection, for which we describe experimental results showing the advantage of adaptivity.

1. Introduction

In online optimization the decision maker iteratively chooses a decision without knowledge of the future, and pays a cost based on her decision and the observed outcome. The game theory and machine learning literature has produced a host of algorithms which perform nearly as well as the best single decision in hindsight. Formally, the average regret of the online player, which is the average difference between her cost and the cost of the best strategy in hindsight, approaches zero as the number of game iterations grows.

In scenarios in which the environment variables are sampled from some (unknown) distribution, regret minimization algorithms effectively "learn" the environment and approach the optimal strategy. However, if the underlying distribution changes, no such claim can be made. When the environment undergoes many changes, regret may not be the best measure of performance. Various researchers noted the "static" behavior of regret-minimizing algorithms and generalized the concept of regret to allow for a changing prediction strategy (Herbster & Warmuth, 1998; Freund et al., 1997; Lehrer, 2003; Blum & Mansour, 2007). Although of great generality and importance in various scenarios, previous research fails to provide efficient algorithms for continuous online optimization problems, in particular portfolio management.

In this paper we aim to strike a balance between efficiency and adaptivity: we define a new measure of regret called adaptive regret, which is less general than some previous approaches, but general enough to capture intuitive notions of adaptivity. We then use sketching and data-streaming techniques to design efficient learning algorithms. Our main result is an efficiency-preserving reduction which transforms any low-regret algorithm into a low adaptive-regret algorithm. Adaptive regret deals with the behavior of learning algorithms on contiguous intervals, which intuitively captures how well the algorithm tracks changes in the environment.
We give several variants of online convex optimization algorithms whose cost on any contiguous sequence of iterations is within a small additive term of the local optimum, i.e. the cost of the optimal decision on that particular sequence of iterations. In contrast, low regret guarantees a small difference in cost only over the entire sequence of iterations, and only compared to the global optimum. In particular, we give the first efficient adaptive algorithm for online portfolio management.

1.1. Formal statement of results

In online convex optimization, in each round t = 1, 2, ..., the decision maker plays a point $x_t$ from a convex domain $K \subseteq \mathbb{R}^n$. A convex loss function $f_t$ is presented, and the decision maker incurs a loss of $f_t(x_t)$. The standard performance measure is regret, which is the difference between the loss incurred by the online player using algorithm A and the best fixed optimum in hindsight:

$\mathrm{Regret}_T(A) = \sum_{t=1}^{T} f_t(x_t) - \min_{x^* \in K} \sum_{t=1}^{T} f_t(x^*).$

We consider an extension of the above quantity to measure the performance of a decision maker in a changing environment:

Definition 1.1. The adaptive regret of an online convex optimization algorithm A is defined as the maximum regret it achieves over any contiguous time interval. Formally,

$\text{Adaptive-Regret}_T(A) \triangleq \sup_{I=[r,s] \subseteq [T]} \Big( \sum_{t=r}^{s} f_t(x_t) - \min_{x_I^* \in K} \sum_{t=r}^{s} f_t(x_I^*) \Big).$

A crucial point in the above definition is that the comparison in cost is with respect to a different optimum for every interval, i.e. $x_I^*$ can vary arbitrarily with I. Intuitively, this quantity measures, on every interval of time, how well we performed compared to the optimum in hindsight for that interval. Ideally, we want the adaptive regret to be strongly sublinear.

Why is attaining low adaptive regret difficult? Here is a simple explanatory example in which current algorithms fail: consider square loss in one dimension. In each round, we play a point $x_t \in [-1, 1]$. The loss function $f_t$ is one of the two functions $f_t(x) \in \{(x-1)^2, (x+1)^2\}$. For this scenario, the simple "follow-the-leader" algorithm, which plays the optimum decision so far, $x_{t+1} = \frac{1}{t}\sum_{i=1}^{t} y_i$ where $y_i \in \{\pm 1\}$ is chosen appropriately, is known to attain $O(\log T)$ regret (Cesa-Bianchi & Lugosi, 2006). Consider the case in which the loss function is $(x-1)^2$ for the first half of the game iterations, and then $(x+1)^2$ for the rest. The optimum in hindsight is the point 0, and that is exactly where all present algorithms that attain $O(\log T)$ regret converge. However, this behavior is disastrous in terms of adaptive regret: current low-regret algorithms take too long (a linear number of iterations) to converge to -1, and attain adaptive regret of $\Omega(T)$. Standard regret minimization is too encumbered with the past to be able to shift rapidly. In contrast, the following theorem asserts an algorithm whose "shifting time" is poly-logarithmic.

Theorem 1.1. For online convex optimization with exp-concave loss functions (e.g. the online portfolio selection problem), there exists an algorithm with $\text{Adaptive-Regret}_T = O(\log^2 T)$ (1) and running time poly(n, log T).

A natural question that arises is whether the $O(\log^2 T)$ adaptive regret is tight. In the full version of this paper (see (Hazan & Seshadhri, 2007)) we show that in fact $O(\log T)$ adaptive regret is attainable and tight. However, the running time then deteriorates to poly(n, T).

(1) We can also ensure that the "standard regret" remains $O(\log T)$.
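To make the square-loss example above concrete, here is a minimal simulation sketch (ours, not from the paper; the horizon T = 1000 and the bookkeeping are illustrative choices) of follow-the-leader on the switching sequence. Its regret over the whole horizon stays small, while its regret on the second half, where the local optimum is -1, grows linearly:

T = 1000
ys = [1.0] * (T // 2) + [-1.0] * (T // 2)      # f_t(x) = (x - y_t)^2 switches sign at T/2

x = 0.0                                        # follow-the-leader prediction (x_1 is arbitrary)
total_loss, second_half_loss = 0.0, 0.0
for t, y in enumerate(ys, start=1):
    loss = (x - y) ** 2
    total_loss += loss
    if t > T // 2:
        second_half_loss += loss
    x = (x * (t - 1) + y) / t                  # x_{t+1} = (1/t) * sum_{i <= t} y_i

x_star = sum(ys) / T                           # best fixed point in hindsight (here 0)
standard_regret = total_loss - sum((x_star - y) ** 2 for y in ys)
print(standard_regret)                         # small, consistent with the O(log T) guarantee
print(second_half_loss)                        # the local optimum -1 pays 0 on the second half,
                                               # so this equals the regret there: it grows like T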
1.2. Logarithmic vs. polynomial regret

The results stated above assume exp-concave loss functions (which apply to portfolio selection). Our reductions are general enough to apply to arbitrary convex cost functions (e.g. linear costs) (2); however, in this abstract we focus on the exp-concave case, which allows for logarithmic, rather than the usual square-root, regret. The reason is that in this setting no polynomial-time adaptive algorithms were known. In contrast, Zinkevich's algorithm for general convex functions (Zinkevich, 2003) does run in polynomial time and has adaptive guarantees. For the latter general convex case our improvement is in terms of efficiency only, and is not as dramatic (for example, in online shortest paths, Zinkevich's algorithm would require solving a convex program, which takes time roughly $\tilde{O}(n^{3.5})$, vs. our implementation, which would require $O(\log T)$ shortest-path computations).

(2) In the full version of this manuscript, available at (Hazan & Seshadhri, 2007), this generalization is detailed.

1.3. Relation to previous work

The idea of measuring performance by notions stronger than regret was dealt with in (Herbster & Warmuth, 1998; Bousquet & Warmuth, 2003) under the name "tracking the best expert". Their focus was on the discrete expert setting and exp-concave loss functions. In this scenario, they proved regret bounds versus the best k-shifting expert, where the optimum in hindsight is allowed to change its value k times. Freund et al. (1997) generalize this to a setting in which only a subset of the experts make predictions in different iterations. This is further generalized by Lehrer (2003) and Blum and Mansour (2007) to deal with more complicated situations, where the total loss of an expert is computed by assigning a weight $w_t$ to the loss of the expert in round t. The notion of adaptive regret is a (scaled) special case of these generalized regret definitions, in which temporal locality is given special importance.

Our setting differs from these expert settings in the following respects. We consider continuous decision sets rather than discrete ones. Although it is possible to discretize continuous sets and apply previous algorithms, such reductions are inefficient, resulting in exponential-time algorithms. As for performance guarantees, the notion of adaptive regret generalizes (and is not equivalent to) regret versus the best k-shifting optimum: an algorithm with O(R) adaptive regret obviously has O(kR) regret against a k-shifting comparator. The converse is not necessarily true, since regret can be negative. In independent work, Kozat and Singer (2007) attained related bounds of regret vs. a k-shifting strategy for the portfolio management problem. Our setting is more general, as it allows us to tackle the general online convex optimization problem efficiently, and the techniques used are completely different.

2. Preliminaries

Various online problems can be modeled in the online convex optimization framework, as defined above. For example, the online portfolio selection problem (Cover, 1991) is modeled by taking the convex set to be the set of all distributions over n assets, i.e. the n-dimensional simplex. The loss functions are taken to be $f_t(x) = -\log(x \cdot r_t)$, where $r_t$ is the return vector, a non-negative vector which contains in each coordinate the ratio of closing to opening price for the corresponding asset. We say that a loss function f is $\alpha$-exp-concave if the function $e^{-\alpha f(x)}$ is concave (as are the loss functions in online portfolio selection).
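As a toy illustration of this setup (ours, with synthetic return data rather than anything from the paper): the cumulative loss of a fixed portfolio x is exactly the negative logarithm of the wealth achieved by rebalancing to x every round, and since $e^{-f_t(x)} = x \cdot r_t$ is linear in x, each $f_t$ is 1-exp-concave.

import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 250
returns = rng.uniform(0.97, 1.03, size=(T, n))          # hypothetical r_t: close/open price ratios

x = np.ones(n) / n                                      # uniform constant-rebalanced portfolio
losses = np.array([-np.log(x @ r) for r in returns])    # f_t(x) = -log(x . r_t)

# exp(-sum of losses) is exactly the wealth multiplier of rebalancing to x every round
print(np.exp(-losses.sum()), np.prod(returns @ x))      # the two numbers coincide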
For simplicity we henceforth assume that the cost functions are bounded over the decision set in absolute value by one (the generalization is a simple matter of scaling). We base our results on the following well-known fact from online learning theory:

Fact 2.1. There exist algorithms for online convex optimization with $\alpha$-exp-concave loss functions which attain regret $O(\frac{1}{\alpha}\log T)$ and run in time $O(n^3)$ (Hazan et al., 2006). Any online algorithm must suffer worst-case regret of $\mathrm{Regret}_T(\mathrm{OPT}) = \Omega(\frac{1}{\alpha}\log T)$ (Cover, 1991).

3. The Algorithm

The basic idea of our algorithm is to reduce the continuous optimization problem at hand to the discrete realm of experts, where the experts are themselves online optimization algorithms. We choose the "sub-algorithms" such that at least one is guaranteed to have good performance at any game iteration. To choose amongst the experts, we apply a twist on the well-studied Multiplicative Weights method (the standard approach needs to be modified since our expert set keeps changing throughout the game and since we require additional adaptivity properties). In order to improve efficiency, we prune the set of experts which are added online. We formally define the properties required of the ground set of experts, and the resulting recipe turns out to be an easily expressible abstract streaming problem. Incorporating the data-streaming ideas yields an efficient algorithm.

3.1. Algorithm description

The basic algorithm, which we refer to as Follow-the-Leading-History (FLH), is detailed in Algorithm 1 below. It uses many online algorithms, each attaining good regret for a different segment in history, and chooses the best one using expert-tracking algorithms. The experts are themselves algorithms, each starting to predict from a different point in history. The meta-algorithm used to track the best expert is inspired by the Herbster-Warmuth algorithm (1998). However, our set of experts continuously changes, as more algorithms are considered and others are discarded.

The experts that we use, denoted by $E^1, \dots, E^T$, are low-regret algorithms that use different starting points in time to make predictions. The expert $E^t$ will be a standard online algorithm that starts making predictions from round t and does not consider the history before time t. Maintaining all of these experts simultaneously would be too time consuming. For the sake of efficiency, we maintain a working set of experts, $S_t$, that changes every round. At round t, the pertinent set of experts is $\{E^1, \dots, E^t\}$ (abusing notation, we will refer to experts by their indices, so this set is just [1, t]). The set $S_t$ will be a very sparse subset of [1, t]. After round t, $S_t$ is updated to get $S_{t+1}$. This is done by adding t + 1 (the expert $E^{t+1}$) and removing some experts from $S_t$. Once removed, an expert cannot be added back to the working set.

The problem of maintaining the set of active experts can be thought of as the following abstract data-streaming problem. Suppose the integers 1, 2, ... are being "processed" in a streaming fashion. At time t, we have "read" the positive integers up to t and maintain a very small subset of them, $S_t$. At time t we create $S_{t+1}$ from $S_t$: we are allowed to add to $S_t$ only the integer t + 1, and to remove any integer already in $S_t$. Our aim is to maintain a short "sketch" of the data seen so far.
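Stated as code, the streaming constraint is simply that the only element that may be inserted at time t is t itself; which elements to discard is the entire design question, and it is left abstract in this toy sketch (ours, purely illustrative):

def working_sets(T, prune):
    """Yield S_1, ..., S_T under the streaming constraint: at time t the only
    element that may be added is t; prune(S, t) chooses which members to drop."""
    S = set()
    for t in range(1, T + 1):
        S.add(t)                  # the single allowed insertion at time t
        S -= prune(S, t)          # arbitrary deletions; dropped elements never return
        yield set(S)

# With no pruning, |S_t| = t, which is too expensive; the next section gives a
# pruning rule that keeps |S_t| = O(log t) while still "covering" all of history.
sizes = [len(S) for S in working_sets(16, prune=lambda S, t: set())]
print(sizes)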
We now describe the algorithm Follow-the-Leading-History (FLH). For the sake of clarity, we first explain separately how the working set of experts is maintained.

Generation of working sets: We maintain the working sets using an algorithm due to Woodruff (2007) (there is another, randomized solution to this streaming problem due to (Gopalan et al., 2007), which is simpler to apply but gives somewhat worse bounds). Any positive integer i can be uniquely written as $i = r2^k$ where r is odd. Let the lifetime of the integer i be $2^{k+2} + 1$. If the lifetime of i is m, then for any time $t \in [i, i+m]$, the integer i is alive at t. The set $S_t$ is simply the set of all integers that are alive at time t. Obviously, at time t, the only integer added to $S_t$ is t. Woodruff proved that the scheme above maintains the following properties. The first, and most important, property essentially means that $S_t$ is "well spread out" on a logarithmic scale; this is depicted in Figure 1.

Figure 1. The set $S_t$.

Property 3.1.
1. For every positive $s \le t$, $[s, (s+t)/2] \cap S_t \neq \emptyset$.
2. For all t, $|S_t| = O(\log T)$.
3. For all t, $S_{t+1} \setminus S_t = \{t+1\}$.

Algorithm 1 Follow-the-Leading-History (FLH)
1: Let $E^1, \dots, E^T$ be online convex optimization algorithms. Let $S_t \subseteq [t]$ be a set of experts, with $S_1 = \{1\}$. Initialize $p_1^{(1)} = 1$; for every t, $p_t$ is a distribution over $S_t$.
2: for t = 1 to T do
3:   For each $j \in S_t$, set $x_t^{(j)} \leftarrow E^j(f_{t-1})$ (the prediction of the j-th algorithm) and play $x_t = \sum_{j \in S_t} p_t^{(j)} x_t^{(j)}$.
4:   Multiplicative update: after receiving $f_t$, set $\hat{p}_{t+1}^{(t+1)} = 0$ and perform the update, for $i \in S_t$:
     $\hat{p}_{t+1}^{(i)} = \dfrac{p_t^{(i)}\, e^{-\alpha f_t(x_t^{(i)})}}{\sum_{j \in S_t} p_t^{(j)}\, e^{-\alpha f_t(x_t^{(j)})}}$
5:   Addition step: set $p_{t+1}^{(t+1)} = 1/(t+1)$ and, for $i \neq t+1$, $p_{t+1}^{(i)} = (1 - (t+1)^{-1})\,\hat{p}_{t+1}^{(i)}$.
6:   Pruning step: update $S_t$ to the set $S_{t+1}$. For all $i \in S_{t+1}$:
     $p_{t+1}^{(i)} \leftarrow \dfrac{p_{t+1}^{(i)}}{\sum_{j \in S_{t+1}} p_{t+1}^{(j)}}$
7: end for

Note that the running time per round is bounded by the size of the sets $S_t$ times the running time of each expert. For efficiency, it is crucial that the sets $S_t$ be very small. Yet for maintaining low adaptive regret, we will need the first and third properties, i.e. the well-spread-out nature of $S_t$ and the fact that every new integer "gets a chance" and is always added.
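Before turning to the analysis, here is a minimal runnable sketch of Algorithm 1 (our own illustration, not the authors' implementation), instantiated on the one-dimensional square-loss example of Section 1.1: each expert E^j is follow-the-leader restarted at round j, the losses (x - y_t)^2 on [-1, 1] are alpha-exp-concave for alpha = 1/8, and the working sets follow the lifetime rule described above.

import math

def lifetime(i):
    k = (i & -i).bit_length() - 1            # i = r * 2**k with r odd
    return 2 ** (k + 2) + 1

class FTL:
    """Follow-the-leader for square loss; one fresh instance per expert E^j."""
    def __init__(self):
        self.total, self.count = 0.0, 0
    def predict(self):
        return self.total / self.count if self.count else 0.0
    def update(self, y):
        self.total += y
        self.count += 1

def flh(ys, alpha=0.125):
    experts, p = {1: FTL()}, {1: 1.0}        # S_1 = {1}, p_1^(1) = 1
    plays = []
    for t, y in enumerate(ys, start=1):
        xs = {j: e.predict() for j, e in experts.items()}
        plays.append(sum(p[j] * xs[j] for j in experts))      # x_t = sum_j p_t^(j) x_t^(j)
        # multiplicative update over S_t
        w = {j: p[j] * math.exp(-alpha * (xs[j] - y) ** 2) for j in experts}
        z = sum(w.values())
        p_hat = {j: v / z for j, v in w.items()}
        for e in experts.values():                            # experts in S_t observe f_t
            e.update(y)
        # addition step: expert t+1 enters with weight 1/(t+1)
        p = {j: (1 - 1 / (t + 1)) * v for j, v in p_hat.items()}
        p[t + 1] = 1 / (t + 1)
        experts[t + 1] = FTL()
        # pruning step: keep only experts alive at time t+1, then renormalize
        experts = {j: e for j, e in experts.items() if j + lifetime(j) >= t + 1}
        z = sum(p[j] for j in experts)
        p = {j: p[j] / z for j in experts}
    return plays

T = 1000
ys = [1.0] * (T // 2) + [-1.0] * (T // 2)
plays = flh(ys)
# Loss on the second half: much smaller for FLH than the roughly T paid by plain
# follow-the-leader on the same interval.
print(sum((x - y) ** 2 for x, y in zip(plays[T // 2:], ys[T // 2:])))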
We henceforth prove the following theorem:

Theorem 3.1. If all loss functions are $\alpha$-exp-concave, then the FLH algorithm attains adaptive regret of $O(\alpha^{-1} \log^2 T)$. The running time per iteration is $O(n^3 \log T)$.

3.2. Proof of performance guarantees

The low adaptive regret guarantee of FLH is a consequence of the following two lemmas. The first lemma is obtained as a consequence of the "expert" algorithm applied to the set of experts $S_t$.

Lemma 3.1. For any interval I = [r, s] in time, suppose that $E^r$ stays in the working set throughout I. Then the algorithm FLH attains $O(\alpha^{-1}(\ln r + \ln|I|))$ regret with respect to the best optimum in hindsight for I.

Before proving this lemma, let us state the following lemma, in which the streaming algorithm comes into effect, and use it to prove the main theorem.

Lemma 3.2. For any interval I = [r, s], the regret incurred by FLH on I is at most $O(\frac{1}{\alpha}\log s \cdot \log|I| + 1)$.

Proof. Let $|I| \in [2^k, 2^{k+1})$. We prove the bound by induction on k.

Base case: for intervals of constant length, the bound is an easy consequence of the fact that the cost functions are bounded in absolute value by one, so the regret on such an interval is at most a constant, which is absorbed by the additive term in the bound.

Induction step: By the properties of the working sets $\{S_t\}$, there is an expert $E^i$ in the working set $S_s$ at time s, with $i \in [r, (r+s)/2]$, such that the following holds: the expert $E^i$ entered the working set at time i and stayed in it throughout [i, s]. By Lemma 3.1, the algorithm incurs adaptive regret, and hence regret, of at most $\frac{c_1}{\alpha}(\log i + \log|I|)$ on [i, s], for some constant $c_1 \ge 0$. Formally, since $i, |I| \le s$, for all $x^*$,

$\sum_{t=i}^{s} \big(f_t(x_t) - f_t(x^*)\big) \le \frac{c_1}{\alpha}(\log i + \log|I|) \le \frac{2c_1}{\alpha}\log s.$

The interval $I_2 = [r, i-1]$ has size at most half of the entire interval I, so $|I_2| < 2^k$, and by induction the algorithm has regret of at most $\frac{c_2}{\alpha}\log|I_2| \cdot \log i \le \frac{c_2}{\alpha} k \log s$ on this interval for some constant $c_2 > 0$, i.e. for all $x^*$,

$\sum_{t=r}^{i-1} \big(f_t(x_t) - f_t(x^*)\big) \le \frac{c_2}{\alpha} k \log s.$

Combining both previous inequalities, for all $x^*$,

$\sum_{t=r}^{s} \big(f_t(x_t) - f_t(x^*)\big) \le \frac{1}{\alpha}(c_2 k + 2c_1)\log s \le \frac{c_2}{\alpha}(k+1)\log s \le \frac{2c_2}{\alpha}\log s \cdot \log|I|,$

which proves the induction hypothesis, up to the constant in the O-notation, for any constant $c_2$ satisfying $c_2 \ge 2c_1$.

We can now deduce Theorem 1.1. By Fact 2.1, the running time of FLH per iteration is bounded by $O(|S_t| \cdot n^3)$. Since $|S_t| = O(\log t)$, we can bound the running time per iteration by $O(n^3 \log T)$. This fact, together with Lemma 3.2, completes the proof of Theorem 3.1. Note that by always keeping $E^1$ in the working set, we can ensure that the (standard) regret remains bounded by $O(\log T)$.

We proceed to complete the missing step in the proof above, namely Lemma 3.1. By assumption, expert $E^r$ attains $O(\frac{1}{\alpha}\log|I|)$ regret on the interval I (henceforth, the time interval I is always [r, s]). We will show that FLH is competitive with expert $E^r$ on I. To prove Lemma 3.1, it therefore suffices to prove the following claim.

Claim 3.1. For any I = [r, s], suppose that $E^r$ stays in the working set throughout I. Then the regret incurred by FLH on I with respect to expert $E^r$ is at most $\frac{2}{\alpha}(\ln r + \ln|I|)$.

We first prove the following claim, which bounds the regret in any single round; Claim 3.1 then follows by summing these bounds over all rounds of I.

Claim 3.2.
1. For $i \in S_t$, $f_t(x_t) - f_t(x_t^{(i)}) \le \alpha^{-1}\big(\ln \hat{p}_{t+1}^{(i)} - \ln \hat{p}_t^{(i)} + 2/t\big)$.
2. $f_t(x_t) - f_t(x_t^{(t)}) \le \alpha^{-1}\big(\ln \hat{p}_{t+1}^{(t)} + \ln t\big)$.

Proof. Using the $\alpha$-exp-concavity of $f_t$,

$e^{-\alpha f_t(x_t)} = e^{-\alpha f_t\left(\sum_{j \in S_t} p_t^{(j)} x_t^{(j)}\right)} \ge \sum_{j \in S_t} p_t^{(j)} e^{-\alpha f_t(x_t^{(j)})}.$

Taking logarithms, $f_t(x_t) \le -\alpha^{-1}\ln \sum_{j \in S_t} p_t^{(j)} e^{-\alpha f_t(x_t^{(j)})}$. Hence,

$f_t(x_t) - f_t(x_t^{(i)}) \le \alpha^{-1}\Big(\ln e^{-\alpha f_t(x_t^{(i)})} - \ln \sum_{j \in S_t} p_t^{(j)} e^{-\alpha f_t(x_t^{(j)})}\Big) = \alpha^{-1}\ln\frac{e^{-\alpha f_t(x_t^{(i)})}}{\sum_{j \in S_t} p_t^{(j)} e^{-\alpha f_t(x_t^{(j)})}} = \alpha^{-1}\ln\frac{\hat{p}_{t+1}^{(i)}}{p_t^{(i)}}. \quad (1)$

The claim is now obtained using the bounds of Claim 3.3 below.

Claim 3.3.
1. For $i \in S_t$, $\ln p_t^{(i)} \ge \ln \hat{p}_t^{(i)} - 2/t$.
2. $\ln p_t^{(t)} \ge -\ln t$.

Proof. For $i \in S_t$, $p_t^{(i)} \ge (1 - 1/t)\,\hat{p}_t^{(i)}$. Also, $p_t^{(t)} \ge 1/t$. Taking the natural logarithm of both these inequalities completes the proof.

We are now ready to prove Claim 3.1.

Proof. (Claim 3.1) We bound the regret on I with respect to expert $E^r$. Since $r \in S_t$ for every $t \in I$, we can apply Claim 3.2:

$\sum_{t=r}^{s}\big(f_t(x_t) - f_t(x_t^{(r)})\big) = \big(f_r(x_r) - f_r(x_r^{(r)})\big) + \sum_{t=r+1}^{s}\big(f_t(x_t) - f_t(x_t^{(r)})\big) \le \alpha^{-1}\big(\ln \hat{p}_{r+1}^{(r)} + \ln r\big) + \sum_{t=r+1}^{s} \alpha^{-1}\big(\ln \hat{p}_{t+1}^{(r)} - \ln \hat{p}_t^{(r)} + 2/t\big) = \alpha^{-1}\Big(\ln r + \ln \hat{p}_{s+1}^{(r)} + \sum_{t=r+1}^{s} 2/t\Big).$

Since $\hat{p}_{s+1}^{(r)} \le 1$, we have $\ln \hat{p}_{s+1}^{(r)} \le 0$. This implies that the regret is bounded by $2\alpha^{-1}(\ln r + \ln|I|)$.

4. Experimental evidence in support of adaptive regret

We implemented the Online Newton Step (ONS) algorithm of (Hazan et al., 2006; Agarwal et al., 2006), as well as its adaptive version given by the above reductions.
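For reference, here is a compact sketch of the ONS update for portfolio selection (our own simplified rendering, not the implementation used in these experiments; the step size beta, the initialization eps * I, and the use of a generic solver for the generalized projection onto the simplex are illustrative choices):

import numpy as np
from scipy.optimize import minimize

def ons_portfolio(returns, beta=2.0, eps=1.0):
    """Online Newton Step on the simplex with losses f_t(x) = -log(x . r_t)."""
    _, n = returns.shape
    x = np.ones(n) / n
    A = eps * np.eye(n)
    wealth = 1.0
    for r in returns:
        wealth *= float(x @ r)
        grad = -r / (x @ r)                      # gradient of f_t at the played point
        A += np.outer(grad, grad)
        y = x - np.linalg.solve(A, grad) / beta  # Newton-style step
        # generalized projection onto the simplex in the norm induced by A
        res = minimize(lambda z: (z - y) @ A @ (z - y), x, method="SLSQP",
                       bounds=[(0.0, 1.0)] * n,
                       constraints=[{"type": "eq", "fun": lambda z: z.sum() - 1.0}])
        x = np.clip(res.x, 0.0, None)
        x /= x.sum()
    return wealth

# Example run on synthetic return data (not the S&P 500 data used in the paper):
rng = np.random.default_rng(1)
print(ons_portfolio(rng.uniform(0.98, 1.02, size=(500, 10))))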
We used the exact same data set as (Agarwal et al., 2006) to repeat the same tests for these two algorithms. The tests are performed over a ground set of 50 random S&P 500 stocks and a period of 1000 trading days. Over this ground set of stocks, we chose $n \in \{5, 10, 15, 20, 25\}$ random stocks and applied the two algorithms, which were allowed to trade once every five trading days. This set of n random stocks is sampled one hundred times, and the final result is averaged. Figure 2 depicts the performance of ONS and its adaptive version in terms of APY (annual percentage yield).

Figure 2. Standard ONS vs. Adaptive ONS.

As can be clearly seen, adaptivity gives a consistent one to two percent improvement in terms of APY, which amounts to roughly a ten percent improvement in performance. Similar performance gains were observed in the other tests proposed by (Agarwal et al., 2006).

5. Conclusions and Future Work

In this paper we have investigated the notion of learning in a changing environment, and introduced online convex optimization algorithms which are adaptive in the sense that they converge to the local optimum of a contiguous time interval at almost the fastest possible rate. This is a generalization of previous adaptive algorithms, which were designed for the discrete setting, and allows us to tackle continuous problems such as portfolio management, or problems with a large decision set such as online shortest paths. As opposed to previous approaches, we base our reductions on streaming techniques and thus overcome the efficiency barrier which was prohibitive before. However, the sketching/streaming techniques are tailored to the particular notion of adaptivity we consider, namely time-locality, i.e. optimality on a contiguous interval in time. An interesting research direction is to generalize our result to even stronger notions of adaptivity, such as the "swap regret" notion of Blum and Mansour.

References

Agarwal, A., Hazan, E., Kale, S., & Schapire, R. (2006). Algorithms for portfolio management based on the Newton method. Proceedings of the 23rd International Conference on Machine Learning (ICML) (pp. 9-16).

Blum, A., & Mansour, Y. (2007). From external to internal regret. Journal of Machine Learning Research, 8, 1307-1324.

Bousquet, O., & Warmuth, M. K. (2003). Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3, 363-396.

Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge University Press.

Cover, T. (1991). Universal portfolios. Mathematical Finance, 1, 1-19.

Freund, Y., Schapire, R. E., Singer, Y., & Warmuth, M. K. (1997). Using and combining predictors that specialize. Proceedings of the Twenty-Ninth Annual ACM Symposium on the Theory of Computing (pp. 334-343).

Gopalan, P., Jayram, T. S., Krauthgamer, R., & Kumar, R. (2007). Estimating the sortedness of a data stream. SODA (pp. 318-327). SIAM.

Hazan, E., Kalai, A., Kale, S., & Agarwal, A. (2006). Logarithmic regret algorithms for online convex optimization. Proceedings of the 19th Annual Conference on Learning Theory (COLT) (pp. 499-513).

Hazan, E., & Seshadhri, C. (2007). Adaptive algorithms for online decision problems. Electronic Colloquium on Computational Complexity (ECCC), 14.

Herbster, M., & Warmuth, M. K. (1998). Tracking the best expert. Machine Learning, 32, 151-178.

Kozat, S., & Singer, A. (2007). Universal constant rebalanced portfolios with switching. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1129-1132).

Lehrer, E. (2003). A wide range no-regret theorem. Games and Economic Behavior, 42, 101-115.
Woodruff, D. (2007). Personal communication.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. Proceedings of the Twentieth International Conference on Machine Learning (ICML) (pp. 928-936).

A. The streaming problem

We now explain Woodruff's solution for maintaining the sets $S_t \subseteq [1, n]$ in a streaming manner. We specify the lifetime of an integer i: if $i = r2^k$, where r is odd, then the lifetime of i is $2^{k+2} + 1$. Suppose the lifetime of i is m. Then for any time $t \in [i, i+m]$, the integer i is alive at t. The set $S_t$ is simply the set of all integers that are alive at time t. Obviously, at time t, the only integer added to $S_t$ is t; this immediately proves Property (3). We now prove the other two properties.

Proof. (Property (1)) We need to show that some integer in $[s, (s+t)/2]$ is alive at time t. This is trivially true when $t - s < 2$, since $t-1, t \in S_t$. Otherwise, let $2^\ell$ be the largest power of 2 such that $2^\ell \le (t-s)/2$. There is some integer $x \in [s, (s+t)/2]$ such that $2^\ell$ divides x. The lifetime of x is at least $2^{\ell+2} + 1 > t - s$ (since $2^{\ell+1} > (t-s)/2$), and hence x is alive at t.

Proof. (Property (2)) For $0 \le k \le \log t$, let us count the number of integers of the form $r2^k$ (r odd) that are alive at t. The lifetime of such an integer is $2^{k+2} + 1$, so the only ones alive at t lie in the interval $[t - 2^{k+2} - 1, t]$. Since integers of this form are separated by gaps of at least $2^k$, at most a constant number of them are alive at t. In total, the size of $S_t$ is $O(\log t)$.
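For completeness, a small script (ours, purely illustrative) that checks Property 3.1 numerically for the lifetime rule above:

def lifetime(i):
    k = (i & -i).bit_length() - 1            # i = r * 2**k with r odd
    return 2 ** (k + 2) + 1

def S(t):
    return {i for i in range(1, t + 1) if i + lifetime(i) >= t}

for t in range(2, 300):
    St = S(t)
    # Property 1: every interval [s, (s+t)/2] with s <= t intersects S_t
    assert all(any(s <= x <= (s + t) / 2 for x in St) for s in range(1, t + 1))
    # Property 2: |S_t| = O(log t)
    assert len(St) <= 4 * (t.bit_length() + 1)
    # Property 3: only t+1 is newly added
    assert S(t + 1) - St == {t + 1}
print("Property 3.1 holds for all t < 300")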