Minimizing Wide Range Regret with Time Selection Functions

Subhash Khot and Ashok Kumar Ponnuswami
New York University and Georgia Tech
khot@cs.nyu.edu and pashok@cc.gatech.edu

Abstract

We consider the problem of minimizing regret with respect to a given set $S$ of pairs of time selection functions and modification rules. We give an online algorithm that has $O(\sqrt{T \log |S|})$ regret with respect to $S$ when the algorithm is run for $T$ time steps and there are $N$ actions allowed. This improves the upper bound of $O(\sqrt{T N \log(|\mathcal{I}||F|)})$ given by Blum and Mansour [BM07a] for the case when $S = \mathcal{I} \times F$ for a set $\mathcal{I}$ of time selection functions and a set $F$ of modification rules. We do so by giving a simple reduction that uses an online algorithm for external regret as a black box.

1 Introduction

We consider the following online optimization problem. At the beginning of each day (a time step), we have to choose one of the $N$ allowed actions. Instead of picking one action deterministically, we may come up with a distribution over the actions. At the end of the day, an adversary, with the knowledge of the distribution we picked, fixes a loss for each action. We give a concrete example from Cesa-Bianchi et al. [CBFH+97]. Suppose we want to predict the probability that it rains on a day based on the predictions of $N$ weather forecasting websites. But we don't know which of these "experts" give good forecasts. We come up with some weights on the websites using an online algorithm and use the weighted prediction as our guess for the probability of rain. At the end of the day, based on whether or not it rained, everyone incurs a loss depending on how inaccurate their prediction was. Usually it is assumed that the loss for each action is picked from a fixed interval, like $[0, 1]$. For example, we could charge a person who predicts $p$ as the probability of rain $1 - p$ if it rains and $p$ if it does not. After $T$ days, we compare the loss incurred by the online algorithm we used to the loss incurred if we had followed a simple strategy (like just picking the same action each day). Our goal is to minimize our regret for not following one of the simple strategies. One may also compare the algorithm's performance to the performance if the distribution over actions at each time step were modified using a certain set of rules. We consider the problem of designing algorithms with low regret with respect to a given set of strategies or modification rules.

The most basic regret studied is external regret, which is the difference between the loss incurred by the algorithm and the loss incurred by the best action in hindsight. Another kind of regret commonly studied is called internal regret, introduced by Foster and Vohra [FV98]. Here, we consider the set of modification rules where for each pair $(a, b)$ of actions we have a rule of the kind: every time the algorithm suggests picking $a$, pick $b$ instead. The internal regret of the algorithm is the regret of not having applied one of these modification rules. Each rule here can be considered as a function $f_{a,b}$ that maps every action to itself, except action $a$, which gets mapped to $b$. If we consider the set of modification rules corresponding to all functions mapping the set of actions into itself, we get the notion of swap regret. Finally, we can allow any subset of these mappings as the set of allowed modification rules, which gives the notion of wide range regret. This was defined by Lehrer [Leh03].
Lehrer also associates a time selection function with each rule, indicating whether the rule is "active" at a given time. A related model is that of "sleeping experts" or "specialists" defined in Freund et al. [FSSW97]. Here, at the beginning of time $t$, each specialist can decide whether or not the current situation falls in her area of specialty and makes a prediction only if it does. In addition, Blum and Mansour [BM07a] consider the case where the experts can be "partially awake". One way to interpret the activeness function is that it measures the degree of confidence that the corresponding rule will perform well at a given time. In this case, we weigh the loss incurred by the algorithm and the modified action with the time selection function to calculate the regret.

The first algorithm with external regret sublinear in $T$ was developed by Hannan [Han57]. An algorithm whose external regret has only logarithmic dependence on $N$ was given by Littlestone and Warmuth [LW94] and Cesa-Bianchi et al. [CBFH+97].

Lemma 1 ([CBFH+97]) There exists an online algorithm with external regret at most $O(\sqrt{T \log N})$ when the losses are picked from $[-1, +1]$. The running time is polynomial in $T$ and $N$. The number of time steps $T$ for which it will be run need not be provided as an input to the above algorithm.
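For concreteness, the following is a minimal Python sketch of an exponentially weighted average forecaster, the standard kind of algorithm behind bounds such as Lemma 1. The class and method names (ExpWeights, probs, update) and the externally supplied learning rate eta are illustrative assumptions, not the exact algorithm of Lemma 1.

```python
import math

class ExpWeights:
    """Exponentially weighted average forecaster over N actions.

    A minimal sketch: losses may lie in [-1, +1]; eta is a fixed learning
    rate (e.g. of order sqrt(log N / T) when T is known in advance).
    """

    def __init__(self, n_actions, eta):
        self.eta = eta
        self.log_w = [0.0] * n_actions   # log-weights, start uniform

    def probs(self):
        # Normalize exp(log_w) into a probability vector (shift by max for stability).
        m = max(self.log_w)
        w = [math.exp(lw - m) for lw in self.log_w]
        s = sum(w)
        return [x / s for x in w]

    def update(self, losses):
        # Multiplicative-weights update: w_a <- w_a * exp(-eta * loss_a).
        for a, loss in enumerate(losses):
            self.log_w[a] -= self.eta * loss
```

A caller queries probs() at the start of each round and reports the round's loss vector through update(); this per-round interface is the kind of black-box access to $H_{ext}$ that the reduction in Section 4 relies on.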
Stoltz and Lugosi [SL05] give a general method to convert any "weighted average predictor" algorithm for external regret into a low internal or swap regret algorithm. At a high level, they pretend there is an expert for each modification rule who always suggests using that rule. At each time step, the expert is charged the loss that would be incurred if his modification rule were actually used. The weighted average predictor gives a distribution over the experts. The distribution over the actual actions is found by computing a fixed point of the expected modification rule picked from the distribution over the experts. This gives algorithms with $O(\sqrt{T \log N})$ internal regret and $O(\sqrt{T N \log N})$ swap regret. Our approach for wide range regret with time selection functions is based on the same idea.

A drawback of a swap regret algorithm constructed this way is that it needs to maintain $N^N$ weights. Blum and Mansour [BM07a] give an algorithm that has $O(\sqrt{T N \log N})$ swap regret and runs in time polynomial in $N$ too. They also give an algorithm that has $O(\sqrt{T N \log(KM)})$ regret with respect to $K$ modification rules and $M$ time selection functions. Here, for each modification rule and time selection function, one considers the regret of not having modified the algorithm's action by the rule, with the losses weighed by the time selection function. In this case, we can think of there being $M$ people who are interested in following an algorithm's predictions. They have varying degrees of importance associated with each day (given by their corresponding time selection function) and want to minimize regret with respect to all the modification rules. The algorithm's goal is to minimize the maximum regret of a person. This is a bit different from the model considered in Lehrer [Leh03]. But with some effort, one can check that the result of Blum and Mansour [BM07a] can be generalized to the model of Lehrer [Leh03]. We refer the reader to [BM07b] for other bounds on regret minimization and the relation of various kinds of regret to equilibria in games.

The paper is organized as follows. In the next section, we define the model we work with formally and state our main result. We state the ideas we use from related results in Section 3. We prove our main result of an improved upper bound for wide range regret in Section 4. We conclude with a "first-order" upper bound in Section 5.

2 Our Model and Result

Let the set of actions be $[N] = \{1, 2, \ldots, N\}$. Consider the following $T$ round game between an online algorithm $H$ and an adversary. At the beginning of time $t = 1, 2, \ldots, T$, the algorithm picks a probability vector $p^t = (p^t_1, p^t_2, \ldots, p^t_N)$. (A probability vector is a vector in which the entries are non-negative and sum to 1. All vectors we consider are column vectors, and we use $^\top$ to denote the transpose.) The adversary then picks the loss vector $l^t = (l^t_1, l^t_2, \ldots, l^t_N)$ for time $t$. The entries of $l^t$ are picked from a fixed interval. In this paper, we assume the losses are either picked from $[0, 1]$ or from $[-1, +1]$. Define the regret of $H$ with respect to action $a \in [N]$ to be

  $R_{H,a} = \sum_{t=1}^{T} \sum_{b \in [N]} p^t_b (l^t_b - l^t_a)$.

This can be interpreted as the difference between the expected loss of $H$ and the loss of action $a$. Define the external regret of $H$ to be $R_{H,ext} = \max_{a \in [N]} R_{H,a}$.

We now define the model with time selection functions from Blum and Mansour [BM07a]. A time selection function is a function $I : \mathbb{N} \to [0, 1]$. Let $\mathcal{I}$ be the set of time selection functions. At the beginning of time $t$, the adversary sets the values of $I(t)$ for each $I \in \mathcal{I}$. The algorithm then picks $p^t$, after which the adversary picks $l^t$ as before. Given a modification rule $f : [N] \to [N]$, define $M_f$ to be the matrix with a 1 in column $f(i)$ of row $i$ for all $i$ and zeros everywhere else. Define the regret of $H$ with respect to time selection function $I$ and modification rule $f$ to be

  $R_{H,I,f} = \sum_t I(t) \sum_{a \in [N]} p^t_a (l^t_a - l^t_{f(a)}) = \sum_t I(t) \big(p^{t\top} l^t - p^{t\top} M_f l^t\big)$.

Informally, we first weigh all the losses at time $t$ by $I(t)$, the significance attached to time $t$. Then we look at the difference between the expected loss of $H$ and the expected loss if the output of $H$ were modified every time by applying $f$. That is, we measure the regret of not having played action $f(a)$ every time we played $a$. Given a set $S$ of pairs $(I, f)$, where $I$ is a time selection function and $f$ is a modification rule, the wide range regret of $H$ with respect to $S$ is defined as $R_{H,S} = \max_{(I,f) \in S} R_{H,I,f}$.

Let $\mathbf{1} : \mathbb{N} \to [0, 1]$ be the function that always outputs 1, i.e., $\mathbf{1}(t) = 1$. For simplicity of notation, we will use $f$ to also denote the pair $(\mathbf{1}, f)$ when we are not concerned with time selection functions, in which case we assume that the adversary always sets $\mathbf{1}(t)$ to 1. It is easy to check that external regret is the same as $R_{H,F_{ext}}$ where $F_{ext} = \{f_a\}_{a \in [N]}$ and $\forall b \in [N] : f_a(b) = a$. The internal regret of $H$ is defined to be $R_{H,F_{int}}$, where $F_{int} = \{f_{a,b}\}_{a,b \in [N]}$ and $f_{a,b}(a) = b$ while $f_{a,b}(c) = c$ for $c \neq a$. The swap regret of $H$ is defined to be $R_{H,F_{swap}}$, where $F_{swap}$ is the set of all functions $f : [N] \to [N]$.

We prove the following theorem for minimizing wide range regret.

Theorem 2 There exists an online algorithm $H$ that for any given set $S$ satisfies
· $R_{H,S} = O(\sqrt{T \log |S|})$ when the losses are picked from $[0, 1]$.
· The running time of $H$ is polynomial in $T$, $N$ and $|S|$.

Note that this matches (up to a constant) the results for external, internal and swap regret if we are not concerned with time selection functions.
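To make the definitions concrete, the following Python sketch computes $R_{H,I,f}$ and $R_{H,S}$ from a recorded history of play. The data layout (probability and loss vectors as lists, modification rules as Python functions over 0-indexed actions, 1-based time) is an assumption made purely for illustration.

```python
def regret_wrt(I, f, ps, ls):
    """R_{H,I,f}: I(t)-weighted regret of not replacing each played action a by f(a).

    I  : time selection function, I(t) in [0, 1] (t is 1-based here)
    f  : modification rule, a map from actions {0, ..., N-1} to actions
    ps : ps[t-1] is the probability vector p^t the algorithm played at time t
    ls : ls[t-1] is the loss vector l^t the adversary chose at time t
    """
    total = 0.0
    for t, (p, l) in enumerate(zip(ps, ls), start=1):
        alg_loss = sum(pa * la for pa, la in zip(p, l))           # p^t . l^t
        mod_loss = sum(p[a] * l[f(a)] for a in range(len(p)))     # p^t M_f l^t
        total += I(t) * (alg_loss - mod_loss)
    return total

def wide_range_regret(S, ps, ls):
    """R_{H,S} = max over (I, f) in S of R_{H,I,f}."""
    return max(regret_wrt(I, f, ps, ls) for I, f in S)
```

For instance, taking $S$ to consist of the constant rules $f_a$ (each paired with the all-ones time selection function) makes wide_range_regret return the external regret, matching the remark above.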
A drawback of our approach is that if the size of the set $S$ is large, the running time is high. For example, for swap regret with time selection functions, we may need time polynomial in $T$ and $N^N$. But for this case, the result of Blum and Mansour already gives a more efficient algorithm with the same regret (up to a constant).

3 Previous Results

We use ideas from Stoltz and Lugosi [SL05] and Blum and Mansour [BM07a]. We first describe the approach of Stoltz and Lugosi [SL05] for internal regret. The idea is to simulate a low external regret algorithm for $N(N-1)$ imaginary experts. Start with any "weighted average predictor" $H_{ext}$ with low external regret. There are $N(N-1)$ imaginary experts, one for each modification rule $f_{a,b}$. The expert corresponding to $f_{a,b}$ always suggests playing $b$ instead of $a$. We will specify how the probability weights over the actual actions are calculated from the output of $H_{ext}$ and how the losses are generated for the imaginary experts of $H_{ext}$.

At time $t$, suppose $H_{ext}$ outputs probability $q^t_{a,b}$ for the expert corresponding to $f_{a,b}$. Then compute the probability vector $p^t = (p^t_1, p^t_2, \ldots, p^t_N)$ on the actual actions as a fixed point of

  $p^t = \sum_{a,b \in [N]} q^t_{a,b}\, p^t_{a \to b}$,

where $p^t_{a \to b}$ denotes the probability vector obtained from $p^t$ by changing the weight of action $a$ to zero and putting it on action $b$. This can also be expressed as

  $p^{t\top} = \sum_{a,b} q^t_{a,b}\, p^{t\top} M_{f_{a,b}} = p^{t\top} \sum_{a,b} q^t_{a,b} M_{f_{a,b}}$.

Let the adversary return $l^t$ as the loss vector at time $t$. The loss incurred at time $t$ by each of the imaginary experts for $f_{a,b}$ is calculated as

  $l^t_{f_{a,b}} = \sum_i p^t_i\, l^t_{f_{a,b}(i)} = p^{t\top} M_{f_{a,b}} l^t$.

This quantity can be thought of as the loss incurred if we followed the expert's suggestion of playing $b$ instead of $a$. Stoltz and Lugosi [SL05] showed that this achieves low internal regret. For an arbitrary set of modification rules $F$, we have an expert for each modification rule $f \in F$, and the probability and loss vectors are now calculated as $p^{t\top} = p^{t\top} \sum_{f \in F} q^t_f M_f$ and $l^t_f = p^{t\top} M_f l^t$.
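The fixed point required here is a stationary distribution of the row-stochastic matrix $\sum_f q^t_f M_f$. The sketch below computes one by averaging a power-iteration trajectory (a Cesàro average), which is one standard way to do it; [SL05] does not prescribe a particular method, so the helper names and the iteration count are assumptions made for illustration.

```python
def stationary_distribution(Q, n_iters=2000):
    """Approximate a probability vector p with p = p Q, for row-stochastic Q.

    Averaging the power-iteration trajectory (a Cesaro average) converges to a
    stationary distribution of any finite Markov chain, including periodic ones.
    """
    n = len(Q)
    p = [1.0 / n] * n
    avg = [0.0] * n
    for _ in range(n_iters):
        p = [sum(p[i] * Q[i][j] for i in range(n)) for j in range(n)]
        avg = [a + x for a, x in zip(avg, p)]
    s = sum(avg)
    return [a / s for a in avg]

def combined_rule_matrix(q, rules, n):
    """Build Q = sum_f q_f * M_f, where M_f has a 1 in column f(i) of row i."""
    Q = [[0.0] * n for _ in range(n)]
    for weight, f in zip(q, rules):
        for i in range(n):
            Q[i][f(i)] += weight
    return Q
```

With q the distribution that $H_{ext}$ places on the rules and rules the corresponding Python functions, stationary_distribution(combined_rule_matrix(q, rules, N)) returns a vector $p$ with $p^\top = p^\top \sum_f q_f M_f$ up to numerical error.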
We now discuss the ideas we use from Blum and Mansour [BM07a]. We start with the case where $S = \mathcal{I} \times F_{ext}$ for some $\mathcal{I}$. In this case, there is an expert for each $(I, f_a) \in S$. There is a weight $w^t_{I,a}$ associated with this expert at the end of time $t$, where

  $w^t_{I,a} = \beta^{-\tilde{R}^t_{I,a}}$  and  $\tilde{R}^t_{I,a} = \sum_{t'=1}^{t} I(t')\big(\beta\, l^{t'}_H - l^{t'}_a\big)$

for some parameter $\beta \in (0, 1)$. Above, $l^t_H$ is the actual loss incurred at time $t$. The quantity $\tilde{R}^t_{I,a}$ is called a "less-strict" external regret. The probability $p^t_a$ associated with the action $a$ at time $t$ is then proportional to $\sum_{I \in \mathcal{I}} I(t)\, w^{t-1}_{I,a}$. By optimizing for the parameter $\beta$, Blum and Mansour [BM07a] show that this achieves a low external regret with respect to all time selection functions.

To generalize this idea for wide range regret, where $S = \mathcal{I} \times F$, they introduce an expert for each $a \in [N]$, $I \in \mathcal{I}$ and $f \in F$. There is a weight $w^t_{a,I,f}$ for each such expert. Note that this does not simplify to the reduction in the previous paragraph for the case when $F = F_{ext}$. Instead, in the next section we obtain a reduction where there are experts only for each $(I, f) \in S$. Intuitively, this is where we remove the polynomial dependence of wide range regret on $N$ and obtain a slightly simpler reduction.

4 A Reduction from Wide Range Regret to External Regret

We will prove Theorem 2 in this section. We first give an algorithm that, when given a low external regret algorithm as a black box, uses it to guarantee low wide range regret.

Theorem 3 Given an algorithm $H_{ext}$ with external regret $R(T, N)$ when the losses are from $[-1, +1]$, one can construct an algorithm $H$ that when given losses from $[0, 1]$ satisfies:
· $R_{H,S} \le R(T, |S|)$
· The running time of $H$ is polynomial in the running time of $H_{ext}$, $T$, $N$, and $|S|$.

Idea: $H$ will basically simulate an instance of $H_{ext}$ with the elements of $S$ being the actions. Figure 1 shows the inputs and outputs of $H$ and $H_{ext}$ at time $t$. At time $t$, $H_{ext}$ produces some $q^t_{I,f}$ for each $(I, f) \in S$, where the $q^t_{I,f}$ form a probability distribution over $S$. $H$ will then use this to come up with a probability vector $p^t = (p^t_1, p^t_2, \ldots, p^t_N)$ on the actual actions. $H$ will basically pick a random $(I, f)$ with probability proportional to $I(t)\, q^t_{I,f}$. After this, it picks a vector $p^t$ over the actual actions such that $p^t$ is a fixed point of such a random $f$, i.e., modifying $p^t$ by $f$ in expectation just yields $p^t$. Intuitively, the loss passed to the black box $H_{ext}$ for $(I, f)$ is such that $q^t_{I,f}$ measures the regret with respect to time selection function $I$ of not having modified the output of $H$ using the function $f$. Multiplying this by $I(t)$ takes care of the relevance of $(I, f)$ at time $t$. Basically, the algorithm makes sure that if the regret with respect to $(I, f)$ was large so far, then that regret doesn't increase at the current step.

[Figure 1: The reduction from wide range to external regret.]

Proof: We first specify how $H$ computes $p^t$ and the losses $l^t_{I,f}$ at time $t$. To compute $p^t$, get $q^t$ from $H_{ext}$. If $\sum_{(I,f) \in S} I(t)\, q^t_{I,f} = 0$, then output any probability vector $p^t$. Otherwise define $p^t$ to be any vector satisfying

  $p^{t\top} = p^{t\top}\, \dfrac{\sum_{(I,f) \in S} I(t)\, q^t_{I,f} M_f}{\sum_{(I,f) \in S} I(t)\, q^t_{I,f}}$.    (1)

This is well defined since $\sum_{(I,f)} I(t)\, q^t_{I,f} \neq 0$. Such a vector $p^t$ exists since every row of

  $\dfrac{\sum_{(I,f)} I(t)\, q^t_{I,f} M_f}{\sum_{(I,f)} I(t)\, q^t_{I,f}}$    (2)

is a probability vector because $M_f$ has exactly one 1 in each row. That is, (2) defines the transition matrix of a Markov chain. When $H$ gets back the loss vector $l^t$, it computes

  $l^t_{I,f} = I(t) \sum_{a} p^t_a \big(l^t_{f(a)} - l^t_a\big) = I(t)\, p^{t\top} (M_f - I_N)\, l^t$    (3)

where $I_N$ is the identity matrix. This yields

  $\sum_t l^t_{I,f} = \sum_t I(t)\, p^{t\top} (M_f - I_N)\, l^t = -R_{H,I,f}$.

That is, $l^t_{I,f}$ is exactly the decrease at time $t$ of the regret with respect to $(I, f)$. It is easy to check that $l^t_{I,f} \in [-1, +1]$. From the low external regret guarantee of $H_{ext}$, for all $(I, f) \in S$:

  $\sum_t \sum_{(J,g) \in S} q^t_{J,g}\, l^t_{J,g} \le \sum_t l^t_{I,f} + R(T, |S|)$.    (4)

We will next show that, for every $t$,

  $\sum_{(J,g) \in S} q^t_{J,g}\, l^t_{J,g} = 0$.    (5)

Together with (3) and (4), this will show that for all $(I, f) \in S$, $0 \le -R_{H,I,f} + R(T, |S|)$, or $R_{H,I,f} \le R(T, |S|)$, which proves the theorem. We now proceed to prove (5).

(Case 1:) Suppose $\sum_{(J,g)} J(t)\, q^t_{J,g} \neq 0$. In this case we can use (1) to get

  $\sum_{(J,g) \in S} q^t_{J,g}\, l^t_{J,g} = \sum_{(J,g)} q^t_{J,g}\, J(t)\, p^{t\top} (M_g - I_N)\, l^t$
  $= \sum_{(J,g)} q^t_{J,g}\, J(t)\, p^{t\top} M_g l^t - \sum_{(J,g)} q^t_{J,g}\, J(t)\, p^{t\top} l^t$
  $= p^{t\top} \Big( \sum_{(J,g)} J(t)\, q^t_{J,g} M_g \Big) l^t - \Big( \sum_{(J,g)} q^t_{J,g}\, J(t) \Big) \big(p^{t\top} l^t\big)$
  $= \Big( \sum_{(J,g)} q^t_{J,g}\, J(t) \Big) \big(p^{t\top} l^t\big) - \Big( \sum_{(J,g)} q^t_{J,g}\, J(t) \Big) \big(p^{t\top} l^t\big) = 0$.

(Case 2:) Assume $\sum_{(J,g)} J(t)\, q^t_{J,g} = 0$. Then $J(t)\, q^t_{J,g} = 0$ for all pairs $(J, g)$ since $J(t)$ and $q^t_{J,g}$ are all non-negative, which implies $\sum_{(J,g) \in S} q^t_{J,g}\, l^t_{J,g} = 0$.

It can be seen easily that Theorem 3 and Lemma 1 imply Theorem 2.
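To summarize the construction in the proof, the sketch below treats each pair $(I, f) \in S$ as an action of a black-box external-regret algorithm, plays a fixed point of the $I(t)$-weighted mixture of the matrices $M_f$, and reports the losses $l^t_{I,f} = I(t)\, p^{t\top}(M_f - I_N) l^t$ back to the black box. It reuses the ExpWeights and stationary_distribution sketches above; the class name and interface are assumptions for illustration, not part of the paper.

```python
class WideRangeRegretReduction:
    """Sketch of the reduction of Theorem 3: wide range regret via external regret.

    S         : list of (I, f) pairs; I maps 1-based time to [0, 1],
                f maps actions {0, ..., n_actions-1} to actions.
    ext       : black-box external-regret algorithm over |S| "actions"
                (e.g. the ExpWeights sketch above), tolerating losses in [-1, +1].
    Call play() at the start of each round, then observe(loss) with the adversary's l^t.
    """

    def __init__(self, S, n_actions, ext):
        self.S, self.n, self.ext = S, n_actions, ext
        self.t = 0

    def play(self):
        """Return the probability vector p^t over the real actions."""
        self.t += 1
        q = self.ext.probs()
        w = [I(self.t) * qi for (I, _f), qi in zip(self.S, q)]   # I(t) * q^t_{I,f}
        total = sum(w)
        if total == 0.0:
            self.p = [1.0 / self.n] * self.n        # any probability vector works here
            return self.p
        Q = [[0.0] * self.n for _ in range(self.n)]
        for wi, (_I, f) in zip(w, self.S):
            for i in range(self.n):
                Q[i][f(i)] += wi / total            # row-stochastic mixture of the M_f
        self.p = stationary_distribution(Q)         # p^t = p^t Q, as in (1)
        return self.p

    def observe(self, loss):
        """Feed the losses l^t_{I,f} back to the black box."""
        fed = []
        for (I, f) in self.S:
            mod = sum(self.p[a] * loss[f(a)] for a in range(self.n))  # p^t M_f l^t
            alg = sum(pa * la for pa, la in zip(self.p, loss))        # p^t . l^t
            fed.append(I(self.t) * (mod - alg))                       # in [-1, +1]
        self.ext.update(fed)
```

Instantiated with the ExpWeights sketch as ext, this mirrors Figure 1: play() produces $p^t$ from $q^t$, and observe() turns $l^t$ into the losses handed back to $H_{ext}$.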
5 A First-Order Bound for Wide Range Regret

If we are only concerned with regret bounds as a function of $T$ and $N$ (called "zero-order" bounds in Cesa-Bianchi et al. [CBMS05]), Theorem 2 matches (up to a constant) the known upper bounds for external, internal and swap regret. One can also try to obtain "first-order" bounds, bounds that depend on the sum of the losses of actions instead of the time. For example, Blum and Mansour [BM07a] show an $O(\sqrt{L_{min} \log(NM)} + \log(NM))$ upper bound for minimizing external regret with respect to a set $\mathcal{I}$ of $M$ time selection functions, where $L_{min} = \max_I \min_a L_{I,a}$ and $L_{I,a} = \sum_t I(t)\, l^t_a$. For the case when there is at least one "real" expert that does well most of the time, such a bound will be much tighter than a zero-order bound. One can hope to use external regret algorithms with good first-order bounds like the following to come up with good first-order bounds for wide range regret.

Lemma 4 (Cesa-Bianchi et al. [CBFH+97]) There exists an algorithm with running time polynomial in $T$ and $N$ and external regret $O(\sqrt{L_{min} \log N} + \log N)$ when the losses are picked from $[0, 1]$.

We need an algorithm that can handle losses from the interval $[-1, +1]$ in Theorem 3. One way to use the algorithm from Lemma 4 is to map the losses $l^t_{I,f}$ to the interval $[0, 1]$ by a linear transformation. But this also changes the loss of the best action and makes the first-order bound obtained very weak. Another alternative is to tinker with the quantity that $l^t_{I,f}$ signifies. If we are concerned only with modification rules (and not time selection functions), we can redefine $l^t_f$ as $l^t_f = \sum_a p^t_a\, l^t_{f(a)}$. But for technical reasons, this can't be done if we are also working with time selection functions. Note that the only term in (4) that depends on $I$ and $f$ is $l^t_{I,f}$, and hence it must capture all the terms that depend on either $I$ or $f$ in the definition of $R_{H,I,f}$. So we give a method based on the approach of Blum and Mansour [BM07a]. The main idea is to define a reduced regret for each pair $(I, f)$.

Theorem 5 There exists an online algorithm that for any $S$ satisfies:
· The wide range regret with respect to $S$ is at most $O(\sqrt{L_{min} \log |S|} + \log |S|)$, where $L_{min} = \max_I \min_{f : (I,f) \in S} \sum_t I(t)\, p^{t\top} M_f l^t$.
· The running time is polynomial in $T$, $N$, and $|S|$.

Proof: Define the loss of $H$ with respect to $I$ till time $t$ as

  $L^t_{H,I} = \sum_{t'=1}^{t} I(t')\, p^{t'} \cdot l^{t'}$,

and the loss of $H$ with respect to $(I, f)$ till time $t$ as

  $L^t_{H,I,f} = \sum_{t'=1}^{t} I(t')\, p^{t'\top} M_f l^{t'}$.

We assume that at any time $t$, not all $I(t)$ are zero. This is without loss of generality since in this case, the losses defined above don't change at time $t$. For some $\beta \in (0, 1)$ to be fixed later, we basically run an exponentially weighted predictor with a weight for each pair $(I, f)$. The weight of $(I, f)$ at the end of time $t$ is $w^t_{I,f} = \beta^{-\tilde{R}^t_{H,I,f}}$, where

  $\tilde{R}^t_{H,I,f} = \beta L^t_{H,I} - L^t_{H,I,f}$.

That is, $\tilde{R}^t_{H,I,f}$ is a regret of $H$ with respect to $(I, f)$ where the incurred loss is reduced by a factor $\beta$. We define $q^t_{I,f} = w^{t-1}_{I,f} / W^{t-1}$, where $W^{t-1} = \sum_{(I,f) \in S} w^{t-1}_{I,f}$ is the sum of the weights.

At time $t$, the algorithm does the following. It computes $q^t_{I,f}$ as above. The probability vector $p^t$ over the actual actions is picked as in (2). This is well defined since the $w^t_{I,f}$ (and hence the $q^t_{I,f}$) are all non-zero and at least one of the $I(t)$ is also non-zero (by assumption). Then the algorithm updates all the losses and weights when it gets back $l^t$ from the adversary.
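As a concrete rendering of this weighting scheme, the sketch below keeps one weight per pair in $S$ and updates it multiplicatively each round. It reuses the stationary_distribution helper sketched in Section 3; the class name, interface, and the externally supplied beta are assumptions made for illustration (the proof below chooses $\beta$ in terms of $L_{min}$ and $|S|$).

```python
class FirstOrderWideRange:
    """Sketch of the exponentially weighted scheme in the proof of Theorem 5.

    Keeps one weight w_{I,f} = beta ** (-(beta * L_{H,I} - L_{H,I,f})) per pair in S,
    updated incrementally; beta in (0, 1) is assumed to be tuned as in the proof.
    """

    def __init__(self, S, n_actions, beta):
        self.S, self.n, self.beta = S, n_actions, beta
        self.w = [1.0] * len(S)      # w^0_{I,f} = 1
        self.t = 0

    def play(self):
        self.t += 1
        W = sum(self.w)
        q = [wi / W for wi in self.w]                           # q^t_{I,f}
        mix = [I(self.t) * qi for (I, _f), qi in zip(self.S, q)]
        total = sum(mix)                                        # > 0 by assumption
        Q = [[0.0] * self.n for _ in range(self.n)]
        for mi, (_I, f) in zip(mix, self.S):
            for i in range(self.n):
                Q[i][f(i)] += mi / total
        self.p = stationary_distribution(Q)                     # p^t as in (2)
        return self.p

    def observe(self, loss):
        alg = sum(pa * la for pa, la in zip(self.p, loss))      # p^t . l^t
        for k, (I, f) in enumerate(self.S):
            mod = sum(self.p[a] * loss[f(a)] for a in range(self.n))   # p^t M_f l^t
            # w_{I,f} <- w_{I,f} * beta ** (I(t) * (mod - beta * alg))
            self.w[k] *= self.beta ** (I(self.t) * (mod - self.beta * alg))
```

The exponent $I(t)\big(p^{t\top} M_f l^t - \beta\, p^t \cdot l^t\big)$ used in observe() is exactly the one-step increase of $\tilde{R}^t_{H,I,f}$, which is the quantity the weights track.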
We first show that the sum of the weights cannot increase at any time.

Claim 6 For all $t$: $\sum_{(I,f) \in S} w^t_{I,f} \le \sum_{(I,f) \in S} w^{t-1}_{I,f}$.

Proof: We will use the fact that for any $\beta \in (0, 1)$ and $x \in [0, 1]$, $\beta^x \le 1 - (1-\beta)x$ and $\beta^{-x} \le 1 + (1-\beta)x/\beta$. This gives

  $\sum_{(I,f) \in S} w^t_{I,f} = \sum_{(I,f)} w^{t-1}_{I,f}\, \beta^{I(t)\left(p^{t\top} M_f l^t - \beta\, p^t \cdot l^t\right)}$
  $\le \sum_{(I,f)} w^{t-1}_{I,f} \big(1 - (1-\beta)\, I(t)\, p^{t\top} M_f l^t\big)\big(1 + (1-\beta)\, I(t)\, p^t \cdot l^t\big)$
  $\le \sum_{(I,f)} w^{t-1}_{I,f} - (1-\beta)\, W^{t-1} \sum_{(I,f)} q^t_{I,f}\, I(t)\, p^{t\top} M_f l^t + (1-\beta)\, W^{t-1} \sum_{(I,f)} q^t_{I,f}\, I(t)\, p^t \cdot l^t$
  $= \sum_{(I,f)} w^{t-1}_{I,f} - (1-\beta)\, W^{t-1} p^{t\top} \Big( \sum_{(I,f)} I(t)\, q^t_{I,f} M_f \Big) l^t + (1-\beta)\, W^{t-1} \Big( \sum_{(I,f)} q^t_{I,f}\, I(t) \Big) \big(p^t \cdot l^t\big)$
  $= \sum_{(I,f)} w^{t-1}_{I,f}$.

Above, the second inequality follows from the definition of $q^t_{I,f}$ and the last equality follows from (2).

We now get back to the proof of the theorem. The claim implies that for all $(I, f) \in S$,

  $\beta^{-(\beta L^T_{H,I} - L^T_{H,I,f})} = \beta^{-\tilde{R}^T_{H,I,f}} = w^T_{I,f} \le \sum_{(J,g) \in S} w^T_{J,g} \le \sum_{(J,g) \in S} w^0_{J,g} = |S|$,

which gives

  $\big(\beta L^T_{H,I} - L^T_{H,I,f}\big) \log(1/\beta) \le \log |S|$,  or  $\beta L_{H,I} \le L_{H,I,f} + \dfrac{\log |S|}{\log(1/\beta)}$.

Since for a given $I$, the statement is true for all $f$ such that $(I, f) \in S$, we can rewrite it as

  $\beta L_{H,I} \le L_{H,I,min} + \dfrac{\log |S|}{\log(1/\beta)}$,  where  $L_{H,I,min} = \min_{f : (I,f) \in S} L_{H,I,f}$.

Setting $\beta$ so that

  $\beta^{-1} = 1 + \min\left\{ \sqrt{\dfrac{\log |S|}{L_{min}}},\ \dfrac{1}{2} \right\}$

gives the theorem.

Acknowledgments

The authors would like to thank Yishay Mansour for comments on an early draft of the paper.

References

[BM07a] Avrim Blum and Yishay Mansour. From external to internal regret. J. Mach. Learn. Res., 8:1307-1324, 2007.
[BM07b] Avrim Blum and Yishay Mansour. Learning, regret minimization and equilibria. In Noam Nisan, Tim Roughgarden, Eva Tardos, and Vijay Vazirani, editors, Algorithmic Game Theory, chapter 4. Cambridge University Press, 2007.
[CBFH+97] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. J. ACM, 44(3):427-485, 1997.
[CBMS05] Nicolò Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. In COLT, pages 217-232, 2005.
[FSSW97] Yoav Freund, Robert E. Schapire, Yoram Singer, and Manfred K. Warmuth. Using and combining predictors that specialize. In STOC '97: Proceedings of the twenty-ninth annual ACM symposium on Theory of Computing, pages 334-343, 1997.
[FV98] Dean Foster and Rakesh V. Vohra. Asymptotic calibration. Biometrika, 85:379-390, 1998.
[Han57] J. Hannan. Approximation to Bayes risk in repeated plays. In M. Dresher, A. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume 3, pages 97-139. Princeton University Press, 1957.
[Leh03] Ehud Lehrer. A wide range no-regret theorem. Games and Economic Behavior, 42(1):101-115, 2003.
[LW94] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Inf. Comput., 108(2):212-261, 1994.
[SL05] Gilles Stoltz and Gábor Lugosi. Internal regret in on-line portfolio selection. Mach. Learn., 59(1-2):125-159, 2005.