Progressive mixture rules are deviation suboptimal

Jean-Yves Audibert
Willow Project - Certis Lab
ParisTech, École des Ponts
77455 Marne-la-Vallée, France
audibert@certis.enpc.fr

Abstract

We consider the learning task consisting in predicting as well as the best function in a finite reference set G, up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), it is known that the progressive mixture rule $\hat g$ satisfies

$$\mathbb{E}\, R(\hat g) \;\le\; \min_{g \in G} R(g) + \mathrm{Cst}\,\frac{\log|G|}{n}, \qquad (1)$$

where n denotes the size of the training set and $\mathbb{E}$ denotes the expectation w.r.t. the training set distribution. This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is no better than $\mathrm{Cst}/\sqrt{n}$: it fails to achieve the expected $\mathrm{Cst}/n$ rate. We also provide an algorithm which does not suffer from this drawback.

1 Introduction

Why are we concerned with deviations? The efficiency of an algorithm can be summarized by its expected risk, but this does not capture the fluctuations of its risk. In several application fields of learning algorithms, these fluctuations play a key role: in finance for instance, the larger the possible losses, the more money the bank needs to freeze in order to cover them. In this context, a "good" algorithm is one having not only a low expected risk but also small deviations.

Why are we interested in the learning task of doing as well as the best prediction function of a given set? First, one way of doing model selection among a finite family of submodels is to cut the training set into two parts, use the first part to learn the best prediction function of each submodel, and use the second part to learn a prediction function which performs as well as the best of the prediction functions learned on the first part of the training set. This scheme is very powerful since it leads to theoretical results which, in most situations, would be very hard to prove without it. Our work here is related to the second step of this scheme.

Secondly, assume we want to predict the value of a continuous variable and that we have many candidates for explaining it. An input point can then be seen as the vector containing the prediction of each candidate. The problem is what to do when the dimensionality d of the input data (equivalently, the number of prediction functions) is much higher than the number of training points n. In this setting, one cannot use linear regression and its variants in order to predict as well as the best candidate up to a small additive term. Besides, (penalized) empirical risk minimization is doomed to be suboptimal (see the second part of Theorem 2 and also [1]).

As far as the expected risk is concerned, the only known correct way of predicting as well as the best prediction function is to use the progressive mixture rule or its variants. These algorithms are introduced in Section 2 and their main good property is given in Theorem 1. In this work we prove that they do not behave well as far as risk deviations are concerned (see the second part of Theorem 3). We also provide a new algorithm for this "predict as well as the best" problem (see the end of Section 4).

2 The progressive mixture rule and its variants
We assume that we observe n pairs of input-output, denoted $Z_1 = (X_1, Y_1), \dots, Z_n = (X_n, Y_n)$, and that each pair has been independently drawn from the same unknown distribution denoted P. The input and output spaces are denoted respectively X and Y, so that P is a probability distribution on the product space $Z \equiv X \times Y$. The quality of a (prediction) function $g: X \to Y$ is measured by the risk (or generalization error)

$$R(g) = \mathbb{E}_{(X,Y)\sim P}\, \ell[Y, g(X)],$$

where $\ell[Y, g(X)]$ denotes the loss (possibly infinite) incurred by predicting $g(X)$ when the true output is Y. We work under the following assumptions on the data space and on the loss function $\ell: Y \times Y \to \mathbb{R} \cup \{+\infty\}$.

Main assumptions. The input space is assumed to be infinite: $|X| = +\infty$. The output space is a non-trivial (i.e. infinite) interval of $\mathbb{R}$, symmetric w.r.t. some $a \in \mathbb{R}$: for any $y \in Y$, we have $2a - y \in Y$. The loss function $\ell$ is

· uniformly exp-concave: there exists $\lambda > 0$ such that for any $y \in Y$, the set $\{y' \in \mathbb{R} : \ell(y, y') < +\infty\}$ is an interval containing a on which the function $y' \mapsto e^{-\lambda \ell(y, y')}$ is concave,
· symmetric: for any $y_1, y_2 \in Y$, $\ell(y_1, y_2) = \ell(2a - y_1, 2a - y_2)$,
· admissible: for any $y, y' \in Y \cap\, ]a; +\infty[$, $\ell(y, 2a - y') > \ell(y, y')$,
· well behaved at the center: for any $y \in Y \cap\, ]a; +\infty[$, the function $\ell_y : y' \mapsto \ell(y, y')$ is twice continuously differentiable on a neighborhood of a and $\ell_y'(a) < 0$.

These assumptions imply that

· Y necessarily has one of the following forms: $]-\infty; +\infty[$, $[a - \eta; a + \eta]$ or $]a - \eta; a + \eta[$ for some $\eta > 0$,
· for any $y \in Y$, from the exp-concavity assumption, the function $\ell_y : y' \mapsto \ell(y, y')$ is convex on the interval on which it is finite¹. As a consequence, the risk R is also a convex function (on the convex set of prediction functions for which it is finite).

¹ Indeed, if $\zeta_y$ denotes the function $y' \mapsto e^{-\lambda \ell(y, y')}$, then from Jensen's inequality, for any probability distribution of $Y'$, $\mathbb{E}\, \ell_y(Y') = \mathbb{E}\big[-\tfrac1\lambda \log \zeta_y(Y')\big] \ge -\tfrac1\lambda \log \mathbb{E}\, \zeta_y(Y') \ge -\tfrac1\lambda \log \zeta_y(\mathbb{E} Y') = \ell_y(\mathbb{E} Y')$.

The assumptions were motivated by the fact that they are satisfied in the following settings:

· least square loss with bounded outputs: $Y = [y_{\min}; y_{\max}]$ and $\ell(y_1, y_2) = (y_1 - y_2)^2$. Then we have $a = (y_{\min} + y_{\max})/2$ and may take $\lambda = 1/[2(y_{\max} - y_{\min})^2]$.
· entropy loss: $Y = [0; 1]$ and $\ell(y_1, y_2) = y_1 \log\frac{y_1}{y_2} + (1 - y_1) \log\frac{1 - y_1}{1 - y_2}$. Note that $\ell(0, 1) = \ell(1, 0) = +\infty$. Then we have $a = 1/2$ and may take $\lambda = 1$.
· exponential (or AdaBoost) loss: $Y = [-y_{\max}; y_{\max}]$ and $\ell(y_1, y_2) = e^{-y_1 y_2}$. Then we have $a = 0$ and may take $\lambda = e^{-y_{\max}^2}$.
· logit loss: $Y = [-y_{\max}; y_{\max}]$ and $\ell(y_1, y_2) = \log(1 + e^{-y_1 y_2})$. Then we have $a = 0$ and may take $\lambda = e^{-y_{\max}^2}$.

Progressive indirect mixture rule. Let G be a finite reference set of prediction functions. Under the previous assumptions, the only known algorithms satisfying (1) are the progressive indirect mixture rules defined below.

For any $i \in \{0, \dots, n\}$, the cumulative loss suffered by the prediction function g on the first i pairs of input-output is

$$\Sigma_i(g) \equiv \sum_{j=1}^{i} \ell[Y_j, g(X_j)],$$

where by convention we take $\Sigma_0 \equiv 0$. Let $\pi$ denote the uniform distribution on G. We define the probability distribution $\hat\pi_i$ on G as $\hat\pi_i \propto e^{-\lambda \Sigma_i} \cdot \pi$, equivalently, for any $g \in G$,

$$\hat\pi_i(g) = \frac{e^{-\lambda \Sigma_i(g)}}{\sum_{g' \in G} e^{-\lambda \Sigma_i(g')}}.$$

This distribution concentrates on functions having a low cumulative loss up to time i. For any $i \in \{0, \dots, n\}$, let $\hat h_i$ be a prediction function such that, for any $(x, y) \in Z$,

$$\ell[y, \hat h_i(x)] \;\le\; -\tfrac1\lambda \log \mathbb{E}_{g \sim \hat\pi_i}\, e^{-\lambda \ell[y, g(x)]}. \qquad (2)$$

The progressive indirect mixture rule produces the prediction function

$$\hat g_{\mathrm{pim}} = \tfrac{1}{n+1} \sum_{i=0}^{n} \hat h_i.$$

From the uniform exp-concavity assumption and Jensen's inequality, $\hat h_i$ does exist since one may take $\hat h_i = \mathbb{E}_{g \sim \hat\pi_i}\, g$. This particular choice leads to the progressive mixture rule, for which the predicted output at any $x \in X$ is

$$\hat g_{\mathrm{pm}}(x) = \frac{1}{n+1} \sum_{i=0}^{n} \frac{\sum_{g \in G} e^{-\lambda \Sigma_i(g)}\, g(x)}{\sum_{g' \in G} e^{-\lambda \Sigma_i(g')}}.$$
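As an illustration (not part of the original text), here is a minimal sketch of the progressive mixture rule for the bounded least square loss, i.e. the choice $\hat h_i = \mathbb{E}_{g\sim\hat\pi_i} g$. The function name, the reference functions and the data below are hypothetical; only the formula for $\lambda$ follows the least-square setting listed above.

```python
import numpy as np

def progressive_mixture_predict(X, Y, G, x_new, lam):
    """Minimal progressive mixture rule for the least square loss: average over
    i = 0..n of the Gibbs-weighted mean of the reference predictions, the
    weights being based on cumulative losses."""
    n = len(Y)
    preds = np.array([[g(x) for x in X] for g in G])        # |G| x n reference predictions
    g_new = np.array([g(x_new) for g in G])                 # reference predictions at the query point
    cum_losses = np.zeros(len(G))                           # Sigma_0 = 0 by convention
    total = 0.0
    for i in range(n + 1):
        w = np.exp(-lam * (cum_losses - cum_losses.min()))  # Gibbs weights pi_hat_i (shifted for stability)
        w /= w.sum()
        total += np.dot(w, g_new)                           # h_hat_i(x_new) = E_{g ~ pi_hat_i} g(x_new)
        if i < n:
            cum_losses += (preds[:, i] - Y[i]) ** 2         # add the least square loss on the (i+1)-th pair
    return total / (n + 1)

# Illustrative usage with two constant reference functions (hypothetical setup).
rng = np.random.default_rng(0)
X = rng.normal(size=50)
Y = np.clip(0.3 + 0.2 * rng.normal(size=50), -1.0, 1.0)     # bounded outputs in [-1, 1]
G = [lambda x: 0.3, lambda x: -0.3]
lam = 1.0 / (2 * (1.0 - (-1.0)) ** 2)                       # lambda = 1/[2(y_max - y_min)^2]
print(progressive_mixture_predict(X, Y, G, x_new=0.0, lam=lam))
```

Shifting the cumulative losses by their minimum before exponentiating is only a numerical-stability device; it leaves $\hat\pi_i$ unchanged.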
Consequently, any result that holds for any progressive indirect mixture rule holds in particular for the progressive mixture rule.

The idea of a progressive mean of estimators was introduced by Barron ([2]) in the context of density estimation with Kullback-Leibler loss. The form $\hat g_{\mathrm{pm}}$ is due to Catoni ([3]). It was also independently proposed in [4]. The study of this procedure was carried out for density estimation and least square regression in [5, 6, 7, 8]. Results for general losses can be found in [9, 10]. Finally, the progressive indirect mixture rule is inspired by the work of Vovk, Haussler, Kivinen and Warmuth [11, 12, 13] on sequential prediction, and was studied in the "batch" setting in [10].

Note that in the upper bounds we state, e.g. Inequality (1), there is no constant larger than 1 in front of $\min_{g \in G} R(g)$, as opposed to some existing upper bounds (e.g. [14]). This work really studies the behaviour of the excess risk, that is the random variable $R(\hat g) - \min_{g \in G} R(g)$. The largest integer smaller than or equal to the logarithm in base 2 of x is denoted by $\lfloor \log_2 x \rfloor$.

3 Expectation convergence rate

The following theorem, whose proof is omitted, shows that the expectation convergence rate of any progressive indirect mixture rule is (i) at most of order $(\log|G|)/n$ and (ii) cannot be uniformly improved, even when we consider only probability distributions on Z for which the output has almost surely two symmetric values (e.g. $\{-1;+1\}$ classification with exponential or logit losses).

Theorem 1 Any progressive indirect mixture rule satisfies

$$\mathbb{E}\, R(\hat g_{\mathrm{pim}}) \;\le\; \min_{g \in G} R(g) + \frac{\log|G|}{\lambda (n+1)}.$$

Let $y_1 \in Y - \{a\}$ and let d be a positive integer. There exists a set G of d prediction functions such that: for any learning algorithm, there exists a probability distribution generating the data for which

· the output marginal is supported by $2a - y_1$ and $y_1$: $P(Y \in \{2a - y_1; y_1\}) = 1$,
· $\mathbb{E}\, R(\hat g) \;\ge\; \min_{g \in G} R(g) + \tilde\varepsilon\, e^{-1} \Big( \frac{\log|G|}{n+2} \wedge 1 \Big)$, with $\tilde\varepsilon \equiv \sup_{y' \in Y} \big[\ell(y_1, a) - \ell(y_1, y')\big] > 0$.

The second part of Theorem 1 has the same $(\log|G|)/n$ rate as the lower bounds obtained in sequential prediction ([12]). From the link between sequential prediction and our "batch" setting with i.i.d. data (see e.g. [10, Lemma 3]), upper bounds for sequential prediction lead to upper bounds for i.i.d. data, and lower bounds for i.i.d. data lead to lower bounds for sequential prediction. The converse of this last assertion is not true, so that the second part of Theorem 1 is not a consequence of the lower bounds of [12].

The following theorem, whose proof is also omitted, shows that for an appropriate set G: (i) the empirical risk minimizer has a $\sqrt{(\log|G|)/n}$ expectation convergence rate, and (ii) any empirical risk minimizer and any of its penalized variants are really poor algorithms in our learning task, since their expectation convergence rate cannot be faster than $\sqrt{(\log|G|)/n}$ (see [5, p.14] and [1] for results of the same spirit). This last point explains the interest we have in progressive mixture rules.

Theorem 2 If $B \equiv \sup_{y, y', y'' \in Y} [\ell(y, y') - \ell(y, y'')] < +\infty$, then any empirical risk minimizer, i.e. any algorithm producing a prediction function $\hat g_{\mathrm{erm}} \in \mathrm{argmin}_{g \in G} \Sigma_n(g)$, satisfies

$$\mathbb{E}\, R(\hat g_{\mathrm{erm}}) \;\le\; \min_{g \in G} R(g) + B \sqrt{\frac{2 \log|G|}{n}}.$$

Let $y_1, \tilde y_1 \in Y \cap\, ]a; +\infty[$ and let d be a positive integer.
There exists a set G of d prediction functions such that: for any learning algorithm producing a prediction function in G (e.g. $\hat g_{\mathrm{erm}}$), there exists a probability distribution generating the data for which

· the output marginal is supported by $2a - y_1$ and $y_1$: $P(Y \in \{2a - y_1; y_1\}) = 1$,
· $\mathbb{E}\, R(\hat g) \;\ge\; \min_{g \in G} R(g) + \frac{\tilde\varepsilon}{8} \Big( \sqrt{\frac{\log|G|}{2n}} \wedge 1 \Big)$, with $\tilde\varepsilon \equiv \ell(y_1, 2a - \tilde y_1) - \ell(y_1, \tilde y_1) > 0$.

The lower bound of Theorem 2 also says that one should not use cross-validation. This holds for the loss functions considered in this work, and not for, e.g., the classification loss $\ell(y, y') = \mathbb{1}_{y \neq y'}$.

4 Deviation convergence rate

The following theorem shows that the deviation convergence rate of any progressive indirect mixture rule is (i) at most of order $1/\sqrt{n}$ and (ii) cannot be uniformly improved, even when we consider only probability distributions on Z for which the output has almost surely two symmetric values (e.g. $\{-1;+1\}$ classification with exponential or logit losses).

Theorem 3 If $B \equiv \sup_{y, y', y'' \in Y} [\ell(y, y') - \ell(y, y'')] < +\infty$, then any progressive indirect mixture rule satisfies: for any $\epsilon > 0$, with probability at least $1 - \epsilon$ w.r.t. the training set distribution, we have

$$R(\hat g_{\mathrm{pim}}) \;\le\; \min_{g \in G} R(g) + B \sqrt{\frac{2 \log(2\epsilon^{-1})}{n+1}} + \frac{\log|G|}{\lambda(n+1)}.$$

Let $y_1$ and $\tilde y_1$ in $Y \cap\, ]a; +\infty[$ be such that $\ell_{y_1}$ is twice continuously differentiable on $[a; \tilde y_1]$, $\ell_{y_1}'(\tilde y_1) \ge 0$ and $\ell_{y_1}''(\tilde y_1) > 0$. Consider the prediction functions $g_1 \equiv \tilde y_1$ and $g_2 \equiv 2a - \tilde y_1$. For any training set size n large enough, there exist $\epsilon > 0$ and a probability distribution generating the data such that

· the output marginal is supported by $y_1$ and $2a - y_1$,
· with probability larger than $\epsilon$, we have

$$R(\hat g_{\mathrm{pim}}) - \min_{g \in \{g_1, g_2\}} R(g) \;\ge\; c \sqrt{\frac{\log(e\epsilon^{-1})}{n}},$$

where c is a positive constant depending only on the loss function $\ell$, the symmetry parameter a and the output values $y_1$ and $\tilde y_1$.

Proof. See Section 5.

This result is quite surprising, since it gives an example of an algorithm which is optimal in terms of expectation convergence rate but whose deviation convergence rate is (significantly) worse than its expectation convergence rate.

In fact, despite their popularity, based on their unique expectation convergence rate, the progressive mixture rules are not good algorithms, since a long argument essentially based on convexity shows that the following algorithm has both expectation and deviation convergence rates of order 1/n. Let $\hat g_{\mathrm{erm}}$ be the minimizer of the empirical risk among the functions of G. Let $\tilde g$ be the minimizer of the empirical risk in the star $G^* = \bigcup_{g \in G} [g; \hat g_{\mathrm{erm}}]$. The algorithm producing $\tilde g$ satisfies, for some $C > 0$, for any $\epsilon > 0$, with probability at least $1 - \epsilon$ w.r.t. the training set distribution,

$$R(\tilde g) \;\le\; \min_{g \in G} R(g) + C\, \frac{\log(\epsilon^{-1}|G|)}{n}.$$

This algorithm also has the benefit of being parameter-free. On the contrary, in practice, one would have recourse to cross-validation to tune the parameter $\lambda$ of the progressive mixture rule.

To summarize, to predict as well as the best prediction function in a given set G, one should not restrain the algorithm to produce its prediction function inside the set G. The progressive mixture rule satisfies this principle since it produces a prediction function in the convex hull of G. This allows it to achieve $(\log|G|)/n$ convergence rates in expectation. Yet the convex hull seems to be too large a set to consider, since (i) the progressive mixture rule does not achieve the $\log(\epsilon^{-1}|G|)/n$ deviation convergence rate, and (ii) this rate is achievable by an algorithm producing prediction functions only on the edges of the convex hull.
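The star-shaped alternative just described can be sketched as follows, again for the least square loss. This is only an illustration under stated assumptions: the paper gives no implementation, the grid search over each segment is a simplification introduced here, and the function names and sizes are hypothetical.

```python
import numpy as np

def star_estimator(X, Y, G, grid_size=101):
    """Empirical risk minimization over the star G* = union over g in G of the
    segments [g, g_erm], for the least square loss. Returns the selected
    convex combination as a new prediction function."""
    Y = np.asarray(Y)
    preds = np.array([[g(x) for x in X] for g in G])        # |G| x n matrix of predictions
    emp_risk = np.mean((preds - Y) ** 2, axis=1)
    erm = int(np.argmin(emp_risk))                          # index of g_erm in G
    best_alpha, best_j, best_risk = 0.0, erm, emp_risk[erm]
    alphas = np.linspace(0.0, 1.0, grid_size)
    for j in range(len(G)):                                 # scan each segment [g_j, g_erm]
        mix = alphas[:, None] * preds[j] + (1.0 - alphas[:, None]) * preds[erm]
        risks = np.mean((mix - Y) ** 2, axis=1)
        k = int(np.argmin(risks))
        if risks[k] < best_risk:
            best_alpha, best_j, best_risk = alphas[k], j, risks[k]
    return lambda x: best_alpha * G[best_j](x) + (1.0 - best_alpha) * G[erm](x)
```

For the least square loss, the minimization in $\alpha$ over each segment is a one-dimensional quadratic problem, so the grid could be replaced by its closed-form minimizer; the grid is kept here only to keep the sketch loss-agnostic.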
Future work might look at whether one can transpose this algorithm to the sequential prediction setting, in which, up to now, the algorithms predicting as well as the best expert were dominated by algorithms producing a mixture expert inside the convex hull of the set of experts.

5 Proof of Theorem 3

5.1 Proof of the upper bound

Let $Z_{n+1} = (X_{n+1}, Y_{n+1})$ be an input-output pair independent of the training set $Z_1, \dots, Z_n$ and with the same distribution P. From the convexity of $y' \mapsto \ell(y, y')$, we have

$$R(\hat g_{\mathrm{pim}}) \;\le\; \tfrac{1}{n+1} \sum_{i=0}^{n} R(\hat h_i). \qquad (3)$$

Now from [15, Theorem 1] (see also [16, Proposition 1]), for any $\epsilon > 0$, with probability at least $1 - \epsilon$, we have

$$\tfrac{1}{n+1} \sum_{i=0}^{n} R(\hat h_i) \;\le\; \tfrac{1}{n+1} \sum_{i=0}^{n} \ell\big[Y_{i+1}, \hat h_i(X_{i+1})\big] + B \sqrt{\tfrac{\log(\epsilon^{-1})}{2(n+1)}}. \qquad (4)$$

Using [12, Theorem 3.8] and the exp-concavity assumption, we have

$$\sum_{i=0}^{n} \ell\big[Y_{i+1}, \hat h_i(X_{i+1})\big] \;\le\; \min_{g \in G} \sum_{i=0}^{n} \ell\big[Y_{i+1}, g(X_{i+1})\big] + \frac{\log|G|}{\lambda}. \qquad (5)$$

Let $\tilde g \in \mathrm{argmin}_{g \in G} R(g)$. By Hoeffding's inequality, with probability at least $1 - \epsilon$, we have

$$\tfrac{1}{n+1} \sum_{i=0}^{n} \ell\big[Y_{i+1}, \tilde g(X_{i+1})\big] \;\le\; R(\tilde g) + B \sqrt{\tfrac{\log(\epsilon^{-1})}{2(n+1)}}. \qquad (6)$$

Merging (3), (4), (5) and (6), with probability at least $1 - 2\epsilon$, we get

$$R(\hat g_{\mathrm{pim}}) \;\le\; \tfrac{1}{n+1} \sum_{i=0}^{n} \ell\big[Y_{i+1}, \tilde g(X_{i+1})\big] + \frac{\log|G|}{\lambda(n+1)} + B \sqrt{\tfrac{\log(\epsilon^{-1})}{2(n+1)}} \;\le\; R(\tilde g) + B \sqrt{\tfrac{2\log(\epsilon^{-1})}{n+1}} + \frac{\log|G|}{\lambda(n+1)},$$

which gives the announced upper bound after replacing $\epsilon$ with $\epsilon/2$.

5.2 Sketch of the proof of the lower bound

We cannot use standard tools like Assouad's argument (see e.g. [17, Theorem 14.6]): if it were possible, it would mean that the lower bound holds for any algorithm, in particular for $\tilde g$, and this is false. To prove that progressive indirect mixture rules admit no fast exponential deviation inequality, we will show that, on some event of not too small probability, for most of the i in $\{0, \dots, n\}$, $\hat\pi_i$ concentrates on the wrong function.

The proof is organized as follows. First we define the probability distribution for which we will prove that the progressive indirect mixture rules cannot have a fast deviation convergence rate. Then we define the event on which the progressive indirect mixture rules do not perform well. We lower bound the probability of this excursion event. Finally we conclude by lower bounding $R(\hat g_{\mathrm{pim}})$ on the excursion event.

Before starting the proof, note that from the "well behaved at the center" and exp-concavity assumptions, for any $y \in Y \cap\, ]a; +\infty[$, on a neighborhood of a we have $\ell_y'' \ge \lambda (\ell_y')^2$, and since $\ell_y'(a) < 0$, outputs $y_1$ and $\tilde y_1$ as required in Theorem 3 do exist. Due to limited space, some technical computations have been removed.

5.2.1 Probability distribution generating the data and first consequences

Let $\delta \in\, ]0; 1]$ be a parameter to be tuned later. We consider a distribution generating the data such that the output distribution satisfies, for any $x \in X$,

$$P(Y = y_1 \mid X = x) = \frac{1+\delta}{2} = 1 - P(Y = y_2 \mid X = x),$$

where $y_2 \equiv 2a - y_1$. Let $\tilde y_2 \equiv 2a - \tilde y_1$. From the symmetry and admissibility assumptions, we have

$$\ell(y_2, \tilde y_2) = \ell(y_1, \tilde y_1) < \ell(y_1, \tilde y_2) = \ell(y_2, \tilde y_1). \qquad (7)$$

Introduce $\psi \equiv \ell(y_1, \tilde y_2) - \ell(y_1, \tilde y_1) > 0$. We have

$$R(g_2) - R(g_1) = \frac{1+\delta}{2}\big[\ell(y_1, \tilde y_2) - \ell(y_1, \tilde y_1)\big] + \frac{1-\delta}{2}\big[\ell(y_2, \tilde y_2) - \ell(y_2, \tilde y_1)\big] = \delta\psi. \qquad (8)$$

Therefore $g_1$ is the best prediction function in $\{g_1, g_2\}$ for the chosen distribution. Introduce $W_j \equiv \mathbb{1}_{Y_j = y_1} - \mathbb{1}_{Y_j = y_2}$ and $S_i \equiv \sum_{j=1}^{i} W_j$. For any $i \in \{1, \dots, n\}$, we have

$$\Sigma_i(g_2) - \Sigma_i(g_1) = \sum_{j=1}^{i} \big[\ell(Y_j, \tilde y_2) - \ell(Y_j, \tilde y_1)\big] = \psi \sum_{j=1}^{i} W_j = \psi S_i.$$

The weight given by the Gibbs distribution $\hat\pi_i$ to the function $g_1$ is

$$\hat\pi_i(g_1) = \frac{e^{-\lambda \Sigma_i(g_1)}}{e^{-\lambda \Sigma_i(g_1)} + e^{-\lambda \Sigma_i(g_2)}} = \frac{1}{1 + e^{\lambda[\Sigma_i(g_1) - \Sigma_i(g_2)]}} = \frac{1}{1 + e^{-\lambda\psi S_i}}. \qquad (9)$$
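The role of $S_i$ in (9) can be visualized with a short simulation; this snippet is not part of the proof, and the numerical values of $n$, $\delta$, $\lambda$ and $\psi$ below are illustrative assumptions. When the walk makes a long negative excursion, the Gibbs weight of the better function $g_1$ collapses, which is exactly the excursion event constructed in the next subsection.

```python
import numpy as np

rng = np.random.default_rng(1)
n, delta, lam, psi = 1000, 0.05, 1.0, 0.5              # illustrative values only
W = np.where(rng.random(n) < (1 + delta) / 2, 1, -1)   # W_j = +1 w.p. (1+delta)/2, else -1
S = np.cumsum(W)                                       # S_i = W_1 + ... + W_i
pi_g1 = 1.0 / (1.0 + np.exp(-lam * psi * S))           # Gibbs weight of g1, Equation (9)
print("minimal weight of g1 along the path:", pi_g1.min())
print("fraction of steps with weight <= 1/(n+1):", np.mean(pi_g1 <= 1 / (n + 1)))
tau = int(np.ceil(np.log(n) / (lam * psi)))            # threshold of order (log n)/(lam*psi), parity ignored
print("long negative excursion below -tau from time tau to n:", bool(np.all(S[tau - 1:] <= -tau)))
```

On a typical run the last line prints False: such excursions are rare under the positive drift, which is why their probability has to be lower bounded carefully below.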
5.2.2 An excursion event on which the progressive indirect mixture rules will not perform well

Equality (9) leads us to consider the event

$$E \equiv \big\{ \forall\, i \in \{\tau, \dots, n\},\ S_i \le -\tau \big\},$$

with $\tau$ the smallest integer larger than $(\log n)/(\lambda\psi)$ such that $n - \tau$ is even (for convenience). We have

$$\frac{\log n}{\lambda\psi} \;\le\; \tau \;\le\; \frac{\log n}{\lambda\psi} + 2. \qquad (10)$$

The event E can be seen as an excursion event of the random walk defined through the random variables $W_j = \mathbb{1}_{Y_j = y_1} - \mathbb{1}_{Y_j = y_2}$, $j \in \{1, \dots, n\}$, which are equal to +1 with probability $(1+\delta)/2$ and to -1 with probability $(1-\delta)/2$. From (9), on the event E, for any $i \in \{\tau, \dots, n\}$, we have

$$\hat\pi_i(g_1) \;\le\; \frac{1}{n+1}. \qquad (11)$$

This means that $\hat\pi_i$ concentrates on the wrong function, i.e. the function $g_2$ having the larger risk (see (8)).

5.2.3 Lower bound on the probability of the excursion event

This requires looking at the probability that a slightly shifted random walk on the integers has a very long excursion below a certain threshold. To lower bound this probability, we first look at the non-shifted random walk. Then we will see that, for a small enough shift parameter, probabilities of shifted random walk events are close to the corresponding probabilities for the non-shifted random walk.

Let N be a positive integer. Let $\sigma_1, \dots, \sigma_N$ be N independent Rademacher variables: $P(\sigma_i = +1) = P(\sigma_i = -1) = 1/2$. Let $s_i \equiv \sum_{j=1}^{i} \sigma_j$ be the sum of the first i Rademacher variables. We start with the following lemma for sums of Rademacher variables (proof omitted).

Lemma 1 Let m and t be positive integers. We have

$$P\Big( \max_{1 \le k \le N} s_k < t;\ s_N = t - m \Big) \;=\; P(s_N = t - m) - P(s_N = t + m). \qquad (12)$$

Let $\sigma'_1, \dots, \sigma'_N$ be N independent shifted Rademacher variables, to the extent that $P(\sigma'_i = +1) = (1+\delta)/2 = 1 - P(\sigma'_i = -1)$. These random variables satisfy the following key lemma (proof omitted).

Lemma 2 For any set $A \subset \big\{ (\epsilon_1, \dots, \epsilon_N) \in \{-1, 1\}^N : \big|\sum_{i=1}^{N} \epsilon_i\big| \le M \big\}$, where M is a positive integer, we have

$$P\big( (\sigma'_1, \dots, \sigma'_N) \in A \big) \;\ge\; \Big(\frac{1-\delta}{1+\delta}\Big)^{M/2} (1 - \delta^2)^{N/2}\, P\big( (\sigma_1, \dots, \sigma_N) \in A \big). \qquad (13)$$

We may now lower bound the probability of the excursion event E. Let M be an integer larger than $\tau$. We still use $W_j \equiv \mathbb{1}_{Y_j = y_1} - \mathbb{1}_{Y_j = y_2}$ for $j \in \{1, \dots, n\}$. By using Lemma 2 with $N = n - 2\tau$, we obtain

$$P(E) \;\ge\; P\Big( W_1 = -1, \dots, W_{2\tau} = -1;\ \forall\, i \in \{2\tau+1, \dots, n\},\ \textstyle\sum_{j=2\tau+1}^{i} W_j \le \tau \Big) \;\ge\; \Big(\frac{1-\delta}{2}\Big)^{2\tau} P\Big( \forall\, i \in \{1, \dots, N\},\ \textstyle\sum_{j=1}^{i} W_j \le \tau \Big) \;\ge\; \Big(\frac{1-\delta}{2}\Big)^{2\tau} \Big(\frac{1-\delta}{1+\delta}\Big)^{M/2} (1-\delta^2)^{N/2}\, P\Big( |s_N| \le M;\ \forall\, i \in \{1, \dots, N\},\ s_i \le \tau \Big).$$

By using Lemma 1, since $M \ge \tau$, the r.h.s. probability can be lower bounded, and after some computations we obtain

$$P(E) \;\ge\; \tfrac12 \Big(\frac{1-\delta}{2}\Big)^{2\tau} \Big(\frac{1-\delta}{1+\delta}\Big)^{M/2} (1-\delta^2)^{N/2}\, \big[ P(s_N = \tau) - P(s_N = M) \big], \qquad (14)$$

where we recall that $\tau$ has the order of $\log n$, that $N = n - 2\tau$ has the order of n, and that $\delta > 0$ and M remain to be appropriately chosen. To control the probabilities on the r.h.s., we use Stirling's formula

$$n^n e^{-n} \sqrt{2\pi n}\, e^{1/(12n+1)} \;<\; n! \;<\; n^n e^{-n} \sqrt{2\pi n}\, e^{1/(12n)},$$

and get, for any $s \in [0; N]$ such that $N - s$ is even,

$$P(s_N = s) = \binom{N}{(N+s)/2}\, 2^{-N}, \qquad (15)$$

$$P(s_N = s) \;\ge\; \sqrt{\tfrac{2}{\pi N}}\, \Big(1 - \tfrac{s^2}{N^2}\Big)^{-N/2} \Big(\tfrac{1 - s/N}{1 + s/N}\Big)^{s/2} e^{-\frac{1}{6(N+s)} - \frac{1}{6(N-s)}}, \qquad (16)$$

and similarly

$$P(s_N = s) \;\le\; \sqrt{\tfrac{2}{\pi N}}\, \Big(1 - \tfrac{s^2}{N^2}\Big)^{-(N+1)/2} \Big(\tfrac{1 - s/N}{1 + s/N}\Big)^{s/2} e^{\frac{1}{12N+1}}. \qquad (17)$$

These computations and (14) lead us to take M as the smallest integer larger than $\sqrt{n}$ such that $n - M$ is even. Indeed, from (10), (16) and (17), we obtain $\lim_{n \to +\infty} \sqrt{n}\, \big[ P(s_N = \tau) - P(s_N = M) \big] = c$, where $c = \sqrt{2/\pi}\,(1 - e^{-1/2}) > 0$. Therefore, for n large enough, we have

$$P(E) \;\ge\; \frac{c}{2\sqrt{n}} \Big(\frac{1-\delta}{2}\Big)^{2\tau} \Big(\frac{1-\delta}{1+\delta}\Big)^{M/2} (1-\delta^2)^{N/2}. \qquad (18)$$

The last two factors on the r.h.s. of (18) lead us to take $\delta$ of order $1/\sqrt{n}$, up to possibly a logarithmic term. We obtain the following lower bound on the excursion probability.

Lemma 3 If $\delta = C_0 \sqrt{(\log n)/n}$ with $C_0$ a positive constant, then for any n large enough, $P(E) \ge n^{-C_1}$ for some constant $C_1 > 0$.
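As a side illustration (not part of the proof), the reconstructed bounds (16)-(17) can be checked numerically against the exact binomial probability (15); the values of N and s below are arbitrary.

```python
import math

def p_exact(N, s):
    """P(s_N = s) for the simple symmetric random walk, Equation (15)."""
    if (N - s) % 2:
        return 0.0
    return math.comb(N, (N + s) // 2) / 2 ** N

def p_lower(N, s):
    """Lower bound (16)."""
    r = s / N
    return math.sqrt(2 / (math.pi * N)) * (1 - r * r) ** (-N / 2) \
        * ((1 - r) / (1 + r)) ** (s / 2) \
        * math.exp(-1 / (6 * (N + s)) - 1 / (6 * (N - s)))

def p_upper(N, s):
    """Upper bound (17)."""
    r = s / N
    return math.sqrt(2 / (math.pi * N)) * (1 - r * r) ** (-(N + 1) / 2) \
        * ((1 - r) / (1 + r)) ** (s / 2) * math.exp(1 / (12 * N + 1))

N = 1000                       # illustrative size
for s in (0, 10, 30):          # chosen so that N - s is even
    print(s, p_lower(N, s) <= p_exact(N, s) <= p_upper(N, s))
```

Each line should print True, confirming that the exact probability is indeed sandwiched between the two Stirling-based estimates for these values.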
5.2.4 Behavior of the progressive indirect mixture rule on the excursion event

From now on, we work on the event E. We have $\hat g_{\mathrm{pim}} = \big(\sum_{i=0}^{n} \hat h_i\big)/(n+1)$. We still use $\psi \equiv \ell(y_1, \tilde y_2) - \ell(y_1, \tilde y_1) = \ell(y_2, \tilde y_1) - \ell(y_2, \tilde y_2)$. On the event E, for any $x \in X$ and any $i \in \{\tau, \dots, n\}$, by definition of $\hat h_i$, we have

$$\ell[y_2, \hat h_i(x)] - \ell(y_2, \tilde y_2) \;\le\; -\tfrac1\lambda \log \mathbb{E}_{g \sim \hat\pi_i}\, e^{-\lambda \{\ell[y_2, g(x)] - \ell(y_2, \tilde y_2)\}} \;=\; -\tfrac1\lambda \log\big( 1 - (1 - e^{-\lambda\psi})\, \hat\pi_i(g_1) \big) \;\le\; -\tfrac1\lambda \log\Big( 1 - \frac{1 - e^{-\lambda\psi}}{n+1} \Big).$$

In particular, for any n large enough, we have $\ell[y_2, \hat h_i(x)] - \ell(y_2, \tilde y_2) \le C n^{-1}$, with $C > 0$ independent of $\delta$. From the convexity of the function $y' \mapsto \ell(y_2, y')$ and by Jensen's inequality, we obtain

$$\ell[y_2, \hat g_{\mathrm{pim}}(x)] - \ell(y_2, \tilde y_2) \;\le\; \tfrac{1}{n+1} \sum_{i=0}^{n} \big( \ell[y_2, \hat h_i(x)] - \ell(y_2, \tilde y_2) \big) \;\le\; \frac{\tau\psi}{n+1} + C n^{-1} \;\le\; C_1 \frac{\log n}{n}$$

for some constant $C_1 > 0$ independent of $\delta$ (for the first $\tau$ terms we simply bound $\ell[y_2, \hat h_i(x)] - \ell(y_2, \tilde y_2)$ by $\psi$). Let us now prove that, for n large enough, we have

$$\tilde y_2 \;\le\; \hat g_{\mathrm{pim}}(x) \;\le\; \tilde y_2 + C' \sqrt{\frac{\log n}{n}} \;\le\; \tilde y_1, \qquad (19)$$

with $C' > 0$ independent of $\delta$. From (19) and the symmetry assumption $\ell(y_2, y') = \ell(y_1, 2a - y')$, we obtain

$$R(\hat g_{\mathrm{pim}}) - R(g_1) = \mathbb{E}_X\Big\{ \tfrac{1+\delta}{2}\big[\ell_{y_1}(\hat g_{\mathrm{pim}}(X)) - \ell_{y_1}(\tilde y_1)\big] + \tfrac{1-\delta}{2}\big[\ell_{y_1}(2a - \hat g_{\mathrm{pim}}(X)) - \ell_{y_1}(\tilde y_2)\big] \Big\} \;\ge\; \delta\psi - C_2 \sqrt{\frac{\log n}{n}}, \qquad (20)$$

with $C_2 > 0$ independent of $\delta$ (the terms linear in $\hat g_{\mathrm{pim}} - \tilde y_2$ are controlled through $|\ell_{y_1}'(\tilde y_2)|$ and $\ell_{y_1}'(\tilde y_1)$). We may take $\delta = (2C_2/\psi)\sqrt{(\log n)/n}$ and obtain: for n large enough, on the event E, we have $R(\hat g_{\mathrm{pim}}) - R(g_1) \ge C_2 \sqrt{(\log n)/n}$. From Lemma 3, this inequality holds with probability at least $1/n^{C_4}$ for some $C_4 > 0$. To conclude, for any n large enough, there exists $\epsilon > 0$ such that, with probability at least $\epsilon$,

$$R(\hat g_{\mathrm{pim}}) - R(g_1) \;\ge\; c \sqrt{\frac{\log(e\epsilon^{-1})}{n}},$$

where c is a positive constant depending only on the loss function $\ell$, the symmetry parameter a and the output values $y_1$ and $\tilde y_1$.

References

[1] G. Lecué. Suboptimality of penalized empirical risk minimization in classification. In Proceedings of the 20th Annual Conference on Computational Learning Theory, 2007.
[2] A. Barron. Are Bayes rules consistent in information? In T.M. Cover and B. Gopinath, editors, Open Problems in Communication and Computation, pages 85–91. Springer, 1987.
[3] O. Catoni. A mixture approach to universal model selection. Preprint LMENS 97-30, available from http://www.dma.ens.fr/edition/preprints/Index.97.html, 1997.
[4] A. Barron and Y. Yang. Information-theoretic determination of minimax rates of convergence. Ann. Stat., 27(5):1564–1599, 1999.
[5] O. Catoni. Universal aggregation rules with exact bias bound. Preprint n.510, http://www.proba.jussieu.fr/mathdoc/preprints/index.html#1999, 1999.
[6] G. Blanchard. The progressive mixture estimator for regression trees. Ann. Inst. Henri Poincaré, Probab. Stat., 35(6):793–820, 1999.
[7] Y. Yang. Combining different procedures for adaptive regression. Journal of Multivariate Analysis, 74:135–161, 2000.
[8] F. Bunea and A. Nobel. Sequential procedures for aggregating arbitrary estimators of a conditional mean. Technical report, 2005.
[9] A. Juditsky, P. Rigollet, and A.B. Tsybakov. Learning by mirror averaging. Preprint n.1034, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 and Paris 7, 2005.
[10] J.-Y. Audibert. A randomized online learning algorithm for better variance control. In Proceedings of the 19th Annual Conference on Computational Learning Theory, pages 392–407, 2006.
[11] V.G. Vovk. Aggregating strategies. In Proceedings of the 3rd Annual Workshop on Computational Learning Theory, pages 371–386, 1990.
[12] D. Haussler, J. Kivinen, and M.K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Trans. on Information Theory, 44(5):1906–1925, 1998.
[13] V.G. Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, pages 153–173, 1998.
[14] M. Wegkamp. Model selection in nonparametric regression. Ann. Stat., 31(1):252–273, 2003.
[15] T. Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Proceedings of the 18th Annual Conference on Computational Learning Theory, pages 173–187, 2005.
[16] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
[17] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.