Prediction with Exp ert Advice for the Brier Game

Vladimir Vovk vovk@cs.rhul.ac.uk Fedor Zhdanov fedor@cs.rhul.ac.uk Computer Learning Research Centre, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK

Abstract
We show that the Brier game of prediction is mixable and find the optimal learning rate and substitution function for it. The resulting prediction algorithm is applied to predict results of football and tennis matches. The theoretical performance guarantee turns out to be rather tight on these data sets, especially in the case of the more extensive tennis data.

In this paper we concentrate on the square loss function. In the binary case, its mixability was demonstrated in Vovk (1990). There are two natural directions in which this result could be generalized: Regression: observations are real numbers (squareloss regression is a standard problem in statistics). Classification: observations take values in a finite set (this leads to the "Brier game", to be defined below, a standard way of measuring the quality of predictions in meteorology and other applied fields: see, e.g., Dawid, 1986). The mixability of the square loss function in the case of observations belonging to a bounded interval of real numbers was demonstrated in Haussler et al. (1998); Haussler et al.'s algorithm was simplified in Vovk (2001). Surprisingly, the case of square-loss non-binary classification has never been analysed in the framework of prediction with expert advice. The purpose of this paper is to fill this gap. The full version (Vovk & Zhdanov, 2008) of this paper is available on arXiv.

1. Introduction
The paradigm of prediction with expert advice was introduced in the late 1980s (see, e.g., Littlestone & Warmuth, 1994, Cesa-Bianchi et al., 1997) and has been applied to various loss functions; see CesaBianchi and Lugosi (2006) for a recent book-length review. An especially important class of loss functions is that of "mixable" ones, for which the learner's loss can be made as small as the best expert's loss plus a constant (depending on the number of experts). It is known (Haussler et al., 1998; Vovk, 1998) that the optimal additive constant is attained by the "strong aggregating algorithm" proposed in Vovk (1990) (we use the adjective "strong" to distinguish it from the "weak aggregating algorithm" of Kalnishkan & Vyugin, 2005). There are several important loss functions that have been shown to be mixable and for which the optimal additive constant has been found. The prime examples in the case of binary observations are the log loss function and the square loss function. The log loss function, whose mixability is obvious, has been explored extensively, along with its important generalizations, the Kullback­Leibler divergence and Cover's loss function.
Appearing in Proceedings of the 25 th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

2. Prediction Algorithm and Loss Bound
A game of prediction consists of three components: the observation space , the decision space , and the loss function  :  з   R. In this paper we are interested in the following Brier game (Brier, 1950):  is a finite and non-empty set,  := P () is the set of all probability measures on , and o 2 ( ,  ) = ( {o} -  {o}) ,


where   P () is the probability measure concentrated at  :  { } = 1 and  {o} = 0 for o =  . (For example, if  = {1, 2, 3},  = 1,  {1} = 1/2,  {2} = 1/4, and  {3} = 1/4, ( ,  ) = (1/2 - 1)2 + (1/4 - 0)2 + (1/4 - 0)2 = 3/8.)


Prediction with Exp ert Advice for the Brier Game

The game of prediction is being played repeatedly by a learner having access to decisions made by a pool of experts, which leads to the following prediction protocol: Proto col 1 Prediction with expert advice L0 := 0. Lk := 0, k = 1, . . . , K . 0 for N = 1, 2, . . . do k Expert k announces N  , k = 1, . . . , K . Learner announces N  . Reality announces N  . LN := LN -1 + (N , N ). k Lk := Lk -1 + (N , N ), k = 1, . . . , K . N N end for At each step of Protocol 1 Learner is given K experts' advice and is required to come up with his own decision; LN is his cumulative loss over the first N steps, and Lk is the k th expert's cumulative loss over the N first N steps. In the case of the Brier game, the decisions are probability forecasts for the next observation. An optimal (in the sense of Theorem 1 below) strategy for Learner in prediction with expert advice for the Brier game is given by the strong aggregating algorithm. For each expert k , the algorithm maintains its weight wk , constantly slashing the weights of less successful experts. Algorithm 1 Strong aggregating algorithm for the Brier game k w0 := 1, k = 1, . . . , K . for N = 1, 2, . . . do k Read the Experts' predictions N , k = 1, . . . , K . K k k Set GN ( ) := - ln k=1 wN -1 e-(,N ) ,   .  + Solve  (s - GN ( )) = 2 in s  R. Set N { } := (s - GN ( ))+ /2,   . Output prediction N  P (). Read observation N . k k k wN := wN -1 e-(N ,N ) . end for The algorithm will be derived in Section 5. The following result (to be proved in Section 4) gives a performance guarantee for it that cannot be improved by any other prediction algorithm. Theorem 1. Using Algorithm 1 as Learner's strategy in Protocol 1 for the Brier game guarantees that LN 
k=1,...,K

have a strategy guaranteeing LN  for al l N = 1, 2, . . . .
k=1,...,K

min

Lk + A N

(2)

3. Exp erimental Results
In our first empirical study of Algorithm 1 we use historical data about 6416 matches in various English football league competitions, namely: the Premier League (the pinnacle of the English football system), the Football League Championship, Football League One, Football League Two, the Football Conference. Our data, provided by Football-Data, cover two full seasons, 2005/2006 and 2006/2007, and part of the 2007/2008 season (which ends in May shortly after the paper submission deadline). The matches are sorted first by date and then by league. In the terminology of our prediction protocol, the outcome of each match is the observation, taking one of three possible values, "home win", "draw", or "away win"; we will encode the possible values as 1, 2, and 3. For each match we have forecasts made by a range of bookmakers. We chose eight bookmakers for which we have enough data over a long period of time, namely Bet365, Bet&Win, Gamebookers, Interwetten, Ladbrokes, Sportingbet, Stan James, and VC Bet. (And the seasons mentioned above were chosen because the forecasts of these bookmakers are available for them.) A probability forecast for the next observation is essentially a vector (p1 , p2 , p3 ) consisting of positive numbers summing to 1. The bookmakers do not announce these numbers directly; instead, they quote three betting odds, a1 , a2 , and a3 . Each number ai is the amount which the bookmaker undertakes to pay out to a client betting on outcome i per unit stake in the event that i happens (the stake itself is never returned to the bettor, which makes all betting odds greater than 1; i.e., the odds are announced according to the "continental" rather than "traditional" system). The inverse value 1/ai , i  {1, 2, 3}, can be interpreted as the bookmaker's quoted probability for the observation i. The bookmaker's quoted probabilities are usually slightly (because of the competition with other bookmakers) in his favour: the sum 1/a1 + 1/a2 + 1/a3 exceeds 1 by the amount called the overround (at most 0.15 in the vast ma jority of cases). We used pi := (1) 1/ai , 1/a1 + 1/a2 + 1/a3 i = 1 , 2, 3, (3)

min

Lk + ln K N

for al l N = 1, 2, . . . . If A < ln K , Learner does not

as the bookmaker's forecasts; it is clear that p1 + p2 + p3 = 1.


Prediction with Exp ert Advice for the Brier Game

The results of applying Algorithm 1 to the football data, with 8 experts and 3 possible observations, are shown in Figure 1. Let Lk be the cumulative loss of N Expert k , k = 1, . . . , 8, over the first N matches and LN be the corresponding number for Algorithm 1 (i.e., we essentially continue to use the notation of Theorem 1). The dashed line corresponding to Expert k shows the excess loss N  Lk - LN of Expert k over AlN gorithm 1. The excess loss can be negative, but from Theorem 1 we know that it cannot be less than - ln 8; this lower bound is also shown in Figure 1. Finally, the thick line (the positive part of the x axis) is drawn for comparison: this is the excess loss of Algorithm 1 over itself. We can see that at each moment in time the algorithm's cumulative loss is fairly close to the cumulative loss of the best expert (at that time; the best expert keeps changing over the time).

The results in Figure 2 are presented in the same way as in Figure 1. Typical values of the overround are below 0.1.

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 -1 -2 0 1000

Theoretical bound Algorithm 1 Experts

14 12 10 8 6 4 2 0 -2 0 1000 2000 3000 4000 5000 6000 Theoretical bound Algorithm 1 Experts

2000

3000

4000

5000

6000

7000

8000

9000 10000

Figure 2. The difference between the cumulative loss of each of the 4 bookmakers and of Algorithm 1 on the tennis data. Now the theoretical bound is - ln 4.

In both Figure 1 and Figure 2 the cumulative loss of Algorithm 1 is close to the cumulative loss of the best expert, despite the fact that some of the experts perform poorly. The theoretical bound is not hopelessly loose for the football data and is rather tight for the tennis data. The pictures look exactly the same when Algorithm 1 is applied in the more realistic manner where the weights wk are not updated over the matches that are played simultaneously. Our second empirical study (Figure 2) is about binary prediction, and so the algorithm of Vovk (1990) could have also been used (and would have given similar results). We included it since we are not aware of any empirical studies even for the binary case. Other popular algorithms for prediction with expert advice that could be used instead of Algorithm 1 in our empirical studies are Kivinen and Warmuth's (1999) Weighted Average Algorithm (WAA) and Freund and Schapire's (1997) Hedge algorithm (HA). The performance guarantees for these two algorithms are much weaker than the optimal (1), especially in the case of the HA (even if the loss bound given in Freund & Schapire, 1997, is replaced by the stronger bound given in Vovk, 1998, Example 7). The weak performance guarantees show in the empirical performance of the algorithms. For the football data the maximal difference between the cumulative loss of both the WAA and the HA and the cumulative loss of the best

Figure 1. The difference between the cumulative loss of each of the 8 bookmakers (experts) and of Algorithm 1 on the football data. The theoretical lower bound - ln 8 from Theorem 1 is also shown.

Figure 2 shows the results of another empirical study, involving data about a large number of tennis tournaments in 2004, 2005, 2006, and 2007, with the total number of matches 10,087. The tournaments include, e.g., Australian Open, French Open, Wimbledon, and US Open; the data is provided by TennisData. The matches are sorted by date, then by tournament. The data contain information about the winner of each match and the betting odds of 4 bookmakers for his/her win and for the opponent's win. Therefore, now there are two possible observations (player 1's win and player 2's win). There are four bookmakers: Bet365, Centrebet, Expekt, and Pinnacle Sports.


Prediction with Exp ert Advice for the Brier Game

expert is about twice as large as that for Algorithm 1 (and so is approximately equal to the optimal bound ln K given by (1)). For the tennis data the maximal difference for the WAA is about three times as large as for Algorithm 1, and for the HA it is about twice as large; therefore, both algorithms violate the optimal bound ln K . For further details, see Vovk and Zhdanov (2008). The data used for producing Figures 1 and 2 can be downloaded from http://vovk.net/ICML2008.

(towards the origin or away from it). The standard argument (as in Thorpe, 1979, Chapter 12, Theorem 6) based on the continuity of the smallest principal curvature shows that the  -exponential superprediction set is bulging away from the origin for small enough  : indeed, since it is true at some point, it is true everywhere on the surface. By the continuity in  this is also true for all  < 1. Now, since the  -exponential superprediction set is convex for all  < 1, it is also convex for  = 1. Let us now check that the Gauss­Kronecker curvature of the  -exponential superprediction surface is always positive when  < 1 and is sometimes negative when  > 1 (the rest of the proof, an elaboration of the above argument, will be easy). Set n := ||; without loss of generality we assume  = {1, . . . , n}. A convenient parametric representation of the  exponential superprediction surface is    

4. Pro of of Theorem 1
This proof will use some basic notions of elementary differential geometry, especially those connected with the Gauss­Kronecker curvature of surfaces. (The use of curvature in this kind of results is standard: see, e.g., Vovk, 1990, and Haussler et al., 1998.) All definitions that we will need can be found in, e.g., Thorpe, 1979. A vector f  R (understood to be a function f :   R) is a superprediction if there is    such that, for all   , ( ,  )  f ( ); the set  of all superpredictions is the superprediction set. For each learning rate  > 0, let  : R  (0, ) be the homeomorphism defined by  (f ) :     e-f () , f  R .

          , =      12 n-1 2 n2 n-1  x e-((u ) +ЗЗЗ+(u -1) +(u ) )  12 n-1 2 n 2 xn e-((u ) +ЗЗЗ+(u ) +(u -1) )

x1 x2 . . .

e-((u -1) +(u ) +ЗЗЗ+(u ) ) 12 2 2 n2 e-((u ) +(u -1) +ЗЗЗ+(u ) ) . . .

1

2

22

n2

(4)

The image  () of the superprediction set will be called the  -exponential superprediction set. It is known that LN 
k=1,...,K

min

Lk + N

ln K 

where u1 , . . . , un-1 are the coordinates on the surface, u1 , . . . , un-1  (0, 1) sub ject to u1 + З З З un-1 < 1, and un is a shorthand for 1 - u1 - З З З - un-1 . The derivative of (4) in u1 is    u1  

can be guaranteed if and only if the  -exponential superprediction set is convex (part "if " for all K and part "only if " for K   are proved in Vovk, 1998; part "only if " for all K is proved by Chris Watkins, and the details can be found in, e.g., Vovk, 2007, Appendix). Comparing this with (1) and (2) we can see that we are required to prove that З  () is convex when   1; З  () is not convex when  > 1. Define the  -exponential superprediction surface to be the part of the boundary of the  -exponential superprediction set  () lying inside (0, ) . The idea of the proof is to check that, for all  < 1, the Gauss­ Kronecker curvature of this surface is nowhere vanishing. Even when this is done, however, there is still uncertainty as to in which direction the surface is bulging

1 2 22 n-1 2 n2 (un - u1 + 1)e-((u -1) +(u ) +ЗЗЗ+(u ) +(u ) ) 12 2 2 n-1 2 n2    (un - u1 )e-((u ) +(u -1) +ЗЗЗ+(u ) +(u ) )    .   . .     12 22 n-1 2 n2  (un - u1 )e-((u ) +(u ) +ЗЗЗ+(u -1) +(u ) )  12 22 n-1 2 n 2 (un - u1 - 1)e-((u ) +(u ) +ЗЗЗ+(u ) +(u -1) )   1 (un - u1 + 1)e2u 2    (un - u1 )e2u    . , .    . n n-1   (u - u1 )e2u  n (un - u1 - 1)e2u

        = 2 з   n-1  x xn

x1 x2 . . .


Prediction with Exp ert Advice for the Brier Game

the derivative in u2 is   1  1 (un - u2 )e2u x  x2   (un - u2 + 1)e2u2     .  .  . , .   . 2 .   u  n-1   n-1  x   (un - u2 )e2u  n 2 2 u n xn (u - u - 1)e and so on, up to   1  1 (un - un-1 )e2u x  x2   (un - un-1 )e2u2      .  .  . , .   . n-1  .    u n xn-1  (u - un-1 + 1)e2un-1   n xn (un - un-1 - 1)e2u all coefficients of proportionality being equal and positive. Let us set v i,j := (un - ui )e2u and wi := (un - ui ), for purely typographical reasons. A normal vector to the surface can be found as Z := e
1
j

( -1)n (un - u1 - 1) + (-1)n+1 (u1 - u2 ) = + (-1)n+1 (u1 - u3 ) + З З З + (-1)n+1 (u1 - un-1 ) e - 2 u
1

e-2u (-1)n з (2 = u + u3 + З З З + un ) - (n - 1)u1 - 1 -e-2u (-1)n nu1  u1 e-2u
1 1

1

(5)

(with a positive coefficient of proportionality, e2 , in the first ; the third equality follows from the expansion of the determinant along the last column and then along the first row). Similarly, the coefficient in front of ei is proportional (with the same coefficient of proportionality) to i ui e-2u for i = 2, . . . , n -1; indeed, the (n -1)з(n- 1) determinant representing the coefficient in front of ei can be reduced to the form analogous to (5) by moving the ith row to the top. The coefficient in front of en is proportional to = w1 +1 w1 ЗЗЗ w1 w1 2 2 2 w w + 1 ЗЗЗ w w2 n . . . .. . . . e - 2 u . . . . n-2 n-2 n-2 w w ЗЗЗ w +1 wn-2 n-1 n-1 n-1 n-1 w w ЗЗЗ w w +1 1 = 1 0 ЗЗЗ 0 w 0 1 ЗЗЗ 0 w2 n . . . . .. . . . . e-2u . . . . . n-2 0 0 ЗЗЗ 1 w -1 -1 З З З -1 wn-1 + 1 1 = 0 ЗЗЗ 0 un - u1 0 1 ЗЗЗ 0 un - u2 n. n . . . e-2u . . . . . . nun e-2u .. . . . 00 00 ЗЗЗ ЗЗЗ 1 un - un-2 0 nun

v 1, 1 + e 2 u . . . v n-1,1

1

ЗЗЗ ЗЗЗ .. . ЗЗЗ

en-1 v 1,n-1 . . . v n-1,n-1 -1 n + e 2 u

en n v 1,n - e2u . . . v n-1,n - e2u

.

n

The coefficient in front of e1 is the (n - 1) з (n - 1) determinant  v 1, 2 n ЗЗЗ v 1,n-1 v 1,n - e2u 2 n v 2,2 + e2u З З З v 2,n-1 v 2,n - e2u . . . .. . . . . . . . v n-1,2 v n-1,n-1 -1 v n-1,n - e2u n + e 2 u w1 = ЗЗЗ w1 w1 - 1 2 2 ЗЗЗ w w2 - 1 1 w +1 e - 2 u . . . .. . . . . . . . wn-1 З З З wn-1 + 1 wn-1 - 1 1 = 1 ЗЗЗ 1 w1 - 1 2 1 ЗЗЗ 1 w2 - 1 1 w3 - 1 e-2u 1 2 З З З 1 . . .. . . .. . .. .. . . ЗЗЗ 1 ЗЗЗ 1 0 ЗЗЗ 0 1 ЗЗЗ . . .. .. . .. 0 0 ЗЗЗ 1 11 ЗЗЗ 2 wn-1 - 1 1 un - u1 - 1 0 u1 - u2 0 u1 - u3 . . . . . . 1 1 u - un-1
n

(with the coefficient of proportionality e2 (-1)n-1 ). The Gauss­Kronecker curvature at the point with coordinates (u1 , . . . , un-1 ) is proportional (with a positive coefficient of proportionality, possibly depending on the point) to ( ZT
 u1

. . .

Z = (Thorpe, 1979, Chapter 12, Theorem 5, with ing for transposition).
i

ZT  u n- 1 T

6)

T

stand-

e - 2 u

1

Set v i := (1 - 2 ui )e-2u and wi = ui e-2u , again for typographical reasons. A straightforward calculation allows us to rewrite determinant (6) (ignoring the

i


Prediction with Exp ert Advice for the Brier Game

positive coefficient ((-1)n-1 ne2 )n ) as v1 0 . . . 0 w1 0 v2 . . . ЗЗЗ ЗЗЗ .. . 0 0 . . . -v n -v n . . . 

have t1 + З З З + tn-1 > n - 2, and so all of t1 , . . . , tn-1 are positive; this shows that (9) is indeed true. Let us prove (10). Since t1 З З З tn > 0, all of t1 , . . . , tn are positive (if two of them were negative, the sum t1 + З З З + tn would be less than n - 2; cf. (8)). Therefore, 1 1 + ЗЗЗ + > 1 + З З З + 1 = n. n t1 tn
times

0 З З З v n-1 -v n w2 З З З wn-1 wn 1 =1 - 2 u 0 ЗЗЗ 0 0 1 - 2 u2 З З З 0 . . . .. . . . . . . . 0 0 З З З 1 - 2 un-1 u1 u2 ЗЗЗ un-1

- 1 + 2 u n - 1 + 2 u n - 1 + 2 u un
n

u1 (1 - 2 u2 )(1 - 2 u3 ) З З З (1 - 2 un ) + u2 (1 - 2 u1 )(1 - 2 u3 ) З З З (1 - 2 un ) + З З З + un (1 - 2 u1 )(1 - 2 u2 ) З З З (1 - 2 un-1 ) (7)

To establish (9) it remains to prove (11). Suppose, without loss of generality, that t1 > 0, t2 > 0,. . . , tn-1 > 0, and tn < 0. Since the function t  (0, 1]  1/t is convex, we can also assume, without loss of generality, t1 = З З З = tn-2 = 1. Then tn-1 + tn > 0, and so 1 1 + < 0; tn-1 tn therefore, 1 1 1 1 + ЗЗЗ + + + < n - 2 < n. t1 tn-2 tn-1 tn Finally, let us check that the positivity of the Gauss­ Kronecker curvature implies the convexity of the  exponential superprediction set, for   1. Because of the continuity of the  -exponential superprediction surface in  we can and will assume, without loss of generality, that  < 1. The  -exponential superprediction surface will be oriented by choosing the normal vector field directed towards the origin; this can be done since  1   2 u 1   1 - 2 u 1  x e -u e .  .   . .    . , Z   . .   , (12) . . xn e 2 u
n

(with a positive coefficient of proportionality; to avoid calculation of the parities of various permutations, the reader might prefer to prove the last equality by induction in n, expanding the last determinant along the first column). Our goal is to show that the last expression in (7) is positive when  < 1 but can be negative when  > 1. If  > 1, set u1 = u2 := 1/2 and u3 = З З З = un := 0. The last expression in (7) becomes negative. Therefore, the  -exponential superprediction set is not convex (Thorpe, 1979, Chapter 13, Theorem 1). It remains to consider the case  < 1. Set ti := 1 - 2 ui , i = 1, . . . , n; the constraints on the ti are - 1 < 1 - 2 < ti < 1, i = 1, . . . , n, t1 + З З З + tn = n - 2 > n - 2. (8) Our goal is to prove (1 - t1 )t2 t3 З З З tn + З З З + (1 - tn )t1 t2 З З З tn-1 > 0, i.e., t2 t3 З З З tn + З З З + t1 t2 З З З tn-1 > nt1 З З З tn . This reduces to 1 1 + ЗЗЗ + >n t1 tn if t1 З З З tn > 0, and to 1 1 + ЗЗЗ + <n t1 tn (11) (10) (9)

-un e-2u

n

with the first coefficient of proportionality positive (cf. (4) and the bottom row of the first determinant in (7)), and the scalar product of the two vectors in (12) is always negative. Let us first check that the smallest principal curvature k1 = k1 (u1 , . . . , un-1 ,  ) of the  -exponential superprediction surface is always positive (among the arguments of k1 we list not only the coordinates u1 , . . . , un-1 of a point on the surface (4) but also the learning rate   (0, 1)). At least at some (u1 , . . . , un-1 ,  ) the value of k1 (u1 , . . . , un-1 ,  ) is positive: take a sufficiently small  and the point on the surface (4) at which the maximum of x1 + З З З + xn is attained (the point of the  -exponential superprediction set at which the maximum is attained will

if t1 З З З tn < 0. The remaining case is where some of the ti are zero; for concreteness, let tn = 0. By (8) we


Prediction with Exp ert Advice for the Brier Game

lie on the surface since the maximum is attained at (x1 , . . . , xn ) = (1, . . . , 1) when  = 0). Therefore, for all (u1 , . . . , un-1 ,  ) the value of k1 (u1 , . . . , un-1 ,  ) is positive: if k1 had different signs at two points in the set ( u1 , . . . , un-1 ,  ) | u1  (0, 1), . . . , un-1  (0, 1), , u1 + З З З + un-1 < 1,   (0, 1) (13)

(cf. (4)) satisfying 1  l1 , . . . , n  ln . (15)

we could connect these points by a continuous curve lying completely inside (13); at some point on the curve, k1 would be zero, in contradiction to the positivity of the Gauss­Kronecker curvature k1 З З З kn-1 . Now it is easy to show that the  -exponential superprediction set is convex. Suppose there are two points A and B on the  -exponential superprediction surface such that the interval [A, B ] contains points outside the  -exponential superprediction set. The intersection of the plane OAB , where O is the origin, with the  -exponential superprediction surface is a planar curve; the curvature of this curve at the point between A and B closest to the origin will be negative (with the curve oriented by directing the normal vector field towards the origin), contradicting the positivity of k1 at that point and Meusnier's theorem (cf. (12)).

Now suppose we are given a generalized prediction (L1 , . . . , Ln )T computed by the APA from an unnormalized distribution on the experts; in other words, we are given   L1 l1 + c .  .   . = .  . . Ln ln + c for some c  R. To find (14) satisfying (15) we can first find the largest t  R such that (L1 - t, . . . , Ln - t)T is still a superprediction and then find (14) satisfying 1  L1 - t, . . . , n  Ln - t. (16)

Since t  c, it is clear that (1 , . . . , n )T will also satisfy the required (15). Prop osition 1. Define s  R by the requirement in
=1

(s - Li )+ = 2.

(17)

The unique solution to the optimization problem t  max under the constraints (16) with 1 , . . . , n as in (14) wil l be ui = (s - Li )+ , i = 1, . . . , n, 2 t = s - 1 - (u1 )2 - З З З - (un )2 . (18) (19)

5. Derivation of the Prediction Algorithm
To achieve the loss bound (1) in Theorem 1 Learner can use, as discussed earlier, the strong aggregating algorithm (see, e.g., Vovk, 2001, Section 2.1, (15)) with  = 1. In this section we will find a substitution function for the strong aggregating algorithm for the Brier game with   1, which is the only component of the algorithm not described explicitly in Vovk (2001). Our substitution function will not require that its input, the generalized prediction, should be computed from the normalized distribution (wk )K 1 on the exk= perts; this is a valuable feature for generalizations to an infinite number of experts (as demonstrated in, e.g., Vovk, 2001, Appendix A.1). Suppose that we are given a generalized prediction (l1 , . . . , ln )T computed by the aggregating pseudoalgorithm from a normalized distribution on the experts. Since (l1 , . . . , ln )T is a superprediction (remember that we are assuming   1), we are only required to find a permitted prediction  1  1 (u - 1)2 + (u2 )2 + З З З + (un )2  2  (u1 )2 + (u2 - 1)2 + З З З + (un )2     . =  (14) . . .   . . n (u ) + (u ) + З З З + (u - 1)
12 22 n 2

There exists a unique s satisfying (17) since the lefthand side of (17) is a continuous, increasing (strictly increasing when positive) and unbounded above function of s. The substitution function is given by (18). Proof of Proposition 1. Let us denote the ui and t defined by (18) and (19) as ui and t, respectively. To see that they satisfy the constraints (16), notice that the ith constraint can be spelt out as (u1 )2 + З З З + (un )2 - 2ui + 1  Li - t, which immediately follows from (18) and (19). As a by-product, we can see that the inequality becomes an equality, i.e., t = Li - 1 + 2ui - (u1 )2 - З З З - (un )2 , for all i with ui > 0. We can rewrite (16) as   t  L1 - 1 + 2u1 - (u1 )2 - З З З - (un )2 ,  . . .   n t  Ln - 1 + 2u - (u1 )2 - З З З - (un )2 , (20)

(21)


Prediction with Exp ert Advice for the Brier Game

and our goal is to prove that these inequalities imply t < t (unless u1 = u1 , . . . , un = un ). Choose ui (necessarily ui > 0 unless u1 = u1 , . . . , un = un ; in the latter case, however, we can, and will, also choose ui > 0) for which i := ui - ui is maximal. Then every value of t satisfying (21) will also satisfy t  Li - 1 + 2u -
i

How to use expert advice. Journal of the Association for Computing Machinery, 44, 427­485. Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge, England: Cambridge University Press. Dawid, A. P. (1986). Probability forecasting. In S. Kotz, N. L. Johnson and C. B. Read (Eds.), Encyclopedia of statistical sciences, vol. 7, 210­218. New York: Wiley.

jn
=1

(u )

j2

= Li - 1 + 2ui - 2 i -  Li - 1 + 2ui - jn
=1

jn
=1

(uj )2 + 2 jn
=1

jn
=1

juj -

jn
=1

2 j

(uj )2 -

2  t, j

Freund, Y., & Schapire, R. E. (1997). A decisiontheoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119­139. Haussler, D., Kivinen, J., & Warmuth, M. K. (1998). Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44, 1906­1925. Kalnishkan, Y., & Vyugin, M. V. (2005). The Weak Aggregating Algorithm and weak mixability. Proceedings of the Eighteenth Annual Conference on Learning Theory (pp. 188­203). Berlin: Springer. Kivinen, J., & Warmuth, M. K. (1999). Averaging expert predictions. Proceedings of the Fourth European Conference on Computational Learning Theory (pp. 153­167). Berlin: Springer. Littlestone, N., & Warmuth, M. K. (1994). The Weighted Ma jority Algorithm. Information and Computation, 108, 212­261. Thorpe, J. A. (1979). Elementary topics in differential geometry. New York: Springer. Vovk, V. (1990). Aggregating strategies. Proceedings of the Third Annual Workshop on Computational Learning Theory (pp. 371­383). San Mateo, CA: Morgan Kaufmann. Vovk, V. (1998). A game of prediction with expert advice. Journal of Computer and System Sciences, 56, 153­173. Vovk, V. (2001). Competitive on-line statistics. International Statistical Review, 69, 213­248. Vovk, V. (2007). Defensive forecasting for optimal prediction with expert advice (Technical Report arXiv:0708.1503 [cs.LG]). arXiv.org e-Print archive. Vovk, V., & Zhdanov, F. (2008). Prediction with expert advice for the Brier game (Technical Report arXiv:0708.2502v2 [cs.LG]). arXiv.org e-Print archive.

with the last  following from (20) and becoming < when not all uj coincide with uj . The detailed description of the resulting prediction algorithm was given as Algorithm 1 in Section 2. As discussed, that algorithm uses the generalized prediction GN ( ) computed from unnormalized weights.

6. Conclusion
In this paper we only considered the simplest prediction problem for the Brier game: competing with a finite pool of experts. In the case of square-loss regression, it is possible to find efficient closed-form prediction algorithms competitive with linear functions (see, e.g., Cesa-Bianchi & Lugosi, 2006, Chapter 11). Such algorithms can often be "kernelized" to obtain prediction algorithms competitive with reproducing kernel Hilbert spaces of prediction rules. This would be an appealing research programme in the case of the Brier game as well. Acknowledgments We are grateful to Football-Data and Tennis-Data for providing access to the data used in this paper. This work was partly supported by EPSRC (grant EP/F002998/1). Comments by Alexey Chernov, Alex Gammerman, Yuri Kalnishkan, and anonymous referees have helped us improve the presentation.

References
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78, 1­3. Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., & Warmuth, M. K. (1997).