Surrogate Regret Bounds for Proper Losses

Mark D. Reid (MARK.REID@ANU.EDU.AU), Australian National University, Canberra, 0200, Australia
Robert C. Williamson (BOB.WILLIAMSON@ANU.EDU.AU), Australian National University and NICTA, Canberra, 0200, Australia

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We present tight surrogate regret bounds for the class of proper (i.e., Fisher consistent) losses. The bounds generalise the margin-based bounds due to Bartlett et al. (2006). The proof uses Taylor's theorem and leads to new representations for loss and regret and a simple proof of the integral representation of proper losses. We also present a different formulation of a duality result for Bregman divergences which leads to a simple demonstration of the convexity of composite losses using canonical link functions.

1. Introduction

A surrogate loss function is a loss function which is not exactly what one wishes to minimise but is easier to work with. Convex surrogate losses are commonly used in place of the 0-1 loss. Bartlett et al. (2006) derived tight bounds on the regret with respect to the 0-1 loss ℓ_{0-1} when one knows the regret with respect to a convex margin loss. Surrogate regret bounds can be viewed as a type of reduction (Beygelzimer et al., 2008) between learning problems, in which the hypothesis obtained by empirically minimising the surrogate loss is used directly for the original prediction problem. Margin losses are a subset of all possible loss functions.
We consider the general class of proper losses (defined below), which subsumes almost all margin losses. Proper losses, also known as proper scoring rules (Buja et al., 2005; Gneiting & Raftery, 2007), are the "right" sort of loss to use when one studies class probability estimation. One advantage is that they have integral representations in terms of cost-weighted losses. This representation is central to the derivation of the main result of the paper (Theorem 3).

Our main contribution is to bring together and generalise several existing results from a diversity of sources and state them in a language more familiar to machine learning researchers. Importantly, we also show how they are all based upon the integral Taylor expansion of the conditional Bayes risk for (proper) surrogate losses.¹ Given a proper loss, the function expressing its conditional Bayes risk is trivially computed, and our results suggest that much can be gained from making it and its integral representation central objects of study in learning theory.

¹ This is similar to an insight by Liese and Vajda (2006), who use a Taylor expansion to study integral representations of f-divergences.

The main theorems regarding proper losses are stated in §1.2 below. Their proofs are provided in the remainder of the paper. The surrogate regret bound and convex link results are discussed in §4 and §5, respectively. These rely on the integral representation discussed in §3, which in turn relies on Taylor's theorem and results from convex analysis presented in §2.

1.1. Probability Estimation and Proper Losses

We write x ∧ y := min(x, y), x ∨ y := max(x, y), and ⟦p⟧ = 1 if the predicate p is true and ⟦p⟧ = 0 otherwise. The generalised function δ(·) is defined by ∫_a^b δ(x) f(x) dx = f(0) when f is continuous at 0 and a < 0 < b. The unit step is U(x) = ∫_{-∞}^x δ(t) dt. The real numbers are denoted ℝ and the non-negative reals ℝ⁺. Random variables are written in sans-serif font: X, Y. Sets are in calligraphic font: 𝒳 (the "input" space). Vectors are written in bold font: x ∈ ℝ^m. We will often take expectations (𝔼) over the random variable X; the resulting quantities are written in blackboard bold: 𝕃 and 𝔹. Proper losses (defined below) are denoted by ℓ; their associated conditional and full risks by L and 𝕃. Quantities with an intrinsic lower bound (e.g., the Bayes optimal loss) are written with an underbar: L̲, 𝕃̲. Estimated quantities are hatted: η̂.

We will call a (M-measurable) function η̂ : 𝒳 → [0, 1] a class probability estimator. Overloading the notation, we also use η̂ = η̂(x) ∈ [0, 1] to denote an estimate for a specific observation x ∈ 𝒳. Much of the subsequent argument relies on this conditional perspective. Estimate quality is assessed using a loss function ℓ : {0, 1} × [0, 1] → ℝ, and the loss of the estimate η̂ with respect to the label y ∈ {0, 1} is denoted ℓ(y, η̂). Throughout, we adopt the convention that there is no cost for perfect prediction, that is, ℓ(0, 0) = ℓ(1, 1) = 0. We also require the following technical assumption²:

  lim_{η̂→0} η̂ ℓ(1, η̂) = lim_{η̂→1} (1 − η̂) ℓ(0, η̂) = 0.   (1)

² This is equivalent to the conditions in (Savage, 1971) and (Schervish, 1989).

If η ∈ [0, 1] is the probability of observing the label y = 1, the point-wise risk of the estimate η̂ ∈ [0, 1] is defined as the η-average of the point-wise loss for η̂:

  L(η, η̂) := 𝔼_{Y∼η}[ℓ(Y, η̂)] = ℓ(0, η̂)(1 − η) + ℓ(1, η̂)η.   (2)

Here, Y ∼ η is shorthand for labels drawn from a Bernoulli distribution with parameter η. When η : 𝒳 → [0, 1] is an observation-conditional density, taking the M-average of the point-wise risk gives the (full) risk of the estimator η̂:

  𝕃(η, η̂, M) := 𝔼_{X∼M}[L(η(X), η̂(X))].

The convention of using ℓ, L and 𝕃 for the loss, point-wise risk and full risk is used throughout this paper. A natural measure of the difficulty of a task is its minimal achievable risk, or Bayes risk:

  𝕃̲(η, M) := inf_{η̂ ∈ [0,1]^𝒳} 𝕃(η, η̂, M) = 𝔼_{X∼M}[L̲(η(X))],

where, for η ∈ [0, 1],

  L̲(η) := inf_{η̂ ∈ [0,1]} L(η, η̂)

is the point-wise Bayes risk. Note the use of the underline on L and 𝕃 to indicate that the corresponding functions L and 𝕃 are minimised. If η̂ is to be interpreted as an estimate of the true positive class probability η, then it is desirable to require that L(η, η̂) be minimised by η̂ = η for all η ∈ [0, 1]. Losses that satisfy this constraint are said to be Fisher consistent and are known as proper losses (Buja et al., 2005). That is, a proper loss ℓ satisfies L̲(η) = L(η, η) for all η ∈ [0, 1]. The cost-weighted losses are a family of losses parametrised by a false positive cost c ∈ [0, 1] that defines a loss for y ∈ {0, 1} and η̂ ∈ [0, 1] by

  ℓ_c(y, η̂) = c ⟦y = 0⟧ ⟦η̂ ≥ c⟧ + (1 − c) ⟦y = 1⟧ ⟦η̂ < c⟧.   (3)
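To make these definitions concrete, here is a minimal numerical sketch (ours, not from the paper) of the cost-weighted loss (3) and the point-wise risk (2), together with a grid check that ℓ_c is minimised at η̂ = η. The function names are illustrative only.

```python
import numpy as np

def cost_weighted_loss(c, y, eta_hat):
    """Cost-weighted misclassification loss ell_c(y, eta_hat) from (3)."""
    if y == 0:
        return c * float(eta_hat >= c)        # false positive costs c
    return (1.0 - c) * float(eta_hat < c)     # false negative costs 1 - c

def pointwise_risk(loss, eta, eta_hat):
    """Point-wise risk L(eta, eta_hat) = (1 - eta) loss(0, .) + eta loss(1, .), as in (2)."""
    return (1.0 - eta) * loss(0, eta_hat) + eta * loss(1, eta_hat)

# Properness of ell_c, checked on a grid: for each eta the risk is minimised at eta_hat = eta.
c = 0.3
loss_c = lambda y, q: cost_weighted_loss(c, y, q)
grid = np.linspace(0.0, 1.0, 1001)
for eta in (0.1, 0.3, 0.7):
    risk_at_eta = pointwise_risk(loss_c, eta, eta)
    assert min(pointwise_risk(loss_c, eta, q) for q in grid) >= risk_at_eta - 1e-12
```

Note that ℓ_c is proper but not strictly proper: the risk is also minimised by any η̂ on the same side of c as η, which is why the check compares minima rather than minimisers.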
1.2. Main Results

The following important property of proper losses is originally attributed to Savage (1971) and is re-derived in Section 2. It shows that the point-wise Bayes risk of a proper loss is necessarily concave, and that a proper loss can be derived from any concave function.

Theorem 1  The point-wise Bayes risk L̲ : [0, 1] → ℝ for a proper loss ℓ is concave. Conversely, given a concave function Λ : [0, 1] → ℝ there exists a proper loss ℓ satisfying L̲(η) = Λ(η) for all η ∈ [0, 1]. Furthermore, the point-wise risk satisfies

  L(η, η̂) = L̲(η̂) + (η − η̂) L̲′(η̂).   (4)

In Section 3 we re-derive an old result (see Schervish (1989) and Gneiting and Raftery (2007) for its history) that characterises proper losses in terms of their "weight functions", a dual relationship analogous to that between a function and its Fourier transform.

Theorem 2  The function ℓ : {0, 1} × [0, 1] → ℝ is a proper loss iff for each η̂ ∈ [0, 1] and y ∈ {0, 1}

  ℓ(y, η̂) = ∫_0^1 ℓ_c(y, η̂) w(c) dc,   (5)

where the "weight function" w(c) = −L̲″(c) ≥ 0.

Our regret bound is stated next. Its proof in Section 4 shows it is directly implied by the previous results.

Theorem 3  Let c₀ ∈ (0, 1) and let B_{c₀}(η, η̂) denote the point-wise regret for the cost-weighted loss ℓ_{c₀}. Suppose it is known that B_{c₀}(η, η̂) = α. Then the point-wise regret B(η, η̂) for any proper surrogate loss ℓ with point-wise risk L and Bayes risk L̲ satisfies

  B(η, η̂) ≥ ψ(c₀, α) ∧ ψ(c₀, −α),   (6)

where ψ(c₀, α) := L̲(c₀) − L̲(c₀ + α) + α L̲′(c₀). Furthermore, (6) is tight.

By restricting attention to the case c₀ = ½ and symmetric losses we obtain, as a corollary, a result similar to that presented by Bartlett et al. (2006) for surrogate margin losses, since B_{1/2} is easily shown to be half the 0-1 regret.

Corollary 4  If L̲ is symmetric, that is, L̲(½ − c) = L̲(½ + c) for c ∈ [0, ½], and B_{1/2}(η, η̂) = α, then

  B(η, η̂) ≥ L̲(½) − L̲(½ + α).

In practice, the probability η̂ is often not estimated directly. Instead, some (linearly) parameterised hypothesis ĥ : 𝒳 → ℝ is used and converted to a probability estimate η̂ = ψ⁻¹(ĥ) using a link function ψ. Computationally, it is useful if the composite risk L(η, ψ⁻¹(ĥ)) is convex in ĥ. The following theorem shows one can always "convexify" a proper loss.

Theorem 5  Let ψ = −L̲′. Then for ĥ : 𝒳 → ℝ the composite risk L(η, ψ⁻¹(ĥ)) is convex in ĥ.

Buja et al. (2005) call this ψ the "canonical link". As shown in Section 5, our derivation of this result is a direct consequence of the integral representation of L̲ and its Legendre-Fenchel dual.
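As a sanity check of Theorem 2 (our illustration, not part of the paper), take log loss: ℓ(0, η̂) = −log(1 − η̂) and ℓ(1, η̂) = −log η̂, so L̲(η) = −η log η − (1 − η) log(1 − η) and the weight function is w(c) = −L̲″(c) = 1/(c(1 − c)). The snippet below numerically recovers the loss from the integral representation (5).

```python
import numpy as np
from scipy.integrate import quad

def log_loss(y, p):
    return -np.log(p) if y == 1 else -np.log(1.0 - p)

def weight(c):
    """w(c) = -Lbar''(c) for log loss."""
    return 1.0 / (c * (1.0 - c))

def cw_loss(c, y, p):
    """ell_c(y, p) as in (3)."""
    return c * float(p >= c) if y == 0 else (1.0 - c) * float(p < c)

p = 0.35
for y in (0, 1):
    # Integrate ell_c(y, p) w(c) over c in (0, 1); the indicator keeps the integrand
    # away from the singular endpoint, and points=[p] handles the jump at c = p.
    integral, _ = quad(lambda c: cw_loss(c, y, p) * weight(c), 0.0, 1.0, points=[p])
    print(y, integral, log_loss(y, p))   # the last two columns should agree
```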
2. Convexity and Taylor Expansions

In this paper we are primarily concerned with convex and concave functions defined on subsets of the real line. A central tool in their analysis is the integral form of their Taylor expansion. Here, φ′ and φ″ denote the first and second derivatives of φ respectively.

Theorem 6 (Taylor's Theorem)  Let [s₀, s] be a closed interval of ℝ and let φ be a real-valued function over [s₀, s]. Then

  φ(s) = φ(s₀) + φ′(s₀)(s − s₀) + ∫_{s₀}^{s} (s − t) φ″(t) dt.   (7)

The classical statement of Taylor's theorem requires φ to be twice differentiable; however, we will use an extension that allows for generalised functions, in a manner similar to Liese and Vajda (2006). For example, if φ(s) = max(s, 0), Taylor's theorem holds when φ′ is taken to be the unit step function φ′(s) = U(s) and φ″(s) to be δ(s).

The argument s in the above theorem can be awkward to work with as it appears in the limits of the integral. The following corollary removes this problem by replacing the integrand in (7) with a piece-wise linear function:

  φ_t(s, s₀) := (s − t) if s₀ < t ≤ s;  (t − s) if s < t ≤ s₀;  0 otherwise.   (8)

This is piece-wise linear and convex in s for each s₀, t ∈ [a, b].

Corollary 7 (Integral Representation)  Let φ : [a, b] → ℝ be a general function. Then, for all s, s₀ ∈ [a, b],

  φ(s) = φ(s₀) + φ′(s₀)(s − s₀) + ∫_a^b φ_t(s, s₀) φ″(t) dt.   (9)

This result is obtained immediately upon substituting φ_t into (7), since the conditions in (8) restrict the limits of integration to the interval (s₀, s) ∩ [a, b] or (s, s₀) ∩ [a, b] and reverse the sign of (s − t) when s < s₀.

Central to our analysis is the observation that the (generalised) second derivative of a convex φ is everywhere non-negative. This means the Taylor expansion in (9) can be seen as the sum of the linear component φ(s₀) + φ′(s₀)(s − s₀) and a weighted combination of the piece-wise linear terms φ_t. As we shall see, the weight function w(t) = φ″(t) that determines the contribution of each "primitive" φ_t characterises many of the important properties of the function φ.

We now show how Theorem 1 can be derived from the integral Taylor expansion of L̲. We believe the proof here to be more transparent than earlier proofs.

Proof (Theorem 1)  For the forward implication, assume ℓ is a proper loss. By definition, the point-wise Bayes risk L̲(η) = inf_{η̂} L(η, η̂) which, for each η ∈ [0, 1], is just the lower envelope of the lines η ↦ L(η, η̂) = (1 − η) ℓ(0, η̂) + η ℓ(1, η̂), and thus L̲ is concave.

The properness of ℓ means L̲(η) = L(η, η), and the derivative of η̂ ↦ L(η, η̂) is 0 when η̂ = η. Hence, writing ℓ′(y, ·) for the derivative of ℓ(y, ·),

  ∂L(η, η̂)/∂η̂ |_{η̂=η} = (1 − η) ℓ′(0, η) + η ℓ′(1, η) = 0   (10)

for all η ∈ [0, 1]. Expanding L̲′(η) using the chain rule gives

  L̲′(η) = (1 − η) ℓ′(0, η) − ℓ(0, η) + ℓ(1, η) + η ℓ′(1, η)
        = ℓ(1, η) − ℓ(0, η) + [η ℓ′(1, η) + (1 − η) ℓ′(0, η)]
        = ℓ(1, η) − ℓ(0, η),

where the bracketed terms sum to 0 by (10). Thus

  L̲(η̂) + (η − η̂) L̲′(η̂) = (1 − η̂) ℓ(0, η̂) + η̂ ℓ(1, η̂) + (η − η̂)[ℓ(1, η̂) − ℓ(0, η̂)]
    = (1 − η̂ − η + η̂) ℓ(0, η̂) + (η̂ + η − η̂) ℓ(1, η̂)
    = (1 − η) ℓ(0, η̂) + η ℓ(1, η̂),

which is the definition of L(η, η̂). The result holds at the endpoints by the assumptions in (1).

Conversely, now suppose Λ is a concave function and let ℓ(y, η̂) = Λ(η̂) + (y − η̂) Λ′(η̂). The Taylor expansion of Λ about η̂ is

  Λ(η) = Λ(η̂) + (η − η̂) Λ′(η̂) + ∫_{η̂}^{η} (η − c) Λ″(c) dc,

and so

  L(η, η̂) = Λ(η̂) + (η − η̂) Λ′(η̂) = Λ(η) − ∫_{η̂}^{η} (η − c) Λ″(c) dc ≥ Λ(η) = L̲(η),

because the concavity of Λ means Λ″ ≤ 0, and so the subtracted integral term is non-negative and is minimised to 0 when η̂ = η. This shows ℓ is proper, completing the proof. ∎
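A quick numerical illustration of Corollary 7 (ours, using an arbitrary smooth convex test function φ = exp, for which all derivatives coincide): the linear part plus the weighted combination of the primitives φ_t reproduces φ.

```python
import numpy as np
from scipy.integrate import quad

def phi_t(s, s0, t):
    """Piece-wise linear primitive phi_t(s, s0) from (8)."""
    if s0 < t <= s:
        return s - t
    if s < t <= s0:
        return t - s
    return 0.0

phi, dphi, ddphi = np.exp, np.exp, np.exp     # smooth convex test function on [0, 1]
a, b = 0.0, 1.0
for s0, s in [(0.2, 0.9), (0.8, 0.1)]:
    tail, _ = quad(lambda t: phi_t(s, s0, t) * ddphi(t), a, b, points=[s, s0])
    lhs = phi(s)
    rhs = phi(s0) + dphi(s0) * (s - s0) + tail
    print(lhs, rhs)    # should agree to quadrature precision, in both orderings of s, s0
```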
2.1. Regret and Bregman Divergence

Bregman divergences are a generalisation of the notion of distance between points. In this section we recount an observation by Buja et al. (2005) that the point-wise regret for a proper surrogate loss is a Bregman divergence. We begin with some definitions.³ Given a differentiable⁴ convex function φ : 𝒮 → ℝ defined on the convex set 𝒮 ⊆ ℝ^d and two points s, s₀ ∈ 𝒮, the Bregman divergence of s from s₀ is defined as

  B_φ(s, s₀) := φ(s) − φ(s₀) − ⟨s − s₀, ∇φ(s₀)⟩,   (11)

where ∇φ(s₀) is the gradient of φ at s₀.

³ Any terms related to convex analysis not explicitly defined can be found in (Hiriart-Urruty & Lemaréchal, 2001).
⁴ Technically, φ need only be differentiable on the relative interior ri(𝒮) of 𝒮. We omit this requirement for simplicity and because it is not relevant to this discussion.

A concise summary of many of the properties of Bregman divergences is given by Banerjee et al. (2005, Appendix A). In particular, Bregman divergences always satisfy B_φ(s, s₀) ≥ 0 and B_φ(s₀, s₀) = 0 for all s, s₀ ∈ 𝒮, regardless of the choice of φ. They are not always metrics, however, as they do not always satisfy the triangle inequality and their symmetry depends on the choice of φ.

When 𝒮 = [0, 1],⁵ the concavity of L̲ (see Theorem 1) means φ(s) = −L̲(s) is convex and so induces the Bregman divergence

  B_φ(s, s₀) = −L̲(s) + L̲(s₀) + (s − s₀) L̲′(s₀) = L(s, s₀) − L̲(s).

The converse also holds. Given a Bregman divergence B_φ over 𝒮 = [0, 1], the convexity of φ guarantees that L̲ = −φ is concave. Thus, we know that there is a proper loss with Bayes risk equal to −φ.

⁵ Technically, 𝒮 is the 2-simplex {(s₁, s₂) ∈ [0, 1]² : s₁ + s₂ = 1}, but we identify s ∈ [0, 1] with (s, 1 − s).

As noted by Buja et al. (2005, §19), the difference B(η, η̂) = L(η, η̂) − L̲(η) is also known as the point-wise regret of the estimate η̂ with respect to η. The corresponding (full) regret is the M-average point-wise regret

  𝔹(η, η̂, M) := 𝔼_{X∼M}[B(η(X), η̂(X))] = 𝕃(η, η̂) − 𝕃̲(η).

When 𝒮 = ℝ and φ is twice differentiable, comparing the definition of a Bregman divergence in (11) to the integral representation in (7) reveals that Bregman divergences between real numbers can be defined as the non-linear part, or "Tayl", of the Taylor expansion of φ. That is, for all s, s₀ ∈ ℝ,

  B_φ(s, s₀) = ∫_{s₀}^{s} (s − t) φ″(t) dt,   (12)

since ∇φ = φ′ and the inner product is simply multiplication over the reals.

From Theorem 1 we know that −L̲ is convex for any proper surrogate loss ℓ. Thus, setting φ = −L̲ in the definition of Bregman divergence gives us

  B_φ(η, η̂) = −L̲(η) + L̲(η̂) + (η − η̂) L̲′(η̂) = L(η, η̂) − L̲(η)   (13)

by (4) in Theorem 1, which is the point-wise regret for the loss ℓ.
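To illustrate (13) with a concrete case (our sketch, not from the paper): for log loss, φ = −L̲ is the negative Bernoulli entropy and B_φ(η, η̂) is the binary KL divergence, which coincides with the regret L(η, η̂) − L̲(η).

```python
import numpy as np

def log_loss_risk(eta, p):
    """Point-wise risk of log loss, L(eta, p) = -(1-eta) log(1-p) - eta log p."""
    return -(1.0 - eta) * np.log(1.0 - p) - eta * np.log(p)

def bayes_risk(eta):
    return log_loss_risk(eta, eta)

def bregman(eta, p):
    """B_phi(eta, p) with phi = -Lbar, i.e. phi(s) = s log s + (1-s) log(1-s)."""
    phi = lambda s: s * np.log(s) + (1.0 - s) * np.log(1.0 - s)
    dphi = lambda s: np.log(s) - np.log(1.0 - s)
    return phi(eta) - phi(p) - (eta - p) * dphi(p)

eta, p = 0.7, 0.4
print(bregman(eta, p), log_loss_risk(eta, p) - bayes_risk(eta))   # both equal the regret
```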
3. Weighted Integral Representations

We now consider a representation of proper losses in terms of primitive losses that originates with Shuford et al. (1966) and has been studied in some depth by Buja et al. (2005). It is also a special case of the recent integral representation obtained by Lambert et al. (2008), which generalises the earlier results to scoring rules for general properties of discrete distributions. Our contribution is to show how this representation is a direct consequence of the Taylor expansion of a proper loss's Bayes risk. This shows the elementary nature of the representation and highlights the importance of the class of cost-weighted misclassification losses (3). Intuitively, a cost-weighted loss turns an estimate η̂ ∈ [0, 1] into the classification ⟦η̂ ≥ c⟧ and assigns a cost if this disagrees with the true classification y. Remaining consistent with our nomenclature for general losses, we will use L_c and 𝕃_c to denote the point-wise and full risks associated with each cost-weighted loss ℓ_c.

The following lemma is needed for the proof of the main result.

Lemma 8  For each c ∈ (0, 1), ℓ_c is a proper loss and its Bayes risk L̲_c and regret B_c satisfy

  L̲_c(η) = ((1 − c)η) ∧ ((1 − η)c)   (14)
  B_c(η, η̂) = |η − c| ⟦(η < c ≤ η̂) ∨ (η̂ < c ≤ η)⟧   (15)

for η, η̂ ∈ [0, 1].

Proof  By definition, for each η ∈ [0, 1],

  L̲_c(η) = inf_{η̂∈[0,1]} [(1 − η) c ⟦η̂ ≥ c⟧ + η (1 − c) ⟦η̂ < c⟧]
         = inf_{η̂∈[0,1]} [η(1 − c) + (c − η) ⟦η̂ ≥ c⟧],

since ⟦η̂ < c⟧ = 1 − ⟦η̂ ≥ c⟧ and (1 − η)c − η(1 − c) = c − η. Since (c − η) is negative if and only if η > c, the infimum is obtained by having ⟦η̂ ≥ c⟧ = 1 if and only if η ≥ c, that is, by letting η̂ = η. In this case, when η̂ = η ≥ c we have L̲_c(η) = c(1 − η) = L_c(η, η̂), and when η̂ = η < c we have L̲_c(η) = η(1 − c) = L_c(η, η̂); combining the two cases gives (14) and shows ℓ_c is proper.

When η < c ≤ η̂ we see that η(1 − c) = η − ηc ≤ c − ηc = (1 − η)c and so L̲_c(η) = η(1 − c). Also, by definition, L_c(η, η̂) = (1 − η)c in this case, so B_c(η, η̂) = (1 − η)c − η(1 − c) = c − η = |η − c|. Similarly, when η̂ < c ≤ η, L̲_c(η) = (1 − η)c and L_c(η, η̂) = η(1 − c), so B_c(η, η̂) = η − c = |η − c|, proving the result. ∎

We are now able to give the proof of Theorem 2.

Proof (Theorem 2)  We first assume ℓ is a proper loss, so that L(η, η̂) = 𝔼_{Y∼η}[ℓ(Y, η̂)] and L̲(η) = L(η, η). Expanding L̲(η) about η̂ ∈ [0, 1] using Corollary 7 yields

  L̲(η) = L̲(η̂) + (η − η̂) L̲′(η̂) + ∫_0^1 φ_c(η, η̂) L̲″(c) dc
       = L(η, η̂) + ∫_0^1 φ_c(η, η̂) L̲″(c) dc   (16)

by Theorem 1. The generalised function w(c) = −L̲″(c) ≥ 0 by the concavity of L̲. Rearranging (16) gives

  L(η, η̂) = L̲(η) + ∫_0^1 φ_c(η, η̂) w(c) dc.

The definition of L in (2) implies L(y, η̂) = ℓ(y, η̂) for y ∈ {0, 1}, and so

  ℓ(y, η̂) = L̲(y) + ∫_0^1 φ_c(y, η̂) w(c) dc,   (17)

where φ_c(y, η̂) = ⟦η̂ < c ≤ y⟧(y − c) + ⟦y < c ≤ η̂⟧(c − y), which is equal to the definition of ℓ_c in (3) since the left (resp. right) term is only non-zero when y = 1 (resp. y = 0). Observe that L̲(0) = L̲(1) = 0, since L̲(0) = L(0, 0) = ℓ(0, 0) = 0 by assumption, and similarly for L̲(1). This shows that (17) is equivalent to (5), completing the forward direction of the theorem.

If we now assume the function w ≥ 0 is given and ℓ is defined as in (5), then it suffices to show L̲(η) = L(η, η). First note that

  L(η, η̂) = 𝔼_{Y∼η}[∫_0^1 ℓ_c(Y, η̂) w(c) dc] = ∫_0^1 L_c(η, η̂) w(c) dc.

Each of the L_c are proper by Lemma 8 and so are minimised when η̂ = η. Since w(c) ≥ 0, taking η̂ = η must also minimise L(η, ·), completing the proof. ∎

The linearity of expectation and of the regret provides weighted integral representations for several other quantities.

Corollary 9  Let ℓ be a proper surrogate loss and let L(η, η̂) be its point-wise risk, B(η, η̂) = L(η, η̂) − L̲(η) its point-wise regret, and w = −L̲″. Then for η, η̂ ∈ [0, 1]

  L(η, η̂) = ∫_0^1 L_c(η, η̂) w(c) dc   (18)
  B(η, η̂) = ∫_{η∧η̂}^{η∨η̂} |η − c| w(c) dc.   (19)

When η(x) is the probability that x ∈ 𝒳 is positive and η̂ : 𝒳 → [0, 1] is a predictor, the full risk and regret satisfy

  𝕃(η, η̂, M) = ∫_0^1 𝕃_c(η, η̂, M) w(c) dc,
  𝔹(η, η̂, M) = ∫_0^1 𝔹_c(η, η̂, M) w(c) dc.

Proof  Equation (18) is the result of taking expectations on both sides of (5). Equation (19) is obtained by integrating the expression for B_c in Lemma 8. The remaining two expressions are the result of applying 𝔼_{X∼M}[·] to the weighted integral representations for L and B. ∎
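As a worked check of (19) (ours, not in the paper): for square loss, L̲_sq(η) = η(1 − η), so w ≡ 2 and (19) gives B(η, η̂) = 2 ∫_{η∧η̂}^{η∨η̂} |η − c| dc = (η − η̂)², which is indeed L_sq(η, η̂) − L̲_sq(η). The snippet below repeats the check numerically for log loss, whose regret is the binary KL divergence.

```python
import numpy as np
from scipy.integrate import quad

def weight(c):
    """w(c) = -Lbar''(c) for log loss."""
    return 1.0 / (c * (1.0 - c))

def kl(eta, p):
    """Point-wise regret of log loss (binary KL divergence)."""
    return eta * np.log(eta / p) + (1 - eta) * np.log((1 - eta) / (1 - p))

eta, p = 0.8, 0.3
lo, hi = min(eta, p), max(eta, p)
regret_via_weights, _ = quad(lambda c: abs(eta - c) * weight(c), lo, hi)
print(regret_via_weights, kl(eta, p))     # the two values should match, confirming (19)
```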
4. Surrogate Regret Bounds

Proper losses for probability estimation and surrogate margin losses (Bartlett et al., 2006) for classification are closely related. Buja et al. (2005) note that "the surrogate criteria of classification are exactly the primary criteria of class probability estimation" and that most commonly used surrogate margin losses are just proper scoring rules mapped from [0, 1] to ℝ via a link function. The main exceptions are hinge losses,⁶ which means SVMs are "the only case that truly bypasses estimation of class probabilities and directly aims at classification" (Buja et al., 2005, pg. 4). However, commonly used margin losses of the form φ(y ĥ(x)) are a more restrictive class than proper losses since, as Buja et al. (2005, §23) note, "[t]his dependence on the margin limits all theory and practice to a symmetric treatment of class 0 and class 1".

⁶ And powers of absolute divergence |y − r|^α for α ≠ 2.

Suppose for some fixed c₀ ∈ (0, 1) that B_{c₀}(η, η̂) = α. What can be said about the value of the regret B(η, η̂) for an arbitrary proper surrogate loss ℓ? Theorem 3 provides an answer to this question in the form of a surrogate loss bound. Such bounds are of practical importance as the losses 𝕃_{c₀} are hard to optimise directly, and rearranging the bounds provides a guarantee that minimising the surrogate loss (and hence the surrogate regret) will minimise the 𝕃_{c₀} loss. We now provide an elementary proof of these bounds.

Proof (Theorem 3)  Let B be the conditional regret associated with some arbitrary proper loss ℓ and suppose we know that the cost-weighted regret B_{c₀}(η, η̂) = α. By Lemma 8, this implies that η = c₀ + α when η̂ ≤ c₀ < η, and η = c₀ − α when η ≤ c₀ < η̂. In the first case we have η̂ ≤ c₀ and η = c₀ + α, and so

  B(η, η̂) = B(c₀ + α, η̂)
          = ∫_{η̂}^{c₀+α} (c₀ + α − c) w(c) dc
          ≥ ∫_{c₀}^{c₀+α} (c₀ + α − c) w(c) dc

by (19) and the assumption that η̂ ≤ c₀. Note that this bound is achieved for η̂ = c₀ and so is tight. Thus, using w(c) = −L̲″(c) and integrating by parts gives

  B(η, η̂) ≥ −∫_{c₀}^{c₀+α} (c₀ + α − c) L̲″(c) dc
           = [−(c₀ + α − c) L̲′(c)]_{c₀}^{c₀+α} − ∫_{c₀}^{c₀+α} L̲′(c) dc
           = α L̲′(c₀) − L̲(c₀ + α) + L̲(c₀)
           = ψ(c₀, α),

as required. The proof of the second case, when η ≤ c₀ < η̂, proceeds identically and yields ψ(c₀, −α). ∎

Corollary 4 is obtained directly by substituting c₀ = ½ and noting that the symmetry of L̲ implies L̲′(½) = 0.

The bounds in Theorem 3 can be inverted so as to guarantee the minimisation of a cost-weighted loss via the minimisation of a surrogate loss.

Corollary 10  Minimising B(η, η̂) with respect to η̂ minimises B_c(η, η̂) for each c ∈ (0, 1).

Proof  To see this, let ψ′(c₀, α) := ∂ψ(c₀, α)/∂α = −L̲′(c₀ + α) + L̲′(c₀). Since L̲ is concave, L̲′ is non-increasing and hence L̲′(c₀ + α) ≤ L̲′(c₀), so ψ′(c₀, α) ≥ 0. Therefore α ↦ ψ(c₀, α) is non-decreasing and thus invertible (although there may be non-uniqueness at points where ψ(c₀, ·) is constant in α). This invertibility means minimising B(η, η̂) w.r.t. η̂ minimises the bound on B_c(η, η̂). ∎

This shows that proper surrogate losses are surrogates for the entire family of cost-sensitive losses ℓ_c, not just the 0-1 loss (i.e., the case where c = ½).

4.1. Related Work

Surrogate loss bounds have garnered increasing interest in the machine learning community (Zhang, 2004b; Bartlett et al., 2006; Steinwart, 2007). All of the recent work has been in terms of margin losses of the form

  L_φ(η, ĥ) = η φ(ĥ) + (1 − η) φ(−ĥ).

As Buja et al. (2005) discuss, such margin losses cannot capture the richness of all possible proper losses. Bartlett et al. (2006) prove that for any ĥ

  ψ̃(L_{0-1}(η, ĥ) − L̲_{0-1}(η)) ≤ L_φ(η, ĥ) − L̲_φ(η),

where ψ̃ = ψ** is the LF biconjugate of ψ,

  ψ(θ) = H⁻((1 + θ)/2) − H((1 + θ)/2),

H(η) = inf_α (η φ(α) + (1 − η) φ(−α)) = L̲_φ(η), and H⁻(η) = inf_{α : α(2η−1) ≤ 0} (η φ(α) + (1 − η) φ(−α)) is the optimal conditional risk under the constraint that the sign of the argument disagrees with 2η − 1. We will consider two examples and show that the bounds we obtain with the above result match those obtained with Theorem 3.

Exponential Loss  Consider the link ĥ = ψ(η̂) = ½ log(η̂/(1 − η̂)) with corresponding inverse link η̂ = 1/(1 + e^{−2ĥ}). Buja et al. (2005) showed that this link function combined with the exponential margin loss φ(υ) = e^{−υ} results in a proper loss with

  L(η, η̂) = η ((1 − η̂)/η̂)^{1/2} + (1 − η) (η̂/(1 − η̂))^{1/2}.

Hence L̲(η) = L(η, η) = 2√(η(1 − η)) and from Corollary 4 we obtain that if B_{1/2}(η, η̂) = α then

  B(η, η̂) ≥ 1 − √(1 − 4α²).

This matches the result presented by Bartlett et al. (2006) upon noting that B_{1/2}(η, η̂) measures the loss in terms of ½ ℓ_{0-1}: we used α where they used θ = 2α.

Truncated Quadratic Loss  Consider the margin loss φ(ĥ) = ((1 + ĥ) ∨ 0)² = (2η̂ ∨ 0)² with link function ĥ(η̂) = 2η̂ − 1. One can show that L̲(η) = 4η(1 − η), and from Corollary 4 the regret bound is B(η, η̂) ≥ 4α². This matches the result of Bartlett et al. (2006) when again it is noted that we used α and they used θ = 2α measured in terms of ℓ_{0-1}.
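The following sketch (ours, not from the paper) checks Corollary 4 numerically for the proper loss induced by the exponential margin loss: over a grid of (η, η̂) it computes the cost-weighted regret α at c₀ = ½ via Lemma 8 and confirms that the surrogate regret never falls below 1 − √(1 − 4α²).

```python
import numpy as np

def exp_proper_risk(eta, p):
    """Point-wise risk of the proper loss induced by exponential margin loss
    with the link h = 0.5*log(p/(1-p)) (cf. the Exponential Loss example)."""
    return eta * np.sqrt((1 - p) / p) + (1 - eta) * np.sqrt(p / (1 - p))

def bayes(eta):
    return 2 * np.sqrt(eta * (1 - eta))

def cw_half_regret(eta, p):
    """B_{1/2}(eta, p): regret of the cost-weighted loss at c0 = 1/2 (Lemma 8)."""
    crossed = (p < 0.5 <= eta) or (eta <= 0.5 < p)
    return abs(eta - 0.5) if crossed else 0.0

grid = np.linspace(0.01, 0.99, 99)
worst_slack = np.inf
for eta in grid:
    for p in grid:
        alpha = cw_half_regret(eta, p)
        bound = 1 - np.sqrt(1 - 4 * alpha ** 2)     # Corollary 4 for Lbar(eta)=2*sqrt(eta(1-eta))
        regret = exp_proper_risk(eta, p) - bayes(eta)
        worst_slack = min(worst_slack, regret - bound)
print(worst_slack)   # should be >= 0 (up to numerical noise), confirming the bound
```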
The Probing Reduction  The Probing reduction (Langford & Zadrozny, 2005) shows how the square loss regret for class probability estimation can be bounded by an average cost-weighted regret. The weighted integral representation allows us to obtain a similar result for the regret of the square loss ℓ_sq(y, η̂) := (y − η̂)². Specifically,

  𝔹_sq(η, η̂, M) = 𝔼_{X∼M}[(η(X) − η̂(X))²] = 2 ∫_0^1 𝔼_{X∼M}[B_c(η(X), η̂(X))] dc.   (20)

To see this, note that for square loss

  L_sq(η, η̂) = η(1 − η̂)² + (1 − η) η̂²,

and so L̲_sq(η) = L_sq(η, η) = η − η², which means that w(c) = −L̲_sq″(c) = 2. Now, the integral representation (19) means that the point-wise regret for square loss can be written

  B_sq(η, η̂) = 2 ∫_0^1 B_c(η, η̂) dc.

Taking expectations with respect to M on both sides gives the result in (20).

5. Convexity and Canonical Links

Often in estimating η one uses a parametric representation of η̂ : 𝒳 → [0, 1] which has a natural scale not matching [0, 1]. In such cases it is common to use a link function (McCullagh & Nelder, 1989). Traditionally one writes η̂ = ψ⁻¹(ĥ), where ψ⁻¹ is the "inverse link" (and ψ is of course the forward link). The function ĥ : 𝒳 → ℝ is the hypothesis. Often ĥ = ĥ_β is parametrised linearly in a parameter vector β. In such a situation it is computationally convenient if L(η, ψ⁻¹(ĥ)) is convex in ĥ (which implies it is convex in β when ĥ is linear in β). Theorem 5 shows that choosing ψ = −L̲′ guarantees convexity. Its proof is aided by a slight change of notation.

Consider the general representation of B(η, η̂) presented in (13). Set W̃ := φ = −L̲, W := W̃′ and w := W′. We consider w as a weight function⁷ since the convexity of W̃ implies W is monotone non-decreasing and thus w is non-negative. Rewriting (13) we have

  B_W(η, η̂) = W̃(η) − W̃(η̂) − (η − η̂) W(η̂),

where we stress that we have parametrised the Bregman divergence by the monotone function W, rather than by the convex function W̃ as is traditional. Similarly, denote by L_W the w-weighted conditional loss parametrised by W. We shall see below that our choice of parametrisation is felicitous.

⁷ This weight function exactly corresponds to the weight functions used by Buja et al. (2005).

When a function f is suitably smooth, the Legendre-Fenchel (LF) dual f* of f can be expressed in terms of its derivative and inverse. Furthermore, in this case (writing Df := f′) we have D(f*) = (Df)⁻¹. Thus, with w, W and W̃ defined as above,

  D(W̃) = W,  D(W̃*) = (D W̃)⁻¹ = W⁻¹.   (21)

The following theorem is known (Zhang, 2004a) but, as will be seen, stating it in terms of B_W provides some advantages.

Theorem 11  Let w, W, W̃ and B_W be as above. Then for all x, y ∈ [0, 1],

  B_W(x, y) = B_{W⁻¹}(W(y), W(x)).   (22)

Proof  Using the Legendre transform we have

  W̃*(u) = u · W⁻¹(u) − W̃(W⁻¹(u)), equivalently W̃(W⁻¹(u)) = u · W⁻¹(u) − W̃*(u).   (23)

Equivalently (using (21)),

  W̃*(W(u)) = u · W(u) − W̃(u).   (24)

Thus substituting and then using (23) we have

  B_W(x, W⁻¹(v)) = W̃(x) − W̃(W⁻¹(v)) − (x − W⁻¹(v)) · W(W⁻¹(v))
                = W̃(x) + W̃*(v) − v · W⁻¹(v) − (x − W⁻¹(v)) · v
                = W̃(x) + W̃*(v) − x · v.   (25)

Similarly (this time using (24)) we have

  B_{W⁻¹}(v, W(x)) = W̃*(v) − W̃*(W(x)) − (v − W(x)) · W⁻¹(W(x))
                  = W̃*(v) − x · W(x) + W̃(x) − v · x + x · W(x)
                  = W̃*(v) + W̃(x) − v · x.   (26)

Comparing (25) and (26) we see that B_W(x, W⁻¹(v)) = B_{W⁻¹}(v, W(x)). Letting y = W⁻¹(v), that is, substituting v = W(y), leads to (22). ∎

We now give the proof of Theorem 5, which provides a simple sufficient condition for the composite loss to be convex in ĥ. It was previously shown (with a more intricate proof) by Buja et al. (2005). The result also corresponds to the notion of "matching loss" (Helmbold et al., 1999).

Proof (Theorem 5)  If the link ψ = W = −L̲′ (and thus η̂ = W⁻¹(ĥ)), then B_W(η, η̂) = W̃(η) + W̃*(ĥ) − η · ĥ by (25), and so we have that

  L_W(η, η̂) = B_W(η, η̂) + L̲(η) = W̃(η) + W̃*(ĥ) − η · ĥ − W̃(η),

which is just W̃*(ĥ) − η · ĥ. Its convexity in ĥ follows from the fact that W̃* is convex (since it is the LF dual of the convex function W̃) and the overall expression is the sum of W̃*(ĥ) and a term linear in ĥ. ∎
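To illustrate Theorem 5 with a familiar case (our sketch, not part of the paper): for log loss the canonical link is ψ = −L̲′, the logit, and the composite risk equals W̃*(ĥ) − η ĥ with W̃*(ĥ) = log(1 + e^{ĥ}), i.e. the logistic loss, which is convex in ĥ. The code below checks the closed form and the convexity numerically.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def composite_log_risk(eta, h):
    """L(eta, psi^{-1}(h)) for log loss with the canonical link psi = -Lbar' = logit."""
    p = sigmoid(h)
    return -eta * np.log(p) - (1 - eta) * np.log(1 - p)

eta = 0.3
hs = np.linspace(-5, 5, 1001)
risk = composite_log_risk(eta, hs)

# Theorem 5: with the canonical link the composite risk is W~*(h) - eta*h,
# where here W~*(h) = log(1 + e^h); this form is manifestly convex in h.
closed_form = np.log1p(np.exp(hs)) - eta * hs
print(np.max(np.abs(risk - closed_form)))          # ~0: the two expressions agree
second_diff = np.diff(risk, 2)
print(second_diff.min() >= -1e-9)                  # convexity: non-negative second differences
```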
6. Conclusions

We have: 1) developed new explicit tight regret bounds for general proper losses (not just margin losses); 2) developed explicit formulae for losses and regrets in terms of weighted representations; and 3) simplified the proofs of some classical but little known results (Savage's expansion, the Shuford integral representation, and Buja's convexity of composite losses with canonical links). Importantly, all these results were derived using only elementary techniques, the most fundamental being Taylor's theorem. The elementary nature of our presentation highlights the conditional Bayes risk of a proper loss as a key object in learning theory and demonstrates the value of its integral representation. Further uses of this representation are given by Reid and Williamson (2009).

7. Acknowledgements

This work was supported by the Australian Research Council. RW is also supported by NICTA, which is funded by the Australian Government through Backing Australia's Ability. We thank the anonymous reviewers for their attention to detail and considered suggestions.

References

Banerjee, A., Merugu, S., Dhillon, I., & Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6, 1705-1749.

Bartlett, P., Jordan, M., & McAuliffe, J. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101, 138-156.

Beygelzimer, A., Langford, J., & Zadrozny, B. (2008). Machine learning techniques -- reductions between prediction quality metrics. Preprint.

Buja, A., Stuetzle, W., & Shen, Y. (2005). Loss functions for binary class probability estimation and classification: Structure and applications (Technical Report). University of Pennsylvania.

Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359-378.

Helmbold, D., Kivinen, J., & Warmuth, M. (1999). Relative loss bounds for single neurons. IEEE Transactions on Neural Networks, 10, 1291-1304.

Hiriart-Urruty, J.-B., & Lemaréchal, C. (2001). Fundamentals of convex analysis. Berlin: Springer.

Lambert, N., Pennock, D., & Shoham, Y. (2008). Eliciting properties of probability distributions. Proceedings of the ACM Conference on Electronic Commerce (pp. 129-138).

Langford, J., & Zadrozny, B. (2005). Estimating class membership probabilities using classifier learners. Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS'05).

Liese, F., & Vajda, I. (2006). On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52, 4394-4412.

McCullagh, P., & Nelder, J. (1989). Generalized linear models. Chapman & Hall/CRC.

Reid, M. D., & Williamson, R. C. (2009). Information, divergence and risk for binary experiments. arXiv preprint arXiv:0901.0356v1, 89 pages.

Savage, L. J. (1971). Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66, 783-801.

Schervish, M. (1989). A general method for comparing probability assessors. The Annals of Statistics, 17, 1856-1879.

Shuford, E., Albert, A., & Massengill, H. (1966). Admissible probability measurement procedures. Psychometrika, 31, 125-145.

Steinwart, I. (2007). How to compare different loss functions and their risks. Constructive Approximation, 26, 225-287.

Zhang, J. (2004a). Divergence function, duality, and convex analysis. Neural Computation, 16, 159-195.

Zhang, T. (2004b). Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32, 56-134.