How SVMs can estimate quantiles and the median

Ingo Steinwart
Information Sciences Group CCS-3
Los Alamos National Laboratory
Los Alamos, NM 87545, USA
ingo@lanl.gov

Andreas Christmann
Department of Mathematics
Vrije Universiteit Brussel
B-1050 Brussels, Belgium
andreas.christmann@vub.ac.be

Abstract

We investigate kernel-based quantile regression based on the pinball loss and support vector regression based on the $\epsilon$-insensitive loss. Conditions are given which guarantee that the set of exact minimizers contains only one function. Some results on oracle inequalities and learning rates for these methods are also presented.

1 Introduction

Let $P$ be a distribution on $X \times Y$, where $X$ is an arbitrary set and $Y \subset \mathbb{R}$ is closed. The goal of quantile regression is to estimate the conditional quantile, i.e., the set-valued function
$$F^*_{\tau,P}(x) := \bigl\{\, t \in \mathbb{R} : P\bigl((-\infty, t]\,\big|\,x\bigr) \ge \tau \ \text{ and } \ P\bigl([t, \infty)\,\big|\,x\bigr) \ge 1 - \tau \,\bigr\}, \qquad x \in X,$$
where $\tau \in (0,1)$ is a fixed constant and $P(\,\cdot\,|x)$, $x \in X$, is the (regular) conditional probability. For conceptual simplicity (though mathematically this is not necessary) we assume throughout this paper that $F^*_{\tau,P}(x)$ consists of singletons, i.e., there exists a function $f^*_{\tau,P} : X \to \mathbb{R}$, called the conditional $\tau$-quantile function, such that $F^*_{\tau,P}(x) = \{f^*_{\tau,P}(x)\}$, $x \in X$.

Let us now consider the so-called $\tau$-pinball loss $L_\tau : \mathbb{R} \times \mathbb{R} \to [0,\infty)$ defined by $L_\tau(y,t) := \psi_\tau(y - t)$, where $\psi_\tau(r) = (\tau - 1)r$ if $r < 0$, and $\psi_\tau(r) = \tau r$ if $r \ge 0$. Moreover, given a (measurable) function $f : X \to \mathbb{R}$ we define the $L_\tau$-risk of $f$ by $\mathcal{R}_{L_\tau,P}(f) := \int_{X \times Y} L_\tau(y, f(x))\, dP(x,y)$. Now recall that $f^*_{\tau,P}$ minimizes the $L_\tau$-risk, i.e., $\mathcal{R}_{L_\tau,P}(f^*_{\tau,P}) = \inf \mathcal{R}_{L_\tau,P}(f) =: \mathcal{R}^*_{L_\tau,P}$, where the infimum is taken over all measurable functions $f : X \to \mathbb{R}$. Based on this observation, several estimators minimizing a (modified) empirical $L_\tau$-risk were proposed (see [5] for a survey on both parametric and non-parametric methods) for situations where $P$ is unknown but i.i.d. samples $D := ((x_1, y_1), \ldots, (x_n, y_n))$ drawn from $P$ are given. In particular, [6, 4, 10] proposed a support vector machine approach that finds a solution $f_{D,\lambda} \in H$ of
$$\arg\min_{f \in H}\; \lambda \|f\|_H^2 + \frac{1}{n} \sum_{i=1}^n L_\tau\bigl(y_i, f(x_i)\bigr), \tag{1}$$
where $\lambda > 0$ is a regularization parameter and $H$ is a reproducing kernel Hilbert space (RKHS) over $X$. Note that this optimization problem can be solved by considering the dual problem [4, 10], but since this technique is nowadays standard in machine learning we omit the details. Moreover, [10] contains an exhaustive empirical study as well as some theoretical considerations.

Empirical methods estimating quantiles with the help of the pinball loss typically obtain functions $f_D$ for which $\mathcal{R}_{L_\tau,P}(f_D)$ is close to $\mathcal{R}^*_{L_\tau,P}$ with high probability. However, in general this only implies that $f_D$ is close to $f^*_{\tau,P}$ in a very weak sense (see [7, Remark 3.18]), and hence there is so far only little justification for using $f_D$ as an estimate of the quantile function. Our goal is to address this issue by showing that under certain realistic assumptions on $P$ we have an inequality of the form
$$\|f - f^*_{\tau,P}\|_{L_1(P_X)} \;\le\; c_P\, \sqrt{\mathcal{R}_{L_\tau,P}(f) - \mathcal{R}^*_{L_\tau,P}}\,. \tag{2}$$
We then use this inequality to establish an oracle inequality for SVMs defined by (1). In addition, we illustrate how this oracle inequality can be used to obtain learning rates and to justify a data-dependent method for finding the hyper-parameters $\lambda$ and $H$. Finally, we generalize the methods for establishing (2) to investigate the role of $\epsilon$ in the $\epsilon$-insensitive loss used in standard SVM regression.
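To make the estimator (1) concrete, the following Python sketch fits a kernel quantile regressor by plain subgradient descent on the finite expansion $f = \sum_i \alpha_i k(x_i, \cdot)$ justified by the representer theorem. This is only a minimal illustration under assumptions of our own: the papers cited above solve (1) via its dual problem instead, and the Gaussian kernel, the toy data, and all step-size and iteration choices below are arbitrary.

```python
import numpy as np

def pinball_loss(y, t, tau):
    """tau-pinball loss L_tau(y, t) = psi_tau(y - t)."""
    r = y - t
    return np.where(r >= 0, tau * r, (tau - 1) * r)

def gaussian_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def svm_quantile_fit(X, y, tau, lam, gamma=1.0, steps=5000, lr=0.05):
    """Approximately solve (1) over f = sum_i alpha_i k(x_i, .)
    with plain subgradient descent (a sketch, not the dual solver)."""
    n = len(y)
    K = gaussian_kernel(X, X, gamma)
    alpha = np.zeros(n)
    for _ in range(steps):
        residual = y - K @ alpha
        # subgradient of the pinball loss with respect to f(x_i)
        g = np.where(residual >= 0, -tau, 1.0 - tau)
        # gradient of lam * alpha^T K alpha + (1/n) sum_i L_tau(y_i, (K alpha)_i)
        grad = 2.0 * lam * (K @ alpha) + (K @ g) / n
        alpha -= lr * grad
    return alpha

# usage on a toy heteroscedastic sample
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + 0.3 * (1.0 + X[:, 0] ** 2) * rng.normal(size=200)
alpha = svm_quantile_fit(X, y, tau=0.9, lam=1e-4)
f_hat = gaussian_kernel(X, X) @ alpha
print(np.mean(y <= f_hat))   # empirical coverage, roughly 0.9
```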
2 Main results

In the following, $X$ is an arbitrary, non-empty set equipped with a $\sigma$-algebra, and $Y \subset \mathbb{R}$ is a closed non-empty set. Given a distribution $P$ on $X \times Y$ we further assume throughout this paper that the $\sigma$-algebra on $X$ is complete with respect to the marginal distribution $P_X$ of $P$, i.e., every subset of a $P_X$-zero set is contained in the $\sigma$-algebra. Since the latter can always be ensured by increasing the original $\sigma$-algebra in a suitable manner, we note that this is not a restriction at all.

Definition 2.1 A distribution $Q$ on $\mathbb{R}$ is said to have a $\tau$-quantile of type $\alpha > 0$ if there exist a $\tau$-quantile $t^* \in \mathbb{R}$ and a constant $c_Q > 0$ such that for all $s \in [0, \alpha]$ we have
$$Q\bigl((t^*, t^* + s)\bigr) \ge c_Q\, s \qquad \text{and} \qquad Q\bigl((t^* - s, t^*)\bigr) \ge c_Q\, s. \tag{3}$$

It is not difficult to see that a distribution $Q$ having a $\tau$-quantile of some type $\alpha$ has a unique $\tau$-quantile $t^*$. Moreover, distributions $Q$ having a Lebesgue density $h_Q$ have a $\tau$-quantile of type $\alpha$ if $h_Q$ is bounded away from zero on $[t^* - \alpha, t^* + \alpha]$. In this case we can use $c_Q := \inf\{h_Q(t) : t \in [t^* - \alpha, t^* + \alpha]\}$ in (3). These assumptions are general enough to cover many distributions used in parametric statistics. Examples are Gaussian, Student's $t$, and logistic distributions (with $Y = \mathbb{R}$), Gamma and log-normal distributions (with $Y = [0,\infty)$), and uniform and Beta distributions (with $Y = [0,1]$).

The following definition describes distributions on $X \times Y$ whose conditional distributions $P(\,\cdot\,|x)$, $x \in X$, have the same $\tau$-quantile type $\alpha$.

Definition 2.2 Let $p \in (0,\infty]$, $\tau \in (0,1)$, and $\alpha > 0$ be real numbers. A distribution $P$ on $X \times Y$ is said to have a $\tau$-quantile of $p$-average type $\alpha$ if $P(\,\cdot\,|x)$ is $P_X$-almost surely of $\tau$-quantile type $\alpha$, and the function $b : X \to (0,\infty)$ defined by $b(x) := c_{P(\cdot|x)}$, where $c_{P(\cdot|x)}$ is the constant in (3), satisfies $b^{-1} \in L_p(P_X)$.

Now we give some examples of distributions having $\tau$-quantiles of $p$-average type $\alpha$.

Example 2.3 Let $P$ be a distribution on $X \times \mathbb{R}$ with marginal distribution $P_X$ and regular conditional probability $Q_x\bigl((-\infty, y]\bigr) = 1/(1 + e^{-z})$, $y \in \mathbb{R}$, where $z := (y - m(x))/\sigma(x)$, $m : X \to \mathbb{R}$ describes a location shift, and $\sigma : X \to [\delta, 1/\delta]$ describes a scale modification for some constant $\delta \in (0,1]$. Let us further assume that the functions $m$ and $\sigma$ are continuous. Thus $Q_x$ is a logistic distribution having a positive and bounded Lebesgue density $h_{Q_x}(y) = e^{-z}/\bigl(\sigma(x)(1 + e^{-z})^2\bigr)$, $y \in \mathbb{R}$. The $\tau$-quantile function is $t^*(x) := f^*_{\tau,Q_x} = m(x) + \sigma(x)\log\bigl(\tfrac{\tau}{1-\tau}\bigr)$, $x \in X$, and we can choose $c_{Q_x} = \inf\{h_{Q_x}(t) : t \in [t^*(x) - \alpha, t^*(x) + \alpha]\}$. Note that $h_{Q_x}(m(x) + y) = h_{Q_x}(m(x) - y)$ for all $y \in \mathbb{R}$, and that $h_{Q_x}(y)$ is strictly decreasing for $y \in [m(x), \infty)$. Some algebra gives
$$c_{Q_x} = \min\bigl\{ h_{Q_x}(t^*(x) - \alpha),\, h_{Q_x}(t^*(x) + \alpha) \bigr\} = \frac{1}{\sigma(x)} \min\Bigl\{ \frac{u_1(x)}{(1 + u_1(x))^2},\, \frac{u_2(x)}{(1 + u_2(x))^2} \Bigr\} \ge c_{\alpha,\delta},$$
where $u_1(x) := \tfrac{\tau}{1-\tau} e^{-\alpha/\sigma(x)}$, $u_2(x) := \tfrac{\tau}{1-\tau} e^{\alpha/\sigma(x)}$, and $c_{\alpha,\delta} > 0$ can be chosen independent of $x$ because $\sigma(x) \in [\delta, 1/\delta]$. Hence $b^{-1} \in L_\infty(P_X)$ and $P$ has a $\tau$-quantile of $\infty$-average type $\alpha$. A numerical check of this example is sketched below.
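The type condition (3) is easy to verify numerically. The following sketch checks it for the special case of Example 2.3 with $m(x) = 0$ and $\sigma(x) = 1$, i.e., the standard logistic distribution; the parameter values $\tau = 0.9$ and $\alpha = 0.5$ and all identifiers are our own choices.

```python
import numpy as np

# CDF of the standard logistic distribution, Q((-inf, y]) = 1 / (1 + e^{-y})
def Q_cdf(y):
    return 1.0 / (1.0 + np.exp(-y))

tau, alpha = 0.9, 0.5
t_star = np.log(tau / (1 - tau))                 # tau-quantile of the logistic
h = lambda y: np.exp(-y) / (1 + np.exp(-y)) ** 2  # Lebesgue density
# density minimum on [t* - alpha, t* + alpha]; the density is monotone there
c_Q = min(h(t_star - alpha), h(t_star + alpha))

s = np.linspace(1e-6, alpha, 200)
assert np.all(Q_cdf(t_star + s) - Q_cdf(t_star) >= c_Q * s)  # Q((t*, t*+s)) >= c_Q s
assert np.all(Q_cdf(t_star) - Q_cdf(t_star - s) >= c_Q * s)  # Q((t*-s, t*)) >= c_Q s
print("type-alpha condition (3) holds with c_Q =", c_Q)
```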
Example 2.4 Let $\tilde P$ be a distribution on $X \times Y$ with marginal distribution $P_X$ and regular conditional probability $\tilde Q_x := \tilde P(\,\cdot\,|x)$ on $Y$. Further assume that $\tilde Q_x$ is $P_X$-almost surely of $\tau$-quantile type $\tilde\alpha > 0$ for some $\tau \in (0,1)$. Let us now define a distribution $P$ on $X \times \mathbb{R}$ with marginal distribution $P_X$ and regular conditional distributions $Q_x(A) := \tilde Q_x\bigl((A - m(x))/\sigma(x)\bigr)$ for all measurable $A \subset \mathbb{R}$, $x \in X$, where $m : X \to \mathbb{R}$ describes a location shift and $\sigma : X \to (\delta, 1/\delta)$ describes a scale modification for some constant $\delta \in (0,1]$. Let us further assume that the functions $m$ and $\sigma$ are continuous. Then $Q_x$ has, by construction of $P$, a $\tau$-quantile of type $\alpha := \delta\tilde\alpha > 0$ given by $t^*_x := m(x) + \sigma(x) f^*_{\tau,\tilde Q_x}$, because for $s \in [0,\alpha]$ we obtain the inequality
$$Q_x\bigl((t^*_x,\, t^*_x + s)\bigr) = \tilde Q_x\Bigl(\bigl(f^*_{\tau,\tilde Q_x},\, f^*_{\tau,\tilde Q_x} + \tfrac{s}{\sigma(x)}\bigr)\Bigr) \ge c_{\tilde Q_x}\,\tfrac{s}{\sigma(x)} \ge \delta\, c_{\tilde Q_x}\, s$$
and an analogous inequality for the other probability in (3). Note that the resulting function $b$ is bounded away from zero whenever the corresponding function $\tilde b(x) := c_{\tilde Q_x}$ of $\tilde P$ is.

The following theorem shows that for distributions having a $\tau$-quantile of $p$-average type, the conditional quantile function is estimated by every function that approximately minimizes the pinball risk.

Theorem 2.5 Let $p \in (0,\infty]$, $\tau \in (0,1)$, and $\alpha > 0$ be real numbers. Moreover, let $P$ be a distribution on $X \times Y$ that has a $\tau$-quantile of $p$-average type $\alpha$, and let $L$ be the pinball loss with parameter $\tau$. Then for all $f : X \to \mathbb{R}$ satisfying $\mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} \le 2^{-\frac{p+2}{p+1}}\, \alpha^{\frac{2p}{p+1}}$ we have
$$\|f - f^*_{\tau,P}\|_{L_q(P_X)} \;\le\; 2^{1/2}\, \|b^{-1}\|^{1/2}_{L_p(P_X)}\, \sqrt{\mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P}}\,,$$
where $f^*_{\tau,P} : X \to \mathbb{R}$ is the conditional $\tau$-quantile function of $P$, $b$ is the function from Definition 2.2, and $q := \frac{p}{p+1}$.

Our next goal is to establish an oracle inequality for SVMs defined by (1). To this end let us assume $Y = [-1,1]$. Then we have $L_\tau(y, \bar t\,) \le L_\tau(y, t)$ for all $y \in Y$, $t \in \mathbb{R}$, where $\bar t$ denotes $t$ clipped to the interval $[-1,1]$, i.e., $\bar t := \max\{-1, \min\{1, t\}\}$. Since this yields $\mathcal{R}_{L_\tau,P}(\bar f\,) \le \mathcal{R}_{L_\tau,P}(f)$ for all functions $f : X \to \mathbb{R}$, we will focus on clipped functions $\bar f$ in the following; in particular, $\bar f_{T,\lambda}$ denotes the clipped version of a solution $f_{T,\lambda}$ of (1) for the training set $T$. In order to describe the approximation error of SVMs we need the approximation error function
$$a(\lambda) := \inf_{f \in H}\; \lambda \|f\|_H^2 + \mathcal{R}_{L_\tau,P}(f) - \mathcal{R}^*_{L_\tau,P}, \qquad \lambda > 0.$$
Recall that [8] showed that $\lim_{\lambda \to 0} a(\lambda) = 0$ if $H$ is sufficiently rich, i.e., dense in $L_1(P_X)$. We also need the covering numbers
$$\mathcal{N}\bigl(B_H, \epsilon, L_2(\mu)\bigr) := \min\Bigl\{ n \ge 1 : \exists\, x_1, \ldots, x_n \in L_2(\mu) \text{ with } B_H \subset \bigcup_{i=1}^n \bigl(x_i + \epsilon B_{L_2(\mu)}\bigr) \Bigr\}, \qquad \epsilon > 0,$$
where $\mu$ is an arbitrary probability measure on $X$, and $B_H$ and $B_{L_2(\mu)}$ denote the closed unit balls of the RKHS $H$ and the Hilbert space $L_2(\mu)$, respectively. Given a finite sequence $T = \bigl((x_1, y_1), \ldots, (x_n, y_n)\bigr) \in (X \times Y)^n$ we write $T_X := (x_1, \ldots, x_n)$, and we write $\mathcal{N}(B_H, \epsilon, L_2(T_X)) := \mathcal{N}(B_H, \epsilon, L_2(\mu))$ for the empirical measure $\mu$ defined by $T_X$. Finally, we often write $L \circ f$ for the function $(x,y) \mapsto L(x, y, f(x))$.

Theorem 2.6 Let $P$ be a distribution on $X \times [-1,1]$ for which there exist constants $v \ge 1$ and $\vartheta \in [0,1]$ such that
$$\mathbb{E}_P\bigl(L \circ \bar f - L \circ f^*_{\tau,P}\bigr)^2 \;\le\; v\, \Bigl(\mathbb{E}_P\bigl(L \circ \bar f - L \circ f^*_{\tau,P}\bigr)\Bigr)^{\vartheta} \tag{4}$$
for all $f : X \to \mathbb{R}$. Moreover, let $H$ be a RKHS over $X$ for which there exist constants $p \in (0,1)$ and $a \ge 1$ such that
$$\sup_{T \in (X \times Y)^n} \log \mathcal{N}\bigl(B_H, \epsilon, L_2(T_X)\bigr) \;\le\; a\, \epsilon^{-2p}, \qquad \epsilon > 0. \tag{5}$$
Then there exists a constant $K_{p,v}$ depending only on $p$ and $v$ such that for all $\delta \ge 1$ and $\lambda > 0$ we have with probability not less than $1 - 3e^{-\delta}$ that
$$\mathcal{R}_{L,P}\bigl(\bar f_{T,\lambda}\bigr) - \mathcal{R}^*_{L,P} \;\le\; a(\lambda) + K_{p,v} \Bigl(\frac{a^{2p}}{\lambda^p n}\Bigr)^{\frac{1}{2-\vartheta+p(\vartheta-1)}} + K_{p,v}\, \frac{a^{2p}}{\lambda^p n} + 5\Bigl(\frac{32 v \delta}{n}\Bigr)^{\frac{1}{2-\vartheta}} + \frac{145\,\delta}{n}.$$

[9] showed that oracle inequalities of the above type can be used to establish learning rates and to investigate simple data-dependent parameter selection strategies. For example, if we assume that there exist constants $c > 0$ and $\beta \in (0,1]$ such that $a(\lambda) \le c\lambda^\beta$ for all $\lambda > 0$, then $\mathcal{R}_{L,P}(\bar f_{T,\lambda_n})$ converges to $\mathcal{R}^*_{L,P}$ with rate $n^{-\gamma}$, where $\gamma := \min\bigl\{ \frac{\beta}{(2-\vartheta+p(\vartheta-1))\beta + p},\; \frac{\beta}{\beta+1} \bigr\}$ and $\lambda_n := n^{-\gamma/\beta}$. Moreover, [9] shows that this rate can also be achieved by selecting $\lambda$ in a data-dependent way with the help of a validation set; a minimal version of such a strategy is sketched below.
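The following sketch shows one way such a validation-based choice of $\lambda$ could look. It is our own minimal rendering, not the exact procedure analyzed in [9]: `fit` and `predict` are placeholders for any solver of (1) (e.g., the `svm_quantile_fit` sketch of Section 1), and the candidate grid for $\lambda$ is arbitrary.

```python
import numpy as np

def pinball_loss(y, t, tau):
    r = y - t
    return np.where(r >= 0, tau * r, (tau - 1) * r)

def clip(t):
    # For Y = [-1, 1] the clipped function never has a larger
    # pinball risk than the unclipped one, so we validate on it.
    return np.clip(t, -1.0, 1.0)

def select_lambda(fit, predict, train, val, tau, lambdas):
    """Choose lambda by the empirical pinball risk of the clipped
    predictor on a held-out validation set."""
    (X_tr, y_tr), (X_va, y_va) = train, val
    risks = []
    for lam in lambdas:
        model = fit(X_tr, y_tr, tau, lam)
        risks.append(pinball_loss(y_va, clip(predict(model, X_va)), tau).mean())
    best = int(np.argmin(risks))
    return lambdas[best], risks[best]

# e.g. select_lambda(fit, predict, (X_tr, y_tr), (X_va, y_va),
#                    tau=0.5, lambdas=[10.0 ** -k for k in range(1, 7)])
```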
Let us now consider how these learning rates in terms of risks translate into rates for $\|\bar f_{T,\lambda} - f^*_{\tau,P}\|_{L_q(P_X)}$. To this end we assume that $P$ has a $\tau$-quantile of $p$-average type $\alpha$ for some $\tau \in (0,1)$. Using the Lipschitz continuity of $L$ and Theorem 2.5 we then obtain
$$\mathbb{E}_P\bigl(L \circ \bar f - L \circ f^*_{\tau,P}\bigr)^2 \le \mathbb{E}_P\bigl|\bar f - f^*_{\tau,P}\bigr|^2 \le \bigl\|\bar f - f^*_{\tau,P}\bigr\|_\infty^{2-q}\, \mathbb{E}_P\bigl|\bar f - f^*_{\tau,P}\bigr|^q \le c\,\bigl(\mathcal{R}_{L,P}(\bar f\,) - \mathcal{R}^*_{L,P}\bigr)^{q/2}$$
for all $\bar f$ satisfying $\mathcal{R}_{L,P}(\bar f\,) - \mathcal{R}^*_{L,P} \le 2^{-\frac{p+2}{p+1}} \alpha^{\frac{2p}{p+1}}$. In other words, we have a variance bound (4) with $\vartheta := q/2$. Arguing carefully to handle the assumption $\mathcal{R}_{L,P}(\bar f\,) - \mathcal{R}^*_{L,P} \le 2^{-\frac{p+2}{p+1}} \alpha^{\frac{2p}{p+1}}$, we then see that $\|\bar f_{T,\lambda} - f^*_{\tau,P}\|_{L_q(P_X)}$ can converge as fast as $n^{-\bar\gamma}$, where $\bar\gamma := \min\bigl\{ \frac{\beta}{(4-q+p(q-2))\beta + 2p},\; \frac{\beta}{\beta+1} \bigr\}$.

To illustrate the latter, let us assume that $H$ is a Sobolev space $W^m(X)$ of order $m \in \mathbb{N}$ over $X$, where $X$ is the unit ball in $\mathbb{R}^d$. Recall from [3] that $H$ satisfies (5) for $p := d/(2m)$ if $m > d/2$, and in this case $H$ also consists of continuous functions. Further assume that we are in the ideal situation $f^*_{\tau,P} \in W^m(X)$, which implies $\beta = 1$. Then the learning rate for $\|\bar f_{T,\lambda} - f^*_{\tau,P}\|_{L_q(P_X)}$ becomes $n^{-1/(4-q(1-p))}$, which for $\infty$-average type distributions reduces to $n^{-2m/(6m+d)} \ge n^{-1/3}$.

Let us now consider the well-known $\epsilon$-insensitive loss, defined by $L(y,t) := \max\{0, |y - t| - \epsilon\}$ for $y, t \in \mathbb{R}$, where $\epsilon \ge 0$.

Theorem 2.7 Let $P$ be a distribution on $X \times \mathbb{R}$ which has a unique median, i.e., a unique $1/2$-quantile function $f^*_{1/2,P}$. Further assume that all conditional distributions $P(\,\cdot\,|x)$, $x \in X$, are atom-free and symmetric, i.e., $P(h(x) + A \,|\, x) = P(h(x) - A \,|\, x)$ for all $x \in X$, all measurable $A \subset \mathbb{R}$, and a suitable function $h : X \to \mathbb{R}$. If for an $\epsilon > 0$ the conditional distributions assign positive mass to every neighborhood of $f^*_{1/2,P}(x) \pm \epsilon$, then $f^*_{1/2,P}$ is the only minimizer of $\mathcal{R}_{L,P}(\cdot)$, where $L$ is the $\epsilon$-insensitive loss.

Note that using [7] one can show that for distributions specified in the above theorem the SVM using the $\epsilon$-insensitive loss approximates $f^*_{1/2,P}$ whenever the SVM is $\mathcal{R}_{L,P}$-consistent, i.e., $\mathcal{R}_{L,P}(f_{T,\lambda}) \to \mathcal{R}^*_{L,P}$ in probability; see [2]. More advanced results in the sense of Theorem 2.5 also seem possible, but are outside the scope of this paper.

3 Proofs

Let us first recall some notions from [7], which investigated surrogate loss functions in general and, in particular, the question of how approximate risk minimizers approximate the exact risk minimizing functions. To this end let $L : X \times Y \times \mathbb{R} \to [0,\infty)$ be a measurable function, which we call a loss in the following. For a distribution $P$ and a function $f : X \to \mathbb{R}$ the $L$-risk is then defined by $\mathcal{R}_{L,P}(f) := \int_{X \times Y} L(x, y, f(x))\, dP(x,y)$, and, as usual, the minimal $L$-risk, or $L$-Bayes risk, is denoted by $\mathcal{R}^*_{L,P} := \inf \mathcal{R}_{L,P}(f)$, where the infimum is taken over all (measurable) functions $f : X \to \mathbb{R}$. In addition, given a distribution $Q$ on $Y$, the inner $L$-risk of $Q$ was defined in [7] by $\mathcal{C}_{L,Q,x}(t) := \int_Y L(x, y, t)\, dQ(y)$, $x \in X$, $t \in \mathbb{R}$, and the minimal inner $L$-risks are denoted by $\mathcal{C}^*_{L,Q,x} := \inf_{t \in \mathbb{R}} \mathcal{C}_{L,Q,x}(t)$, $x \in X$. Moreover, following [7] we usually omit the indexes $x$ or $Q$ if $L$ happens to be independent of $x$ or $y$, respectively. Now note that we immediately obtain
$$\mathcal{R}_{L,P}(f) = \int_X \mathcal{C}_{L,P(\cdot|x),x}\bigl(f(x)\bigr)\, dP_X(x), \tag{6}$$
and [7, Thm. 3.2] further shows that $x \mapsto \mathcal{C}^*_{L,P(\cdot|x),x}$ is measurable if the $\sigma$-algebra on $X$ is complete. Furthermore, in this case it was shown that the intuitive formula $\mathcal{R}^*_{L,P} = \int_X \mathcal{C}^*_{L,P(\cdot|x),x}\, dP_X(x)$ holds, i.e., the Bayes $L$-risk is obtained by minimizing the inner risks and subsequently integrating with respect to the marginal distribution $P_X$. Based on this observation, the basic idea in [7] is to consider both steps separately. In particular, it turned out that the sets of $\epsilon$-approximate minimizers
$$\mathcal{M}_{L,Q,x}(\epsilon) := \bigl\{ t \in \mathbb{R} : \mathcal{C}_{L,Q,x}(t) < \mathcal{C}^*_{L,Q,x} + \epsilon \bigr\}, \qquad \epsilon \in [0,\infty],$$
and the set of exact minimizers $\mathcal{M}_{L,Q,x}(0^+) := \bigcap_{\epsilon > 0} \mathcal{M}_{L,Q,x}(\epsilon)$ play a crucial role. Again, we omit the subscripts $x$ and $Q$ in these definitions if $L$ happens to be independent of $x$ or $y$, respectively.
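Before turning to calibration functions, the following sketch illustrates inner risks numerically: it evaluates the inner pinball risk $\mathcal{C}_{L_\tau,Q}(t)$ of an empirical distribution $Q$ on a grid and confirms that the minimizers concentrate around the $\tau$-quantile of $Q$, as Proposition 3.2 below implies. All identifiers and the toy distribution are our own choices.

```python
import numpy as np

def inner_pinball_risk(ys, tau, grid):
    """Inner pinball risk C_{L_tau,Q}(t) for the empirical distribution
    of the sample ys, evaluated at each grid point t."""
    r = ys[None, :] - grid[:, None]          # residuals y - t
    loss = np.where(r >= 0, tau * r, (tau - 1) * r)
    return loss.mean(axis=1)

rng = np.random.default_rng(1)
ys = rng.standard_normal(10_000)             # Q approx. N(0, 1)
grid = np.linspace(-3.0, 3.0, 601)
C = inner_pinball_risk(ys, tau=0.75, grid=grid)
t_hat = grid[np.argmin(C)]                   # exact minimizer on the grid
print(t_hat, np.quantile(ys, 0.75))          # both close to 0.674
```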
Let us now assume that we have two loss functions $L_{\mathrm{tar}} : X \times Y \times \mathbb{R} \to [0,\infty]$ and $L_{\mathrm{sur}} : X \times Y \times \mathbb{R} \to [0,\infty]$, and that our goal is to estimate the excess $L_{\mathrm{tar}}$-risk by the excess $L_{\mathrm{sur}}$-risk. This issue was thoroughly investigated in [7], where the main device was the so-called calibration function $\delta_{\max}(\,\cdot\,, Q, x) : [0,\infty] \to [0,\infty]$ defined by
$$\delta_{\max}(\epsilon, Q, x) := \begin{cases} \displaystyle \inf_{t \in \mathbb{R} \setminus \mathcal{M}_{L_{\mathrm{tar}},Q,x}(\epsilon)} \mathcal{C}_{L_{\mathrm{sur}},Q,x}(t) - \mathcal{C}^*_{L_{\mathrm{sur}},Q,x} & \text{if } \mathcal{C}^*_{L_{\mathrm{sur}},Q,x} < \infty, \\[2mm] \infty & \text{if } \mathcal{C}^*_{L_{\mathrm{sur}},Q,x} = \infty, \end{cases}$$
for all $\epsilon \in [0,\infty]$. In the following we sometimes write $\delta_{\max,L_{\mathrm{tar}},L_{\mathrm{sur}}}(\epsilon, Q, x) := \delta_{\max}(\epsilon, Q, x)$ whenever it is necessary to explicitly mention the target and surrogate losses. In addition, we follow our convention of omitting $x$ or $Q$ if $L$ is independent of one of them. Now recall that in [7, Lem. 2.9] the inequality
$$\delta_{\max}\bigl(\mathcal{C}_{L_{\mathrm{tar}},Q,x}(t) - \mathcal{C}^*_{L_{\mathrm{tar}},Q,x},\, Q, x\bigr) \;\le\; \mathcal{C}_{L_{\mathrm{sur}},Q,x}(t) - \mathcal{C}^*_{L_{\mathrm{sur}},Q,x}, \qquad t \in \mathbb{R}, \tag{7}$$
was established for situations where $\mathcal{C}^*_{L_{\mathrm{tar}},Q,x} < \infty$ and $\mathcal{C}^*_{L_{\mathrm{sur}},Q,x} < \infty$. Before we use this inequality to establish an inequality between the excess risks of $L_{\mathrm{tar}}$ and $L_{\mathrm{sur}}$, let us finally recall that the Fenchel-Legendre bi-conjugate $g^{**} : I \to [0,\infty]$ of a function $g : I \to [0,\infty]$ defined on an interval $I$ is the largest convex function $h : I \to [0,\infty]$ satisfying $h \le g$. In addition, we write $g(\infty) := \lim_{t \to \infty} g(t)$ if $I = [0,\infty)$. With these preparations we can now establish the following result, which is a generalization of [7, Thm. 2.18].

Theorem 3.1 Let $P$ be a distribution on $X \times Y$ with $\mathcal{R}^*_{L_{\mathrm{tar}},P} < \infty$ and $\mathcal{R}^*_{L_{\mathrm{sur}},P} < \infty$, and assume that there exist $p \in (0,\infty]$ and functions $b : X \to [0,\infty]$ and $\psi : [0,\infty) \to [0,\infty)$ such that
$$\delta_{\max}\bigl(\epsilon, P(\,\cdot\,|x), x\bigr) \;\ge\; b(x)\, \psi(\epsilon), \qquad \epsilon \ge 0,\ x \in X, \tag{8}$$
and $b^{-1} \in L_p(P_X)$. Then for $\bar\psi := \psi^{p/(p+1)} : [0,\infty) \to [0,\infty)$ and all $f : X \to \mathbb{R}$ we have
$$\bar\psi^{**}\bigl(\mathcal{R}_{L_{\mathrm{tar}},P}(f) - \mathcal{R}^*_{L_{\mathrm{tar}},P}\bigr) \;\le\; \|b^{-1}\|^{p/(p+1)}_{L_p(P_X)}\, \bigl(\mathcal{R}_{L_{\mathrm{sur}},P}(f) - \mathcal{R}^*_{L_{\mathrm{sur}},P}\bigr)^{p/(p+1)}.$$

Proof: Let us first consider the case $\mathcal{R}_{L_{\mathrm{tar}},P}(f) < \infty$. Since $\bar\psi^{**}$ is convex and satisfies $\bar\psi^{**}(\epsilon) \le \bar\psi(\epsilon)$ for all $\epsilon \in [0,\infty)$, we see by (6) and Jensen's inequality that
$$\bar\psi^{**}\bigl(\mathcal{R}_{L_{\mathrm{tar}},P}(f) - \mathcal{R}^*_{L_{\mathrm{tar}},P}\bigr) \;\le\; \int_X \bar\psi\Bigl(\mathcal{C}_{L_{\mathrm{tar}},P(\cdot|x),x}\bigl(f(x)\bigr) - \mathcal{C}^*_{L_{\mathrm{tar}},P(\cdot|x),x}\Bigr)\, dP_X(x). \tag{9}$$
Moreover, combining (7) with (8) we obtain
$$b(x)\, \psi\Bigl(\mathcal{C}_{L_{\mathrm{tar}},P(\cdot|x),x}(t) - \mathcal{C}^*_{L_{\mathrm{tar}},P(\cdot|x),x}\Bigr) \;\le\; \mathcal{C}_{L_{\mathrm{sur}},P(\cdot|x),x}(t) - \mathcal{C}^*_{L_{\mathrm{sur}},P(\cdot|x),x}$$
for $P_X$-almost all $x \in X$ and all $t \in \mathbb{R}$. Writing $\bar\psi = (b^{-1})^{1/q} (b\psi)^{1/q}$ with $q := (p+1)/p$ and applying Hölder's inequality with the exponents $p+1$ and $q$, we thus find that the right-hand side of (9) is less than or equal to
$$\int_X b^{-1/q}(x)\, \Bigl(b(x)\, \psi\bigl(\mathcal{C}_{L_{\mathrm{tar}},P(\cdot|x),x}(f(x)) - \mathcal{C}^*_{L_{\mathrm{tar}},P(\cdot|x),x}\bigr)\Bigr)^{1/q}\, dP_X(x) \;\le\; \Bigl(\int_X b^{-p}\, dP_X\Bigr)^{\frac{1}{p+1}} \Bigl(\int_X \mathcal{C}_{L_{\mathrm{sur}},P(\cdot|x),x}\bigl(f(x)\bigr) - \mathcal{C}^*_{L_{\mathrm{sur}},P(\cdot|x),x}\, dP_X(x)\Bigr)^{1/q}.$$
Since $\bigl(\int_X b^{-p}\, dP_X\bigr)^{1/(p+1)} = \|b^{-1}\|^{p/(p+1)}_{L_p(P_X)}$ and, by (6), the second factor equals $\bigl(\mathcal{R}_{L_{\mathrm{sur}},P}(f) - \mathcal{R}^*_{L_{\mathrm{sur}},P}\bigr)^{p/(p+1)}$, combining this estimate with (9) gives the assertion. Let us finally deal with the case $\mathcal{R}_{L_{\mathrm{tar}},P}(f) = \infty$. If $\bar\psi^{**}(\infty) = 0$ there is nothing to prove, and hence we restrict our considerations to the case $\bar\psi^{**}(\infty) > 0$. Following the proof of [7, Thm. 2.13] we then see that there exist constants $c_1, c_2 \in (0,\infty)$ satisfying $t \le c_1 \bar\psi^{**}(t) + c_2$ for all $t \in [0,\infty]$. From this we obtain
$$\infty = \mathcal{R}_{L_{\mathrm{tar}},P}(f) - \mathcal{R}^*_{L_{\mathrm{tar}},P} \;\le\; c_1 \int_X \bar\psi\Bigl(\mathcal{C}_{L_{\mathrm{tar}},P(\cdot|x),x}\bigl(f(x)\bigr) - \mathcal{C}^*_{L_{\mathrm{tar}},P(\cdot|x),x}\Bigr)\, dP_X(x) + c_2,$$
and estimating the integral as in the case $\mathcal{R}_{L_{\mathrm{tar}},P}(f) < \infty$, using $b^{-1} \in L_p(P_X)$ and Hölder's inequality, we conclude that $\mathcal{R}_{L_{\mathrm{sur}},P}(f) - \mathcal{R}^*_{L_{\mathrm{sur}},P} = \infty$, i.e., the assertion is shown.

Our next goal is to determine the inner risks and their minimizers for the pinball loss.
To this end recall (see, e.g., [1, Theorem 23.8]) that given a distribution $Q$ on $\mathbb{R}$ and a non-negative measurable function $g : \mathbb{R} \to [0,\infty)$ we have
$$\int_{\mathbb{R}} g\, dQ \;=\; \int_0^\infty Q(g \ge s)\, ds. \tag{10}$$

Proposition 3.2 Let $\tau \in (0,1)$ and $Q$ be a distribution on $\mathbb{R}$ with $\mathcal{C}^*_{L_\tau,Q} < \infty$, and let $t^*$ be a $\tau$-quantile of $Q$. Then there exist $q_+, q_- \in [0,\infty)$ with $q_+ + q_- = Q(\{t^*\})$ such that for all $t \ge 0$ we have
$$\mathcal{C}_{L_\tau,Q}(t^* + t) - \mathcal{C}_{L_\tau,Q}(t^*) = t\, q_+ + \int_0^t Q\bigl((t^*, t^* + s)\bigr)\, ds, \tag{11}$$
$$\mathcal{C}_{L_\tau,Q}(t^* - t) - \mathcal{C}_{L_\tau,Q}(t^*) = t\, q_- + \int_0^t Q\bigl((t^* - s, t^*)\bigr)\, ds. \tag{12}$$

Proof: Let us consider the distribution $Q^{(t^*)}$ defined by $Q^{(t^*)}(A) := Q(t^* + A)$ for all measurable $A \subset \mathbb{R}$. Then it is not hard to see that $0$ is a $\tau$-quantile of $Q^{(t^*)}$. Moreover, we obviously have $\mathcal{C}_{L_\tau,Q}(t^* + t) = \mathcal{C}_{L_\tau,Q^{(t^*)}}(t)$, and hence we may assume without loss of generality that $t^* = 0$. Then our assumptions together with $Q((-\infty,0]) + Q([0,\infty)) = 1 + Q(\{0\})$ yield $\tau \le Q((-\infty,0]) \le \tau + Q(\{0\})$, i.e., there exists a $q_+$ satisfying $0 \le q_+ \le Q(\{0\})$ and
$$Q\bigl((-\infty, 0]\bigr) = \tau + q_+. \tag{13}$$
Let us now compute the inner risks of $L_\tau$. For $t \ge 0$ we have
$$\mathcal{C}_{L_\tau,Q}(t) = (\tau - 1) \int_{(-\infty,t)} (y - t)\, dQ(y) + \tau \int_{[t,\infty)} (y - t)\, dQ(y), \tag{14}$$
and combining (14) with (13) and (10), a straightforward calculation yields (11). The identity (12) follows analogously by applying (11) to the reflected distribution $A \mapsto Q(-A)$ and the pinball parameter $1 - \tau$; the analogue of (13) for this distribution produces a $q_-$ with $q_+ + q_- = Q(\{0\})$.

Proof of Theorem 2.5: Combining Proposition 3.2 with Definition 2.1, a simple calculation yields
$$\delta_{\max,L_{\mathrm{tar}},L_\tau}(\epsilon, Q) \;\ge\; c_Q\, \psi\bigl(\epsilon^{1/q}\bigr), \tag{15}$$
where $c_Q$ is the constant satisfying (3), $L_{\mathrm{tar}}$ is the target loss measuring the distance to the exact minimizer, and $\psi(\epsilon) := \epsilon^2/2$ for $\epsilon \in [0,\alpha]$ and $\psi(\epsilon) := \alpha\epsilon - \alpha^2/2$ for $\epsilon > \alpha$. For $q := p/(p+1)$ let us further define $\bar\psi : [0,\infty) \to [0,\infty)$ by $\bar\psi(\epsilon) := \psi^q(\epsilon^{1/q})$, $\epsilon \ge 0$. In view of Theorem 3.1 we then need to find a convex function $\hat\psi : [0,\infty) \to [0,\infty)$ such that $\hat\psi \le \bar\psi^{**}$. To this end we define $\hat\psi(\epsilon) := s_p\,\epsilon^2$ if $\epsilon \le s_p a_p$, and $\hat\psi(\epsilon) := 2 s_p^2 a_p\, \epsilon - s_p^3 a_p^2$ if $\epsilon > s_p a_p$, where $a_p := \alpha^{p/(p+1)}$ and $s_p := 2^{-1/(p+1)}$. An easy calculation shows that $\hat\psi : [0,\infty) \to [0,\infty)$ is continuously differentiable and that its derivative is increasing, so that $\hat\psi$ is convex. Moreover, we have $\hat\psi \le \bar\psi$, and since the bi-conjugate is the largest convex function dominated by $\bar\psi$, this implies
$$\hat\psi \le \bar\psi^{**}. \tag{16}$$
We finally obtain the assertion from (15), (16), and Theorem 3.1.

Proof of Theorem 2.6: This is a direct consequence of [9, Theorem 2.1].

The proof of Theorem 2.7 follows immediately from the following lemma.

Lemma 3.3 Assume that the regular conditional distribution $Q$ of $Y$ given $x$ is atom-free and symmetric around its median $t^*$, which we may take to be $t^* = 0$. Further assume that $Q([\epsilon - \delta, \epsilon + \delta]) > 0$ for all $\delta > 0$. Then the inner $L$-risk $\mathcal{C}_{L,Q}(t)$ of the $\epsilon$-insensitive loss with $\epsilon \ge 0$ fulfills
$$\mathcal{C}_{L,Q}(0) = 2 \int_\epsilon^\infty Q([s,\infty))\, ds,$$
$$\mathcal{C}_{L,Q}(t) - \mathcal{C}_{L,Q}(0) = \int_{\epsilon - t}^{\epsilon} Q([s,\infty))\, ds - \int_{\epsilon}^{\epsilon + t} Q([s,\infty))\, ds \;\ge\; 0, \qquad \text{if } t \in [0,\epsilon],$$
$$\mathcal{C}_{L,Q}(t) - \mathcal{C}_{L,Q}(\epsilon) = \int_0^{t-\epsilon} Q([s,\infty))\, ds - \int_{2\epsilon}^{t+\epsilon} Q([s,\infty))\, ds + 2 \int_0^{t-\epsilon} Q([0,s))\, ds \;\ge\; 0, \qquad \text{if } t > \epsilon.$$
The set of exact minimizers contains only the median, i.e., $\mathcal{M}_{L,Q,x}(0^+) = \{t^*\}$.

Proof: Because $L(y,t) = L(-y,-t)$ for all $y, t \in \mathbb{R}$, we only have to consider $t \ge 0$; analogously to the proof of Proposition 3.2 we may assume w.l.o.g. that $Q$ is symmetric around $t^* = 0$. For later use we note that for $0 \le a \le b$, Equation (10) yields
$$\int_a^b y\, dQ(y) = a\, Q([a,b]) + \int_a^b Q([s,b])\, ds. \tag{17}$$
Moreover, the definition of $L$ implies
$$\mathcal{C}_{L,Q}(t) = \int_{(-\infty,\, t-\epsilon)} (t - \epsilon - y)\, dQ(y) + \int_{(t+\epsilon,\, \infty)} (y - t - \epsilon)\, dQ(y).$$
Applying (10) and (17) to both integrals and using the symmetry and atom-freeness of $Q$ then gives
$$\mathcal{C}_{L,Q}(t) = \int_{t+\epsilon}^{\infty} Q([s,\infty))\, ds + \int_{\epsilon - t}^{\infty} Q([s,\infty))\, ds, \qquad t \ge 0. \tag{18}$$
In particular, $\mathcal{C}_{L,Q}(0) = 2 \int_\epsilon^\infty Q([s,\infty))\, ds$, which is the first formula of Lemma 3.3. For $t \in [0,\epsilon]$ both lower integration limits in (18) are non-negative, and subtracting the expression for $\mathcal{C}_{L,Q}(0)$ immediately yields
$$\mathcal{C}_{L,Q}(t) - \mathcal{C}_{L,Q}(0) = \int_{\epsilon - t}^{\epsilon} Q([s,\infty))\, ds - \int_{\epsilon}^{\epsilon + t} Q([s,\infty))\, ds \;\ge\; 0,$$
where the inequality holds because $s \mapsto Q([s,\infty))$ is non-increasing and both integrals extend over intervals of length $t$. Let us now consider the case $t > \epsilon$. Splitting the second integral in (18) at $0$ and using that the symmetry and atom-freeness of $Q$ imply $Q([s,\infty)) = 1 - Q([-s,\infty))$ for $s < 0$, we obtain
$$\int_{\epsilon - t}^{\infty} Q([s,\infty))\, ds = \int_0^{t - \epsilon} \bigl(1 - Q([s,\infty))\bigr)\, ds + \int_0^{\infty} Q([s,\infty))\, ds.$$
Moreover, the symmetry and atom-freeness of $Q$ also yield $Q([0,s)) = \tfrac{1}{2} - Q([s,\infty))$ for $s \ge 0$, and hence $1 - Q([s,\infty)) = 2\,Q([0,s)) + Q([s,\infty))$. The two preceding displays together with (18) thus give
$$\mathcal{C}_{L,Q}(t) - \mathcal{C}_{L,Q}(\epsilon) = \int_0^{t-\epsilon} Q([s,\infty))\, ds - \int_{2\epsilon}^{t+\epsilon} Q([s,\infty))\, ds + 2 \int_0^{t-\epsilon} Q([0,s))\, ds \;\ge\; 0,$$
where the inequality follows since the first integral dominates the second, again because $s \mapsto Q([s,\infty))$ is non-increasing. Hence one exact minimizer of $\mathcal{C}_{L,Q}(\cdot)$ is the median $t^* = 0$. Finally, since $Q$ does not have atoms, the function $s \mapsto Q([s,\infty))$ is continuous, and hence the fundamental theorem of calculus applied to (18) shows that the derivative of $\mathcal{C}_{L,Q}(\cdot) : [0,\infty) \to \mathbb{R}$ is given by $\mathcal{C}'_{L,Q}(t) = Q([\epsilon - t, \infty)) - Q([\epsilon + t, \infty))$. Since $\mathcal{C}_{L,Q}(\cdot) : [0,\infty) \to \mathbb{R}$ is convex, we see that $t \in (0,\infty)$ minimizes $\mathcal{C}_{L,Q}(\cdot)$ if and only if $Q([\epsilon - t, \infty)) = Q([\epsilon + t, \infty))$, i.e., if and only if $Q([\epsilon - t, \epsilon + t)) = 0$. The latter is impossible when $Q$ has positive mass around $\epsilon$, i.e., if $Q([\epsilon - \delta, \epsilon + \delta]) > 0$ for all $\delta > 0$, then the set of exact minimizers is $\mathcal{M}_{L,Q,x}(0^+) = \{0\}$.

References

[1] H. Bauer. Measure and Integration Theory. De Gruyter, Berlin, 2001.
[2] A. Christmann and I. Steinwart. Consistency and robustness of kernel based regression. To appear in: Bernoulli, 2007.
[3] D.E. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential Operators. Cambridge University Press, 1996.
[4] C. Hwang and J. Shim. A simple quantile regression via support vector machine. In Advances in Natural Computation: First International Conference (ICNC), pages 512–520. Springer, 2005.
[5] R. Koenker. Quantile Regression. Cambridge University Press, 2005.
[6] B. Schölkopf, A.J. Smola, R.C. Williamson, and P.L. Bartlett. New support vector algorithms. Neural Computation, 12:1207–1245, 2000.
[7] I. Steinwart. How to compare different loss functions. Constr. Approx., 26:225–287, 2007.
[8] I. Steinwart, D. Hush, and C. Scovel. Function classes that approximate the Bayes risk. In Proceedings of the 19th Annual Conference on Learning Theory (COLT), pages 79–93. Springer, 2006.
[9] I. Steinwart, D. Hush, and C. Scovel. An oracle inequality for clipped regularized risk minimizers. In Advances in Neural Information Processing Systems 20, to appear, 2007.
[10] I. Takeuchi, Q.V. Le, T.D. Sears, and A.J. Smola. Nonparametric quantile estimation. J. Mach. Learn. Res., 7:1231–1264, 2006.