Learning coordinate gradients with multi-task kernels

Yiming Ying and Colin Campbell
Department of Engineering Mathematics, University of Bristol, Queens Building, University Walk, Bristol, BS8 1TR, UK. {enxyy, C.Campbell}@bris.ac.uk
We acknowledge support from EPSRC grant EP/E027296/1.

Abstract

Coordinate gradient learning is motivated by the problem of variable selection and of determining variable covariation. In this paper we propose a novel unifying framework for coordinate gradient learning from the perspective of multi-task learning, referred to as multi-task gradient learning (MGL). Our approach relies on multi-task kernels to simulate the structure of gradient learning, which has several appealing properties. Firstly, it allows us to introduce a novel algorithm which appropriately captures the inherent structure of coordinate gradient learning. Secondly, it gives rise to a clear algorithmic process: a computational optimization algorithm which is memory and time efficient. Finally, a statistical error analysis ensures convergence of the estimated function and its gradient to the true function and true gradient. We report preliminary experiments that validate MGL for variable selection as well as for determining variable covariation.

1 Introduction

Let $X \subseteq \mathbb{R}^d$ be compact, $Y \subseteq \mathbb{R}$, $Z = X \times Y$, and $\mathbb{N}_n = \{1, 2, \ldots, n\}$ for any $n \in \mathbb{N}$. A common theme in machine learning is to learn a target function $f^* : X \to Y$ from a finite set of input/output samples $\mathbf{z} = \{(x_i, y_i) : i \in \mathbb{N}_m\} \subseteq Z$. However, in many applications we not only wish to learn the target function, but also want to find which variables are salient and how these variables interact with each other. This problem has practical motivations: to facilitate data visualization and dimensionality reduction, for example. Such a motivation is important when there are many redundant variables and we wish to find the salient features among them. These problems occur in many contexts. For example, with gene expression array datasets, the vast majority of features may be redundant to a classification task and we need to find a small set of genuinely distinguishing features. These motivations have driven the design of various statistical and machine learning models [8, 11, 21, 22] for variable (feature) selection. Here, we build on previous contributions [15, 16, 17] by addressing coordinate gradient learning and its use for variable selection and covariation learning, i.e. the interaction between variables.

Specifically, for any $x \in X$, we write $x = (x^1, x^2, \ldots, x^d)$. The target is to learn the gradient of $f^*$ (if it exists), denoted by the vector-valued function $\nabla f^*(x) = \bigl(\tfrac{\partial f^*}{\partial x^1}, \ldots, \tfrac{\partial f^*}{\partial x^d}\bigr)$. The intuition behind gradient learning for variable selection and coordinate covariation is the following. The inner product between components of $\nabla f^*$ indicates the interaction between coordinate variables, while the norm of $\tfrac{\partial f^*}{\partial x^p}$ indicates the salience of the $p$-th variable: the smaller the norm, the less important this variable is.

In this paper we propose a novel unifying formulation of coordinate gradient learning from the perspective of multi-task learning. Learning multiple tasks together has been extensively studied both theoretically and practically in several papers [1, 2, 5, 7, 13, 14]. One way to frame this problem is to learn a vector-valued function, each of whose components is a real-valued function corresponding to a particular task. A key objective in this formulation is to capture an appropriate structure among tasks so that common information is shared across tasks.
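To make this intuition concrete, the following is a minimal sketch (not part of the paper's algorithm) which assumes that gradient estimates at the sample points are already available as a NumPy array; the helper names variable_saliency and variable_covariation are our own, chosen only for illustration.

```python
import numpy as np

def variable_saliency(G):
    """Relative empirical norm of each partial derivative, given an (m, d)
    array G with G[i, p] estimating df/dx^p at sample x_i; a small score
    suggests an unimportant variable."""
    norms = np.sqrt((G ** 2).mean(axis=0))
    return norms / norms.sum()

def variable_covariation(G):
    """Empirical inner products between gradient components; off-diagonal
    entries indicate how pairs of variables interact."""
    return G.T @ G / G.shape[0]

if __name__ == "__main__":
    # toy target f(x) = x^1 + 2*x^2 in five dimensions, exact gradient (1,2,0,0,0)
    G = np.tile(np.array([1.0, 2.0, 0.0, 0.0, 0.0]), (200, 1))
    print(variable_saliency(G))      # largest score on the second coordinate
    print(variable_covariation(G))   # non-zero block only for coordinates 1 and 2
```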
Here we follow this methodology and employ a vector-valued function $\vec{f} = (f_1, \vec{f}_2) = (f_1, f_2, \ldots, f_{d+1})$, where $f_1$ is used to simulate $f^*$ and $\vec{f}_2$ is used to simulate its gradient $\nabla f^*$. We assume that $\vec{f}$ comes from a vector-valued reproducing kernel Hilbert space (RKHS) associated with a multi-task (matrix-valued) kernel, see [14]. The rich structure of such an RKHS reflects the latent structure of multi-task gradient learning, i.e. the pooling of information across the components (tasks) of $\vec{f}$ by means of the multi-task kernel.

The paper is organized as follows. In Section 2, we first review the definition of multi-task kernels and vector-valued RKHS. We then propose a unifying formulation of coordinate gradient learning from the perspective of multi-task learning, which we refer to as multi-task gradient learning (MGL). Different choices of multi-task kernel recover different learning models [11, 15, 16, 17], and this allows us to introduce a novel choice of multi-task kernel which reveals the inherent structure of gradient learning. Kernel methods [19, 20] usually enjoy a representer theorem which paves the way for designing efficient optimization algorithms; in Section 3 we establish a representer theorem for MGL algorithms. In Section 4 we discuss computational optimization approaches for MGL, mainly focusing on the least square loss and on the SVM counterpart for gradient learning. A statistical error analysis in Section 5 ensures the convergence of the estimated function and its gradient to the true function and true gradient. Finally, preliminary numerical experiments are reported in Section 6 to validate the proposed approach.

1.1 Related work

A number of machine learning and statistical models have been proposed for variable (feature) selection. The least absolute shrinkage and selection operator (LASSO) [21] and basis pursuit denoising [8] use $\ell_1$ regularization to remove redundant features. Weston et al. [22] introduced a method for selecting features by minimizing bounds on the leave-one-out error. Guyon et al. [11] proposed recursive feature elimination (RFE) based on a linear kernel SVM: variables with least influence on the weight norm $\frac{1}{2}\|w\|^2$ are considered least important. Although these algorithms are promising, unresolved issues remain. For example, they do not indicate variable covariation, and their extension to the nonlinear case has only been marginally discussed. The method outlined here covers variable covariation and nonlinear feature selection; in particular, in Section 2 we show that RFE-SVM is a special case of our multi-task formulation.

Motivated by the Taylor expansion of a function at the samples $\{x_i : i \in \mathbb{N}_m\}$, Mukherjee et al. [15, 16, 17] proposed an algorithm for learning the gradient function. They used the norms of its components for variable (feature) selection and the spectral decomposition of the covariance of the learned gradient function for dimension reduction [16]. Specifically, let $\mathcal{H}_G$ be a scalar RKHS (see e.g. [3]); a function $f_1 \in \mathcal{H}_G$ is used to simulate $f^*$ and, for any $p \in \mathbb{N}_d$, a function $f_{p+1} \in \mathcal{H}_G$ is used to learn $\partial f^*/\partial x^p$. The results presented by Mukherjee et al. are quite promising both theoretically and practically, but no information is pooled across the components of the gradient, which may lead to a less accurate approximation of the true gradient. We address all these issues in our unifying framework.
2 Multi-task kernels and learning gradients

In this section we formulate the gradient learning problem from the perspective of multi-task learning. Specifically, we employ a vector-valued RKHS to simulate the target function and its gradient. The rich structure of vector-valued RKHS enables us to couple information across the components of the gradient in terms of multi-task kernels.

2.1 Multi-task model for gradient learning

We begin with a review of the definition of multi-task kernels and of vector-valued RKHS (see [14] and the references therein). Throughout this paper, we use $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ to denote the standard Euclidean inner product and norm respectively.

Definition 1 We say that a function $K : X \times X \to \mathbb{R}^{(d+1) \times (d+1)}$ is a multi-task (matrix-valued) kernel on $X$ if, for any $x, t \in X$, $K(x, t)^T = K(t, x)$, and it is positive semi-definite, i.e., for any $m \in \mathbb{N}$, $\{x_j \in X : j \in \mathbb{N}_m\}$ and $\{y_j \in \mathbb{R}^{d+1} : j \in \mathbb{N}_m\}$ there holds
$$\sum_{i,j \in \mathbb{N}_m} \langle y_i, K(x_i, x_j) y_j \rangle \ge 0. \qquad (1)$$

In the spirit of the Moore–Aronszajn theorem, there is a one-to-one correspondence between a multi-task kernel $K$ with property (1) and a vector-valued RKHS of functions $\vec{f} : X \to \mathbb{R}^{d+1}$, denoted $\mathcal{H}_K$, with inner product $\langle \cdot, \cdot \rangle_K$; see e.g. [14]. Moreover, for any $x \in X$, $y \in \mathbb{R}^{d+1}$ and $\vec{f} \in \mathcal{H}_K$, we have the reproducing property
$$\langle \vec{f}(x), y \rangle = \langle \vec{f}, K_x y \rangle_K, \qquad (2)$$
where $K_x y : X \to \mathbb{R}^{d+1}$ is defined, for any $t \in X$, by $K_x y(t) := K(t, x) y$.

In the following we describe our multi-task kernel-based framework for gradient learning. Following Mukherjee et al. [15, 17], the derivation of gradient learning can be motivated by the Taylor expansion of $f^*$: $f^*(x_i) \approx f^*(x_j) + \nabla f^*(x_j)(x_i - x_j)^T$. Since we wish to learn $f^*$ with $f_1$ and $\nabla f^*$ with $\vec{f}_2$, replacing $f^*(x_i)$ by $y_i$, the error$^1$ $y_i - f_1(x_j) - \vec{f}_2(x_j)(x_i - x_j)^T$ is expected to be small whenever $x_i$ is close to $x_j$. To enforce the constraint that $x_i$ is close to $x_j$, we introduce a weight function produced by a Gaussian with deviation $s$,
$$w_{ij} = \frac{1}{s^{d+2}}\, e^{-\frac{\|x_i - x_j\|^2}{2 s^2}},$$
so that $w_{ij} \to 0$ when $x_i$ is far away from $x_j$. We now propose the following multi-task formulation for gradient learning (MGL):
$$\vec{f}_{\mathbf{z}} = \arg\min_{\vec{f} \in \mathcal{H}_K} \Bigl\{ \frac{1}{m^2} \sum_{i,j \in \mathbb{N}_m} w_{ij}\, L\bigl(y_i,\, f_1(x_j) + \vec{f}_2(x_j)(x_i - x_j)^T\bigr) + \lambda \|\vec{f}\|_K^2 \Bigr\}, \qquad (3)$$
where $L : \mathbb{R} \times \mathbb{R} \to [0, \infty)$ is a prescribed loss function and $\lambda$ is usually called the regularization parameter. The minimum is taken over a vector-valued RKHS with multi-task kernel $K$. The first component $f_{1,\mathbf{z}}$ of the minimizer $\vec{f}_{\mathbf{z}}$ is used to simulate the target function, and the other components $\vec{f}_{2\mathbf{z}} := (f_{2,\mathbf{z}}, \ldots, f_{d+1,\mathbf{z}})$ are used to learn its gradient. In Section 6, we discuss how to use the solution $\vec{f}_{\mathbf{z}}$ for variable selection as well as for covariation measurement.

($^1$ Our form of Taylor expansion is slightly different from that used in [15, 17]; however, the essential idea is the same.)

Different choices of loss function yield different gradient learning algorithms. For instance, if the loss function is $L(y, t) = (y - t)^2$ then algorithm (3) leads to the least-square multi-task gradient learning (LSMGL) algorithm:
$$\arg\min_{\vec{f} \in \mathcal{H}_K} \Bigl\{ \frac{1}{m^2} \sum_{i,j \in \mathbb{N}_m} w_{ij}\bigl(y_i - f_1(x_j) - \vec{f}_2(x_j)(x_i - x_j)^T\bigr)^2 + \lambda \|\vec{f}\|_K^2 \Bigr\}. \qquad (4)$$
In classification, the choice of loss function $L(y, t) = (1 - y t)_+$ in algorithm (3) yields the support vector machine for multi-task gradient learning (SVMMGL):
$$\arg\min_{\vec{f} \in \mathcal{H}_K,\, b \in \mathbb{R}} \Bigl\{ \frac{1}{m^2} \sum_{i,j \in \mathbb{N}_m} w_{ij}\bigl(1 - y_i (f_1(x_j) + \vec{f}_2(x_j)(x_i - x_j)^T + b)\bigr)_+ + \lambda \|\vec{f}\|_K^2 \Bigr\}. \qquad (5)$$
Here, $f_1(x) + b$ is used to learn the target function and $\vec{f}_2$ simulates the gradient of the target function; hence $b$ plays the same role of offset as in the standard SVM formulation.
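As a concrete illustration of the weighted data-fit term in (4), the following hedged sketch computes the Gaussian weights $w_{ij}$ and the empirical least-square loss for candidate values of $f_1$ and $\vec{f}_2$ at the sample points; the regularization term $\lambda\|\vec{f}\|_K^2$ is omitted since it depends on the kernel expansion of $\vec{f}$, and the helper names are ours, not the paper's.

```python
import numpy as np

def gaussian_weights(X, s):
    """w_ij = s^{-(d+2)} exp(-||x_i - x_j||^2 / (2 s^2)) for an (m, d) input array X."""
    d = X.shape[1]
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * s ** 2)) / s ** (d + 2)

def lsmgl_data_fit(X, y, f1_vals, grad_vals, s):
    """(1/m^2) sum_{ij} w_ij (y_i - f1(x_j) - f2(x_j).(x_i - x_j))^2,
    given f1_vals[j] ~ f1(x_j) and grad_vals[j] ~ f2(x_j)."""
    m = X.shape[0]
    W = gaussian_weights(X, s)
    # first-order Taylor prediction of y_i made from the point x_j
    diffs = X[:, None, :] - X[None, :, :]                     # diffs[i, j] = x_i - x_j
    taylor = f1_vals[None, :] + np.einsum('jd,ijd->ij', grad_vals, diffs)
    residual = y[:, None] - taylor
    return (W * residual ** 2).sum() / m ** 2
```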
In this case, at each point xi the error between the output yi and f (xi ) is now replaced by the error between yi and the first order Taylor expansion of f (xi ) at xj , i.e., f1 (xj ) + - f2 (xj )(xi - xj )T . 2.2 Choice of multi-task kernels We note that if K is a diagonal matrix-valued kernel, then each component of a vector-valued function in the associated RKHS of K can be represented, independently of the other components, as a function in the RKHS of a scalar kernel. Consequently, for a scalar kernel G if we choose the multi-task kernel K given, for any x, t X , by K(x, t) = G(x, t)Id+1 then the MGL algorithm (3) is reduced to the gradient learning algorithm proposed in [15, 16, 17] using (d + 1)-folds of scalar RKHS. There, under some conditions on the underlying distribution , it has been proven - that f1,z f and f2z f when the number of samples tends to infinity. Although their results are promising both theoretically and practically, a more inherent structure - would be f2z = f1,z . In our MGL framework (3), we can recover this structure by choosing the multi-task kernel appropriately. Our alternative choice of multi-task kernel is stimulated by the Hessian of Gaussian kernel proposed in [7]. For any scalar kernel G and any x, t X , we introduce the function ( G (x, t), ( tG(x, t))T 6) K(x, t) = xG(x, t) 2 t G(x, t) x which we will show to be a multi-task kernel. To se this, e 2 let 2 be the Hilbert space with norm w 22 = j =1 wj . Suppose that G has a feature representation, i.e., G(x, t) = (x), (t) 2 and, for any f HG , there exits a vector w 2 such that f (x) = w, (x) 2 and f G= w 2. Proof: Since G is a scalar kernel, for any x, t X we have that G(x, t) = G(t, x). Therefore, K(x, t)T = K(t, x). Moreover, G is assumed to be a Mercer kernel which implies that it has a feature representation G(x, t) = (x), (t) 2. Consequently, tG(x, t) = ((x), (t)) and xG(x, t) = ( (x), (t)), and 2 t G(x, t) = (x), (t) 2 . Then, x we introduce, for any w 2, x X, y Rd+1 , the feature map (x) : 2 Rd+1 defined by (x)w := ( (x), w 2, 1 (x), w 2, . . . , d (x), w 2) . Its adjoint map is given, for any t X and y Rd+1 , by p (t)y := (x)y1 + Nd p (x)yp+1 . Hence, K(x, t)y = (x) (t)y. Consequently, for any m N, any i, j Nm i and yi , yj Rd+1 , it follows ,j Nm yi , K (xi , xj )yj = iNm (xi )yi 22 is nonnegative which tells us that K is a multi-task kernel. - We turn to the second assertion. When f is in the form of a finite combination of kernel section {Kx y : y Rd+1 , x X }, the second assertion follows directly from the definition of K. For the general case, we use the fact that the vectorvalued RKHS is the closure of the span of kernel sections, see [14]. To this end, assume that there exists a sequence - j j j { fj = (f1 , f2 , . . . , fd+1 )} of finite combination of kernel - - sections such that fj f HK w.r.t. the RKHS norm. Hence, by the reproducing property (2), for any x X and - - j p Nd , |fp+1 (x) - fp+1 (x)| = | ep+1 , fj (x) - f (x) | = e - - T K(x, x)e | - - f , Kx ep+1 K| - - f K fj fj p+1 p+1 which tends to zeros as j tends to infinity. Consequently, it follows, for any x X , j fp+1 (x) fp+1 (x), T as j . (8) Let p Rd be a vector with its p-th component > 0 and others equal zero. Applying the reproducing property (2) yields that [ =j j 1 - - - f , Kx+p e1 - Kx e1 K fj K - - - f K e1 , 12 (x + p , x + p ) e fj +K(x, x) - K(x1 x + p ) - K(x + p , x) 1 2 , 1 G - = - - f K 2 (x + p , x + p ) fj 1 2 G(x, x) - G(x, x + p ) - G(x + p , x) . 
f1 (x+p )-f1 (x)]-[f1 (x+p )-f1 (x)] Indeed, if the input space X is compact and G : X × X R is a Mercer kernel, i.e., it is continuous, symmetric and positive semi-definite, then, according to Mercer theorem, G always has the above feature representation (see e.g. [9]). Now we have the following proposition about K defined by equation (6). Let ep be the p-th coordinate basis in Rd+1 . Theorem 2 For any smooth scalar Mercer kernel G, define function K by equation (6). Then, K is a multi-task kernel - - and, for any f = (f1 , f2 ) HK there holds - f2 = f1 . (7) Since G is smooth and X is compact there exists an absolute constant c > 0 such that, for any > 0, the above equation is furthermore bounded by [ j j f1 (x+p )-f1 (x)]-[f1 (x+p )-f1 (x)] - c - - f K. fj Consequently, letting 0 in the above equation it follows j |p f1 (x) - p f1 (x)| 0 as j tends to infinity. Combining j j this with equation (8) and the fact that fp+1 (x) = p f1 (x) implies that p f1 (x) = fp+1 (x) which completes the theorem. The scalar kernel G plays the role of a hyper-parameter to produce the multi-task kernel K given by equation (6). By the above theorem, if we choose K to be defined by equation - (6) then any solution fz = (f1,z , f2z ) of algorithm (3) enjoys - the structure f2z = f1,z . Further specifying the kernel G in the definition (6) of multitask kernel K, we can recover the RFE feature ranking algorithm for a linear SVM [11]. To see this, let G be a linear ker- nel. In the next section, we will see that, for any solution fz of MGL algorithm (3), there exists {cj,z Rd+1 : j Nm } j - such that fz = Nm Kxj cj,z . Since G is linear, comT bining this with Theorem 2 we know that f1,z (x) = Wz x jT - T with Wz = (xj , 1)cj,z R and f2z = f1,z = Wz . Consequently, in the case we have that - T f1,z (xj ) + f2z (xj )(xi - xj ) = Wz xi = f1,z (xi ). Moreover, by the reproducing property (2) we can check that - K = f1,z G = Wz 2. fz 2 2 Putting the above equations together, in this special case we know that the SVMMGL algorithm (5) is reduced, with the choice of wij = 1, to the classical learning algorithm: 1i . min (1 - yi (W T xi + b))+ + W 2 W Rd m Nm and, for every j Nm , the representer coefficient cj,z span{e1 , xi : i Nm }. - - - Proof: We can write any minimizer fz HK as fz = f + a K - - f where f is in the span xj e1 , Kxj xi , i, j Nm - nd f is perpendicular to this span space. By the repro ducing property (2), we have that - (xj ), e1 + xi - xj = f - , Kxj (e1 + xi - xj ) K = - , Kxj (e1 + xi - xj ) K. f f - Hence, f makes no contribution to the loss function in the MGL algorithm (9) (i.e. algorithm (3)). However, the norm - 2 = - 2 + - 2 > f 2 unless f = fK fK fK K - 0.KThis implies, any solutioa fz belongs to the span space n nd the corresponding represenxj e1 , Kxj xi , i, j Nm ter coefficients belong to the span of {e1 , xi : i Nm }. The representer theorem above tells us that the optimal so- lution fz of algorithm (3) lives in the finite span of training samples which paves the way for designing efficient optimization algorithms for multi-task gradient learning. 4 Optimization and solution In this section, by the above representer theorem, we explore efficient algorithms for computing the representer coefficients. For clarity, we mainly focus on least-square multitask gradient learning algorithms (LSMGL). At the end of this section, the support vector machine for gradient learning (SVMMGL) in classification will be briefly discussed. One can apply the subsequent procedures to other loss functions. 
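Before turning to the computation of the representer coefficients, the following sketch makes the block structure of the kernel in equation (6) concrete for the special case where $G$ is a Gaussian kernel $G(x, t) = \exp(-\|x - t\|^2/(2\sigma^2))$; the closed-form derivatives below are our own direct computation for this particular choice of $G$ and are intended only as an illustration.

```python
import numpy as np

def multitask_kernel_block(x, t, sigma):
    """Return the (d+1) x (d+1) block of equation (6),
       K(x,t) = [[ G(x,t),         (grad_t G(x,t))^T   ],
                 [ grad_x G(x,t),   grad_x grad_t G(x,t) ]],
    for the Gaussian scalar kernel G."""
    x, t = np.asarray(x, dtype=float), np.asarray(t, dtype=float)
    d = x.size
    u = x - t
    g = np.exp(-(u @ u) / (2.0 * sigma ** 2))
    grad_t = g * u / sigma ** 2            # derivative in the second argument
    grad_x = -grad_t                       # derivative in the first argument
    hess = g * (np.eye(d) / sigma ** 2 - np.outer(u, u) / sigma ** 4)
    K = np.empty((d + 1, d + 1))
    K[0, 0] = g
    K[0, 1:] = grad_t
    K[1:, 0] = grad_x
    K[1:, 1:] = hess
    return K
```

One can check directly that this block satisfies $K(x, t)^T = K(t, x)$, consistent with Definition 1.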
4.1 Computation of representer coefficients To specify the solution of LSMGL, we denote the column vector Cz Rm(d+1) by consecutively catenating all column vectors {cj,z Rd+1 : j Nm } and, likewise we define a column vector Y Rm(d+1) by catenating column vectors {yi Rd+1 : i Nm }. Moreover, we introduce an m(d + 1) × m(d + 1) matrix by catenating all (d + 1) × (d + 1) matrix K(xi , xj ) denoted by Kx = (K(xi , xj ))i,j Nm . Finally, we introduce a system of equations l m2 cj + Bj K(xj , xl )cl = yj , j Nm Nm Hence, our formulation of gradient learning (3) can be regarded as a generalization of RFE-SVM [11] to the nonlinear case. In the subsequent sections we discuss a general representation theorem and computational optimization problems motivated by MGL algorithms. 3 Representer theorem In this section we investigate the representer theorem for the MGL algorithm (3). This forms a foundation for the derivation of a computationally efficient algorithm for MGL in Section 4. Recall that ep is the p-th coordinate basis in Rd+1 and, for any x Rd , denote the vector xT by (0, xT ). By the repro ducing property (2), we have that f1 (xj ) = - (xj ), e1 = f - - , Kxj e1 K and likewise, f2 (xj )(xi -xj )T = - (xj ), xi - f f xj = - , Kxj (xi - xj ) K. Then, the algorithm (3) can be f rewritten by 1 i y wij L i , - , f arg min m2 - ,j Nm f HK . (9) + Kxj (e1 + xi - xj ) K - 2 fK In analogy with standard kernel methods [19, 20], we have the following representer theorem for MGL by using the properties of multi-task kernels. Theorem 3 For any multi-task kernel K, consider the gradient learning algorithm (3). Then, there exists representer coefficients {cj,z Rd+1 : j Nm } such that - j Kxj cj,z fz = Nm (10) i T whiere Bj = Nm wij (e1 + xi - xj )(e1 + xi - xj ) , yj = Nm wij yi (e1 + xi - xj ). We now can solve the LSMGL algorithm by the following theorem. Theorem 4 For any j Nm , the vectors Bj , yj be defined by equation (10). Then, the representer coefficients Cz for the solution of the LSMGL algorithm are given by the following equation m C B Km 2 Y= Im(d+1) + diag 1 , . . . , Bm x i,j =1 z . (11) Proof: By Theorem 3, there exists {cj,z Rd+1 : j j - Nm } such that fz = Nm Kxj cj,z . However, taking the functional derivative of algorithm (3) with respective to f - i 1 yields that m2 ,j Nm wij fz , Kxj (e1 + xi - xj ) K - K - yi xj (e1 + xi - xj ) + fz = 0 which means that cj,z = i 1 fz Nm wij (yi - - , Kxj (e1 + xi - xj ) K)(e1 + xi - m2 xj ). Equivalently, equation (10) holds true, and hence completes the assertion. Solving equation (11) involves the inversion of an m(d + 1) × m(d + 1) matrix whose time complexity is usually O((md)3 ). However, it is computationally prohibitive since the coordinate (feature) dimension d is very large in many applications. Fortunately, as suggested in Theorem 3, the representer coefficients {cj,z : j Nm } can be represented by the span of column vectors of matrix Mx = {e1 , x1 , . . . , xm-1 , xm }. This observation suggests the possibility of reduction of the original high dimensional problem in Rd+1 to the low dimensional space spanned by Mx . This low dimensional space can naturally be introduced by singular vectors of Mx . To this end, we consider the representation of the matrix Mx by its singular vectors. It will be proven to be useful to represent matrix Mx from the singular value decomposition (SVD) of the data matrix defined by x . Mx = 1 , x2 , . . . , xm-1 , xm Apparently, the rank s of Mx is at most min(m, d). 
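The following sketch assembles and solves the linear system (11) directly, before the SVD-based reduction discussed next; it reuses gaussian_weights from the earlier sketch and takes any user-supplied function kernel_block(x, t) returning the $(d+1) \times (d+1)$ block $K(x, t)$. For large $d$ this is exactly the $O((md)^3)$ computation that the reduction below is designed to avoid, so it is only a reference implementation under those assumptions.

```python
import numpy as np

def lsmgl_coefficients(X, y, kernel_block, s, lam):
    """Solve (m^2*lam*I + diag(B_1,...,B_m) K_x) C = Y_tilde and return the
    representer coefficients c_{j,z} as an (m, d+1) array."""
    m, d = X.shape
    W = gaussian_weights(X, s)
    # v_ij = e1_tilde + x_i_tilde - x_j_tilde, where x_tilde = (0, x^T)
    V = np.zeros((m, m, d + 1))
    V[:, :, 0] = 1.0
    V[:, :, 1:] = X[:, None, :] - X[None, :, :]
    B = np.einsum('ij,ijp,ijq->jpq', W, V, V)          # B_j = sum_i w_ij v_ij v_ij^T
    Y_tilde = np.einsum('ij,i,ijp->jp', W, y, V)       # Y_j = sum_i w_ij y_i v_ij
    # block kernel matrix K_x: block (j, l) is K(x_j, x_l)
    Kx = np.block([[kernel_block(X[j], X[l]) for l in range(m)] for j in range(m)])
    diagB = np.zeros((m * (d + 1), m * (d + 1)))
    for j in range(m):
        sl = slice(j * (d + 1), (j + 1) * (d + 1))
        diagB[sl, sl] = B[j]
    A = m ** 2 * lam * np.eye(m * (d + 1)) + diagB @ Kx
    C = np.linalg.solve(A, Y_tilde.reshape(-1))
    return C.reshape(m, d + 1)
```

For instance, with the Gaussian block sketched earlier one could pass kernel_block = lambda a, b: multitask_kernel_block(a, b, sigma=1.0).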
The SVD of Mx tells us that there exists orthogonal matrices Vd×d and Um×m such that T U1 . Mx = [V1 , . . . , Vd ] . . (12) Um = [V1 , . . . , Vs ] (1 , . . . , m ) d 0. iag 1 , . . . , s Here, the d × m matrix = For 0 0 any j Nm , we use the notation Uj = (U1j , . . . , Umj ) and T j = (1 U1j , 2 U2j , . . . , s Usj ) Rs . From now on we also denote V = [V1 , . . . , Vs ] (13) Hence, we have, for any j Nm , that xj = V j . We are now ready to specify the representation of Mx from the above SVD of Mx , To see this, for any l Ns and j T T Nm , let VlT = (0, VlT ), j = (0, j ). In addition, we introduce the (d + 1) × (s + 1) matrix 1 =e ( 0 V= V V 14) 1, 1, . . . , s 0V which induces a one-to-one mapping V : Rs+1 Rd+1 defined, for any Rs+1 , by x = V Rd+1 since column vectors in V are orthogonal to each other. Consequently, it follows that Mx = V [e1 , 1 , . . . , m ], where e1 is the standard first coordinate basis in Rs+1 . Equivalently, for any i, j Nm , e1 = V e1 , xj = V j . (15) We now assemble all material to state the reduced system associated with equation (11). For this purpose, firstly we define the kernel K, for any x, t X , by K(x, t) := V T K(x, t)V , and introduce the m(s + 1) × m(s + 1) matrix by catenating all (s + 1) × (s + 1) matrices K(xi , xj ): K i Kx = (xi , xj ) ,j N . (16) m Secondly, for any j Nm , set Bj = Nm wij (e1 + i - i j )(e1 + i - j )T and Yj = Nm wij yi (e1 + i - j ). Thirdly, associated with the system (10), for any j Nm and j Rs+1 , we define the system in reduced low dimensional space Rs+1 l K(xj , xl )l = Yj . m2 j + Bj (17) Nm i Finally, in analogy with the notation Y, the column vector Y Rm(s+1) is defined by successively catenating column vectors {Yi Rs+1 : i Nm }. Likewise we can define z by catenating column vectors {j,z Rs+1 : j Nm }. With the above preparation we have the following result. Theorem 5 If the {j,z Rs+1 : j Nm } is the solution of system (17), i.e., m B K 2 Y= Im(s+1) + diag 1 , . . . , Bm x , (18) then the coefficient Cz defined, for any j Nm , by cj,z = V j,z is one of the solution of system (11), and thus yields - representation coefficients of the solution f z for the LSMGL algorithm (3). Proof: Let z is the solution of system (17). Since V is orthogonal, the system (17) is equivalent to the following equation l K(xj , xl )lz = V Yj . m2 V j z + V Bj Nm Recall, for any j Nm , that Bj = V Bj V T , V e1 e1 V T = e1 eT , V T V = Is+1 , and Yj = V Yj . Hence, the above sys1 tem is identical to system (11) (i.e. system (10)) with Cj replaced by V j,z which completes the assertion. We end this subsection with a brief discussion of the solution of the SVMMGL algorithm. Since the hinge loss is not differentiable, we cannot use the above techniques to derive an optimization algorithm for SVMMGL. Instead, we can consider its dual problem. To this end, we introduce slack variables {ij : i, j Nm } and rewrite SVMMGL as follows: s 1i arg min wij ij + - 2 fK - m2 f ,,b ,j Nm .t. yi ( - , Kxj (e1 + xi - xj ) K + b) 1 - ij , f ij 0, i, j Nm . (19) Given a multi-task kernel K produced by scalar kernel G, > 0 and inputs {(xi , yi ) : i Nm } 1. Compute projection mapping V and reduced vector j from equations (12), (13), and (14). 2. Compute K(xj , xi ) = V T K(xj , xi )V (i.e. equation (16)) 3. Solving equation (18) to get coefficient z , see Theorem 5 (equivalently, equation (25) when G is linear or RBF kernel, see Theorem 6). j - V 4. Output vector-valued function: fz (·) = Nm K(·, xj )( j,z ). 5. 
Compute variable covariance and ranking variables using Proposition 1 in Section 6. Table 1: Pseudo-code for least square multi-task gradient learning Parallel to the derivation of the dual problem of standard SVM (e.g. [20, 23]), using Lagrangian theory we can obtain the following dual problem of SVMMGL: i 1i × g max ij - ij yi i j yi ar 4m2 ,j Nm ,j,i ,j Nm s( e1 + xi - xj )T K(xj , xj )(e1 + xi - xj ) i .t. ,j Nm yi ij = 0, 0 ij wij , i, j Nm . (20) Moreover, if the solution of dual problem is z = {ij,z : i, j Nm } then the solution of SVMMGL can be represented by - fz = 1i 2m2 yi ij,z Kxj (e1 + xi - xj ). ,j Nm In analogy with the derivation of the system (10), for any j Nm and j Rs+1 , we know that the representer coefficients of the LSMGL algorithm (21) in reduced low dimensional space Rs+1 satisfy that l m2 j + Bj K(j , l )l = Yj . (22) Nm Note that ( e1 + xi - xj )T K(xj , xj )(e1 + xi - xj ( ) i,j ),(i ,j ) is a scalar kernel matrix with double indices (i, j ) and (i , j ). Then, when the number of samples is small the dual problem of SVMMGL can be efficiently solved by quadratic pro2 gramming with Rm . 4.2 Further low dimensional formulation Consider the multi-task kernel K defined by equation (6) with scalar kernel G. In this section, by further specifying G we show that LSMGL algorithm (4) with input/output {(xi , yi ) : i Nm } can be reduced to its low dimensional formulation with input/output {(i , yi ) : i Nm }, where j is defined by equation (12). This clarification will provide a more computationally efficient algorithm. To this end, consider the scalar kernel G defined on Rd × Rd . By definition of kernels (called restriction theorem in [3]), we can see that G is also a reproducing kernel on Rs × Rs . Hence, K defined by equation (6) on Rd is also a multitask kernel on the underlying space Rs ; we use the same notation K when no confusion arises. Therefore, associated with the LSMGL algorithm (4) in Rd we have an LSMGL in low dimensional input space Rs : 1 i y - = arg min- gz wij i - HK m2 g ,j . (21) 2 g1 (j ) - - (j )(i - j )T + - 2 g2 g K We are now ready to discuss the relation between representer coefficients of the LSMGL algorithm (4) and those of reduced LSMGL algorithm (21). For this purpose, let G to satisfy, for any d × s matrix V , that V T V = Is and , Rs , that G(V , V ) = G( , ). (23) There exists abundant functions G satisfying the above property. For instance, linear product kernel G(x, t) = xT t, 2 Gaussian kernel G(x, t) = e- x-t /2 , and sigmoid kernel T G(x, t) = tanh(ax t + r) with parameters a, r R. More generally, kernel G satisfies property (23) if it is produced by a radial basis function (RBF) h : (0, ) R defined, for any x, t X , by G(x, t) = h( x - t 2). (24) We say a function h : (0, ) R is complete monotone if it is smooth and, for any r > 0 and k N, (-1)k f (k) (r) 0. Here h(k) denotes the k -th derivative of h. According to the well-known Schoenberg's theorem [18], if h is complete monotone then the function G defined by equation (24) is positive semi-definite, and hence becomes a scalar kernel. -t For instance, the choice of h(t) = e 2 with standard deviation > 0 and h(t) = ( 2 + x - t 2)- with parameter > 0 yield Laplacian kernel and inverse polynomial kernel respectively. Now we are in a position to summarize the reduction theorem for multi-task kernels (6) produced by scalar kernels G satisfying (23).i Here we also use the convention that K K = (i , j ) ,j Nm . Theorem 6 Let G have the property (23) and K be defined by equation (6). 
Suppose {j,z : j Nm } are the representer coefficients of algorithm (21), i.e., z solves the equation m B K 2 Y= Im(s+1) + diag 1 , . . . , Bm z , (25) Then, the representer coefficients {cj,z : j Nm } of algorithm (4) are given by cj,z = V j,z . Proof: Suppose the multi-task kernel K is produced by G with property (23). Recall that V T V = Is with V given by w (14). Then, kernel K satisfies, for any x = V and t = V ith V , that K(x, t) = K(V i , V j ) G ( , ), (V i G(i , j ))T = V j G(i , j ), V ( i j G(i , j ))V T . and, for some parameter 0 < 1, the density function ¨ p(x) of X satisfies -Holder continuous condition, i.e., for any x, u X there holds |p(x) - p(u)| c x - u , x, u X. (27) Of course, p is a bounded function on X since it is continuous and X is compact. For instance, if the boundary of X is piecewise smooth and X is the uniform distribution over X then the marginal distribution X satisfies conditions (26) and (27) with parameter = 1. We are ready to present our statistical error analysis of LSMGL - algorithms. Recall here we used the notation f = (f , f ). Theorem 7 Suppose that the marginal distribution X sat- isfies (26) and (27). For any multi-task kernel K, let fz be - the solution of LSMGL algorithm. If f HK then there exists a constant c such that, for any m N, with the choice 1 of = s2 and s = m- 3(d+2)+4 , there holds -- E fz - f 2X cm- 3(d+2)+4 . If moreover X is a uniform distribution then, choosing = 1 s and s = m- 3(d+2)+5 , there holds -- E fz - f X 2 cm- 3(d+2+) . The proof of this theorem needs several steps which are postponed to the appendix. More accurate error rates in terms of probability inequality are possible using techniques in [17, 15]. It would also be interesting to extend this theorem to other loss functions such as the SVMMGL algorithm. Hence, it follows, for any xi , xj X and i, j Nm , that 1 T 1 = 0 0 K(xi , xj ) = K(xi , xj ) K(i , j ). 0V 0V Therefore, the system (17) is identical to the system (22). Consequently, the desired assertion follows directly from Theorem 5. Equipped with Theorem 6, the time complexity and computer memory can be further reduced by directly computing m(s + 1) × m(s + 1) matrix K instead of first computing m(d + 1) × m(d + 1) matrix Kx and then Kx in Theorem 5. Theorem 6 also gives an appealing insight into multi-task gradient learning framework. Roughly speaking, learning gradient in the high dimensional space is equivalent to learning them in the low dimensional projection space spanned by the input data. 5 Statistical error analysis In this section we give an error analysis for least square MGL algorithms. Our target is to show that the learned vectorvalued function from our algorithm statistically converges to the true function and true gradient. For the least square loss, denote by X (·) the marginal distribution on X and, for any x X , let (·|x) to be the conditional distribution on Y . Then, the target function is the regression function f minimizing the generalization error Z E (f ) = (y - f (x))2 d(x, y ). Specifically, the regression function is defined, for any x X , by Y Y f (x) = arg min (y - t)2 (y |x) = y d(y |x). tR 6 Experimental validation In this section we will only preliminarily validate the MGL algorithm (3) on the problem of variable selection and covariance measurement. By the representer Theorem 3 in Section 3, the solution of - - MGL denoted by fz = (f1,z , f2z ) = (f1,z , f2,z , . . . , fd+1,z ) j - can be rewritten as fz = Nm Kxj cj,z . 
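Continuing the illustrative sketches above, evaluating this kernel expansion at a new point yields both the estimated function value (first component) and the estimated gradient (remaining components); C and kernel_block below refer to the hypothetical helpers introduced earlier and are not the paper's notation.

```python
def evaluate_solution(x_new, X_train, C, kernel_block):
    """Evaluate f_z(x_new) = sum_j K(x_new, x_j) c_{j,z}; returns the estimated
    function value and the estimated gradient at x_new."""
    out = sum(kernel_block(x_new, X_train[j]) @ C[j] for j in range(len(C)))
    return out[0], out[1:]
```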
Since it only belongs to a vector-valued RKHS HK , we need to find a common criterion inner product (norm) ·, · r to measure each - component of the learned gradient f2z = (f2,z , . . . , fd+1,z ). Once we find the criterion inner product ·, · r, we can use the coordinate covariance f p - Cov(f2z ) = (28) p+1,z , fq +1,z r ,q Nd Hence, in this case the purpose of error analysis is to show - that solution fz of LSMGL algorithm (4) statistically con- verges to f = (f , f ) as m , s = s(m) 0 and = (m) 0. To this end, we introduce some notations and the following hypotheses are assumed to be true throughout this section. Firstly, we assume that Y [-M , M ] with M > 0. Since X is compact the diameter of X denoted by D = supx,uX x - u 2 is finite. Secondly, denote by L2X the - space of square intX ral functions f : X Rd+1 with eg - (x) 2dX (x). Finally, denote the f norm - 2X = f boundary of X by X . We assume, for some constant c > 0, that the marginal distribution satisfies that x X X : dist(x, X ) < s c s, 0 < s < D. (26) to measure how the variables covary. Also, the variable (feature) ranking can be done according to the following relative - magnitude of norm of each component of f2z : sp = ( q fp+1,z r . 2 1 /2 Nd fq +1,z r ) (29) If the scalar kernel G is a linear kernel then every component - of f2z is a constant. In this case, we can choose the standard Euclidean inner product to be the criterion inner product (norm). When the kernel G is an RBF kernel, we show in the following proposition that we can select the criterion inner product ·, · r to be the RKHS inner product ·, · G in HG . The computation is summarized in the following proposition. Proposition 1 Suppose the scalar kernel G has a feature representation and the multi-task kernel K is defined by equaj - tion (6). Then, for any solution fz = Nm Kxj cj,z HK of MGL algorithm (3), the following hold true. 1. If G is a linear kernel then the coordinate covariance is defined by - - T- i (xi , Id )ci,z cT,z (xj , Id )T . Cov(f2z ) = f2z f2z = j ,j Nm Moreover, for LSMGL algorithm the above equation can be more efficiently computed by i V - T Cov(f2z ) = V (i , Is )i j (j , Is )T T . ,j Nm in this parameter made little difference to feature ranking). The parameter s in the weight coefficients wij is set to be the median pairwise distance between inputs. In Figure 1, the result of LSMGL is shown in (b) for variable covariation and in (c) for feature selection respectively. We also ran the algorithm (3) with the choice of kernel K(x, t) = G(x, t)Id+1 ([15, 16, 17]). The results are shown in (d) and (e) of Figure 1. We see that both algorithms worked well. The LSMGL algorithm works slightly better: the reason maybe be that it captures the inherent structure of gradient learning as mentioned before. We also ran LSMGL algorithm on this dataset; the result is no essentially different from SVMMGL. In the second experiment, we use the SVMMGL algorithm for classification. For this dataset, only the first two features are relevant to the classification task. The remaining 78 redundant features are distributed according to a small Gaussian random deviate. The distribution for the first two features is shown in (f). In SVMMGL, the parameter s and are the same as those in the first example. The scalar kernel is 2 2 set to be a Gaussian G(x, t) = e- x-t /2 where is also the median pairwise distance between inputs. 
The feature selection results for the SVMMGL algorithm are illustrated respectively in (g) and (h) with different choices of multi-task kernels K given by equation (6) and K(x, t) = G(x, t)Id+1 . Both algorithms picked up the two important features. Finally, we apply our LSMGL algorithm to a well-studied expression dataset. This dataset has two classes: acute myeloid leumekia (AML) and acute lymphoblastic leukemia (ALL), see e.g. [10]. There are a total of 7129 genes (variables) and 72 patients, split into a training set of 38 examples and a test set of 34 examples. In the training set, 27 examples belong to ALL and 11 belong to AML, and the test set is composed of 20 ALL and 14 AML. Various variable selection algorithms have been applied to this dataset by choosing features based on training set, and then performing classification on the test set with the selected features. We ran LSMGL with the choice of multi-task K given by equation (6) where G is - a linear kernel. The solution fz is learned from the training set for ranking the genes according to the values of sp defined by equation (29). Then, ridge regression is run on the training set with truncated features to build a classifier to predict the labels on the test set. The regularization parameter of LSMGL is fixed to be 0.1 while the regularization parameter in ridge regression is tuned using leave-one-out crossvalidation in the training set. The test error with selected top ranked genes is reported in Table 2. The classification accuracy is quite comparable to the gradient learning algorithm using individual RKHSs [15, 17]. However, [11, 15, 17] did the recursive techniques to rank features and employed SVM for classification while our method showed that ridge regression for classification and non-recursive technique for feature ranking also worked well in this data set. It would be interesting to further explore this issue. The preliminary experiments above validated our proposed MGL algorithms. However further experiments need to be performed to evaluate our multi-task framework for gradient learning. 2. If G is a smooth RBF kernel then fp+1,z HG and the ( p - coordinate covariance Cov(f2z ) = fp+1,z , fq+1,z G ,qN d can be computed by T fp+1,z , fq+1,z G = Cz (Kpq (xi - xj ))mj =1 Cz , i, (30) where the kernel matrix Kpq (xi - xj ) defined, for any i, j Nm , by . -2 2 (pq G)(xi - xj ), (( pq G)(xi - xj ))T 2 2 -( pq G)(xi - xj ), ( 2pq G)(xi - xj ) The proof is postponed to the appendix where the computation of Kpq is also given if G is a Gaussian. We run our experiment on two artificial datasets and one gene expression dataset following [17]. In the first experiment, the target function f : Rd R with notation x = (x1 , . . . , xd ) Rd and d = 80. The output y is contaminated by a Gaussian noise y = f (x) + , N (0, y ). As depicted in Figure 1 (leftmost), the thirty inputs whose relevant features are [1, 10] [11, 20] [41, 50] are generated as follows: 1. For samples from 1 to 10, xp N (1, 0.05), for p [1, 10] and xp N (0, 0.1), for p [11, 80]. 2. For samples from 11 to 30, xp N (1, 0.05), for p [11, 20] and xp N (0, 0.1), for p [1, 10] [31, 80]. 3. For samples from 11 to 30, features are in the form of xp N (1, 0[05), for p [41, 50], and xp N (0, 0.1), for . p [1, 40] 51, 80]. We let the regression function f to be a linear function. 
Specifically, we choose a noise parameter $\sigma_y = 3$ and the regression function is defined by $f^*(x_i) = w_1^T x_i$ for $i \in [1, 10]$, $f^*(x_i) = w_2^T x_i$ for $i \in [11, 20]$, and $f^*(x_i) = w_3^T x_i$ for $i \in [21, 30]$, where $w_1^k = 2 + 0.5\sin(2\pi k/10)$ for $k \in [1, 10]$ and zero otherwise, $w_2^k = -2 - 0.5\sin(2\pi k/10)$ for $k \in [11, 20]$ and zero otherwise, and $w_3^k = -2 - 0.5\sin(2\pi k/10)$ for $k \in [41, 50]$ and zero otherwise. In this linear case, we use the kernel $G(x, t) = x^T t$ as the basic scalar kernel and the multi-task kernel $K$ defined by (6) in the LSMGL algorithm (4). As in [11, 15], the regularization parameter $\lambda$ is set to a fixed number such as 0.1 (variation in this parameter made little difference to feature ranking).

[Figure 1: LSMGL and SVMMGL feature ranking. Panels: (a) data matrix; (b) coordinate covariance for LSMGL; (c) LSMGL feature ranks (relative norm per coordinate); (d) coordinate covariance and (e) feature ranks for the diagonal-kernel variant; (f) the first two features of the classification data; (g)-(h) relative RKHS norms per coordinate for SVMMGL with the two choices of multi-task kernel.]

genes       10   40   80   100   200   500   1000   2000   3000   4000   6000   7129
test error   2    1    0     0     1     1      2      1      1      1      1      1

Table 2: Number of test errors of the ridge regression classifier versus the number of top-ranked genes selected by the LSMGL algorithm.

References

[1] R. K. Ando & T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Machine Learning Research, 1817-1853, 2005.
[2] A. Argyriou, T. Evgeniou, & M. Pontil. Multi-task feature learning. NIPS, 2006.
[3] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc. 68: 337-404, 1950.
[4] P. L. Bartlett & S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. J. of Machine Learning Research, 3: 463-482, 2002.
[5] S. Ben-David & R. Schuller. Exploiting task relatedness for multiple task learning. COLT, 2003.
[6] D. R. Chen, Q. Wu, Y. Ying, & D. X. Zhou. Support vector machine soft margin classifiers: error analysis. J. of Machine Learning Research, 5: 1143-1175, 2004.
[7] A. Caponnetto, C. A. Micchelli, M. Pontil, & Y. Ying. Universal multi-task kernels. Preprint, 2007.
[8] S. S. Chen, D. L. Donoho, & M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. of Scientific Computing, 20: 33-61, 1999.
[9] F. Cucker & S. Smale. On the mathematical foundations of learning. Bull. Amer. Math. Soc. 39: 1-49, 2001.
[10] T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286: 531-537, 1999.
[11] I. Guyon, J. Weston, S. Barnhill, & V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46: 389-422, 2002.
[12] V. I. Koltchinskii & D. Panchenko. Rademacher processes and bounding the risk of function learning. In J.

7 Conclusions

In this paper, our main contribution was to provide a novel unifying framework for gradient learning from the perspective of multi-task learning.
Various variable selection methods in the literature can be recovered by the choice of multitask kernels. More importantly, this framework allows us to introduce a novel choice of multi-task kernel to capture the inherent structure of gradient learning. An appealing representer theorem was presented which facilitates the design of efficient optimization algorithms, especially for datasets with high dimension and few training examples. Finally, a statistical error analysis was provided to ensure the convergence of the learned function to true function and true gradient. Here we only preliminarily validated the method. A more extensive benchmark study remains to be pursued. In future we will explore more experiments on biomedical datasets and compare our MGL algorithms with previous related methods for feature selection, such as those in [21, 22] etc. It will be interesting to implement different loss functions in the MGL algorithms for regression and classification, apply the spectral decomposition of the gradient outer products to dimension reduction (see e.g. [16]), and possible use for network inference from the covariance of the learned gradient Wellner E. Gine, D. Mason, editor, High Dimensional Probability II, pages 443459, 2000. [13] T. Evgeniou, C. A. Micchelli & M. Pontil. Learning multiple tasks with kernel methods. J. Machine Learning Research, 6: 615­637, 2005. [14] C. A. Micchelli & M. Pontil. On learning vector-valued functions. Neural Computation, 17: 177-204, 2005. [15] S. Mukherjee & Q. Wu. Estimation of gradients and coordinate covariation in classification. J. of Machine Learning Research 7: 2481-2514, 2006. [16] S. Mukherjee, Q. Wu, & D. X. Zhou. Learning gradients and feature selection on manifolds. Preprint, 2007. [17] S. Mukherjee & D. X. Zhou. Learning coordinate covariances via gradients, J. of Machine Learning Research 7: 519-549, 2006. [18] I. J. Schoenberg. Metric spaces and completely monotone functions, Ann. of Math. 39: 811-841, 1938. ¨ [19] B. Scholkopf & A. J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, USA, 2002. [20] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, 2004. [21] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B. 58: 267-288, 1996. [22] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, & V. Vapnik. Feature selection for SVMs, NIPS, 2001. [23] V. N. Vapnik. Statistical learning theory. Wiley, New York, 1998. (x), (p q )(xj ) 2. Denote cj by (c1 , . . . , cd+1 )T . Conj j sequently, for any p Nd , fp+1,z (·) = wp , (·) 2 with q j q +1 1 wp = - ). ThereNm (p (xj )cj + Nd q p (xj )cj fore, for any p, q Nd we have that fp+1,z HG and i fp+1,z , fq+1,z G = wp , wq 2 = cT Kpq (xi - xj )cj i ,j Nm C which completes the assertion. omputation of kernel K. If G is a Gaussian kernel with standard variation , that is, x-t 2 for any x, t Rd , G(x, t) = G(x - t) = e- 2 , the computation of K xs listed as follows. i G 2 1. (pq G)(x) = p xq 2 - pq (x) x +x q 2. For any q Nd , q p q G(x) = [ q pq p2 q q pq - xp xq xq ]G(x). Hence, 3 e G xpq xp xq x p xq + eq xp 2 pq G(x) = + 2- (x) 2 3 2 3. For any p , q Nd , p q pq G(x) = G(x) pq 2qp + x + pp qq +p q pq 1 - 3 q xq pp + xp xq p q + xp xq p q 2 xp xq xq . x x +x +x p - q pq q 2pq p qq Hence, 3 +x Appendix Let G be a scalar kernel, we use the convention p G to denote the p-th partial derivative of G with respect to the second argument, and so is the gradient (2) G. 
Proof of Proposition 1. When G is a linear kernel, by the definition (6) of multij - task kernel K, we have that f2z = Nm (xj , Id )cj,z which implies that - --T i C ov (f2z ) = f2z f2z = (xi , Id )ci,z cT,z (xj , Id )T . j ,j Nm (2) eT ep +eT eq +pq Id p 2 2pq G(x) = G(x) q 2 -1 x T T - 3 q (ep x + xep ) + xp (xeT + eq xT ) xx .q xp xq Id xT + x3 p q - pq 3 Proof of Theorem 7 We turn our attention to the proof of Theorem 7. We begin with some notations and background materials. First denote by Erz the empirical loss in LSMGL algorithm, i.e., i - 1 Erz ( f ) = m2 ,j w (xi - xj ) - ×(yi - f1 (xj ) - f2 (xj )(xi - xj )T )2 , and the modified form of its expectation y ZX - Er( f ) = w(x - u) - f1 (u) 2 - - f2 (u)(x - u) d(x, y )dX (u). Since the Gaussian weight w(x - u) = ws (x - u) is de- pendent on s, the above definition of Er( f ) is depending on thefpaL meter s. In addition, define the Lipschitz constant ra f (x + ip to be the minimum constant c such that u) - f (x) c u , f xLu X. We say that f is , Lipschitz continuous if ip is finite. The error analysis here is divided into two main steps motivated by the techniques in [15]. The first step is to bound the - - - square error - - f 2X by the excess error Er(fz )-Er(f ). fz In the second step, we employ standard error decomposition [6] and Rademacher complexities [4, 12] to estimate the excess error. These two steps will be respectively stated in the For the LSMGL algorithm, in Section 4 we showed that cj,z = V j,z and, for any j Nm , xj = V j , the above equation can be further simplified to the following: i V - T (i , Is )i,z j,z (j , Is )T T . Cov(f2z ) = V ,j Nm When G is an RBF kernel, for any x, t Rd and p, q Nd , G(x, t) = G(x - t) and tq xp G(x, t) = -(p q G)(x, t). Hence, for any p Nd and x Rd , we have that j - (2) c ( fp+1,z (x) = p G(x, xj ), - (2) p2) G(x, xj ) j Nm Since G has a feature representation, i.e., for any x, t X , there holds that G(x, t) = (x), (t) 2. Also, observe that (2) (2) (2) p G(x, xj ) = (x), (p )(xj ) 2 and q p G(x, xj ) = following two propositions. Before we do that, we introduce an auxiliary functional Qs defined by f w -- Qs ( f , f ) = (x - u) (u) - f1 (u) 2 - +( f2 (u) - f (u))(u - x)T dX (x)dX (u). We are ready to present the first step of the error analysis: - bounding the square error - - f 2X by the excess error fz - - Er(fz ) - Er(f ) which is stated as the following proposition. Proposition 2 If 0 < s, < c such that - - E fz - f 2X c + E s -2 fz K 1 then there exists a constant ms × in - , max p-1 (x) E -xX - E r(fz ) - Er(f ) f2 + . - + f Lip and XX -- Qs (fz , f ) = w(x - u) (- 2 - × f (u) - fz (u))(e1 + u - x) dX (x)dX (u) u (- 2 X - s f (u) - fz (u))(e1 + u - x) -x s d ×dX (x) X (u) u (- 2 X - c (s) s f (u) - fz (u))(e1 + u - x) -xd s ×dx X (u). (34) The integral w.r.t. x on the right-hand side of the above in- - - equality can be written as (f (u) - fz (u))W (s)(f (u) - - fz (u))T with (d + 1) × (d + 1) matrix W (s) defined by u ( d W (s) = e1 + u - x)(e1 + u - x)T x. -x s The proof of this proposition follows directly from the following Lemmas 8 and 9. For this purpose, let the subset Xs of X be u ( Xs = X : dist(u, X ) > s, |p(u)| (1 + c )s 31) and cp (s) := min{p(x) : u - x s, u Xs }. Recall that e1 is the first coordinate basis in Rd+1 and, for any x Rd , xT = (0, x)T Rd+1 . Lemma 8 that + E If 0 < s < 1 then there exists a constant c such - sE - fz - f 2X c [ - 2 ] + - fz K f Q . 
s- E -- min , max p-1 (x) s ( fz , f ) xX Here, (e1 + u - x)(e1 + u - x)T equals that O 1 (u - x)T u - x (u - x)(u - x)T u t t2 bserve that w(x - u)dx = s-2 e- 2 dt -x s 1 u and w(x - u)(u - x)dx = 0. In addition, for any -x s u p = q Nd , w(x - u)(up - xp )(uq - xq )dx = 0 -x s u t t2 w(x - u)(xp - up )2 dx = e- 2 (tp )2 dt -x s 1 From the above observations, there exists a constant c such that - - - - - (f (u)-fz (u))W (s)(f (u)-fz (u))T c - (u)-fz (u) 2. f Recalling the definition of W (s) and substituting this back into equation (34) implies, for any 0 < s < 1, that X - -- c c (s) - (u) - f (u) 2dX (u) Qs (fz , f ). fz s - Proof: Write - - f X by fz 2 X - - - - f 2X = \Xs - (u) - f (u) 2dX (u) fz fz X - + s - (u) - f (u) 2dX (u) fz Plugging this into equation (34), the desired estimate follows from the estimation of c (s), i.e., equation (33). Now we can bound Qs by the following lemma. Lemma 9 If 0 < s < 1 then there exists a constant c such - that, for any f HK , the following equations hold true. s f2 E- -- - . 1. Qs ( f , f ) c 2 r( f ) - Er(f ) Lip + s . f2 - - -- 2. Er( f ) - Er(f ) c 2 Lip + Qs ( f , f ) y 2 y - Proof: Observe that - f1 (u) - f2 (u)(x - u)T = - 2 y f (u) - f (u)(x - u)T + 2 - f (u) - f (u)(x - f + - u)T (u) - f1 (u) + ( f2 (u) - f (u))(u - x)T f (u) - 2 - T f1 (u) + ( f2 (u) - f (u))(u - x) . Then, taking the integral of both sides of the above equality and using the fact Y that f (x) = y dX (x) we have that f XX - - -- Er( f ) - Er(f ) = Qs ( f , f f + 2 ) w(x - u) (x) -f (u) - f (u)(x - u)T (u) - f1 (u)+ d - ( f2 (u) - f (u))(u X x)T X (x)dX (u) - f X -- Qs ( f , f ) - 2 w(x - u) (x) - f (u) 1Q 1 2 2 -- 2 - f (u)(x - u)T dX (x)dX (u) . s ( f , f )| (32) By the definition of Xs , we have that X (X \Xs ) c s + c (1 + c )|X |s c s where |X | is the Lebesgue measure of X . Hence, the first term of the above equation is bounded by 2c ( - + - )s 2c ( - K + - )s . fz 2 f 2 fz 2 f 2 For the second term on the right-hand side of equation (32), observe that, for any u Xs , dist(u, X ) > s and {u : u - x s, x Xs } X . Moreover, for any x X such that u - x s, by the definition of Xs there holds p(x) = p(u)-(p(u)-p(x)) (1+c )s -c u-x Consequently, it follows that cp (s) max(s , min p(x)), xX s . (33) Applying the inequality, for any a, b > 0, that -2a2 - 1 b2 2 -2ab, from the above equality we further have that XX - - -- w(x - u) Er( f ) - Er(f ) 1 Qs ( f , f ) - 2 2 f 2 f (u)(x - u)T dX (x)dX (u). (x) - f (u) - (35) Likewise, f XX - - -- Er( f ) - Er(f ) = Qs ( f , f f + 2 ) w(x - u) (x) -f (u) - f (u)(x - u)T (u) - f1 (u)+ d - ( f2 (u) - f (u))(u X x)T X (x)dX (u) - f X -- Qs ( f , f ) + 2 w(x - u) (x) - f (u) 1Q 1 2 2 -- 2 - f (u)(x - u)T dX (x)dX (u) . s ( f , f )| Applying the inequality 2ab a2 + b2 to the above inequality yields that XX - - -- Er( f ) - Er(f ) 2Qs ( f , f ) + w(x - u) 2 f T f (u)(x - u) dX (x)dX (u) (x) - f (u) - 1 f (36) However, |f (x)-f (u)- f (u)(x-u)T | = 0 (tx+ x T 2 (1-t)u)- f (u) - u dt | f |Lip x-u and the density p(x) of X is a bounded function since we assume it ¨ is -Holder continuous and X is compact. Therefore, f 2 XX w(x - u) (x) - f (u) - f (u)(x - u)T ×dX (x)dX (u) R x2 - 2s2 1 p | f |2 ip x 4 dx d sd+2 e L cp | E- - - Proof: Note that Er(fz ) - Er(f ) + - 2 = r(fz ) - fz K +E- - - - - Erz (fz ) + Erz (f ) - Er(f ) rz (fz ) + - 2 fz K E- + . - - rz (f ) + - 2 f K Er(f ) - Er(f ) + - 2 By the f K - definition of fz , we know that the second term in parenthesis on the right-hand side of the above equation is negative. 
- - Hence, by the definition of f , we get that Er(fz ) - Er(f ) + E- . - - 2 S (z) + inf f HK r( f ) - Er(f ) + - 2 fz K fK - By the property (2) in Lemma 9( we also have, for any f , w f2 - - -- HK , that Er( f ) - Er(f ) c s2 Lip + Qs ( f , f ) hich implies that . s f2 - - +A(, s) inf {Er( f )-Er(f )+ - 2 } c 2 fK Lip This completes the proposition. Now it suffice to estimate the sample error S (z). To this - end, observe that Erz (fz ) + - 2 Erz (0) + 0 2 fz K K 2 M - K M s-(d+2)/2 . Likewise, fz d+2 which implies that s 2 - Er(f ) + - K Er(0) + 0 2 - K sM 2 which f 2 d+ K f tells us that - K M s-(d+2)/2 . Using these bounds on fz - K + - K, we can use the Rademacher averages (see fz f e.g. [4, 12]) for its definition and properties) to get the following estimation for the sample error. Lemma 10 For any 0 < < 1, there exists a constant c such that, for any m N, there holds . 1 S 1 E (z) c 2(d+2) + s m s3(d+2)/2 m Since the proof of the above lemma is rather a standard approach and indeed parallel to the proof of Lemma 26 (replacing r = M s-(d+2)/2 there) in the appendix of [15], for the simplicity we omit the details here. We have assembled all the materials to prove Theorem 7. Proof of Theorem 7 - Since we assume that f HK , for any p f (x + u) - - =e - - - p f (u) p+1 , f (x+u)- f (u) = f , Kx+u ep+1 - e K Ku ep+1 K - K T+1 (x + u, x + u) + K(u, u) - f p 1 e 2 K(x + u, u) - K(u, x + u) p+1 c - K u , and hence f | f |Lip c - K. Moreover, f -- A(, s) Q(f , f ) + - 2 = - 2 . f K f K Hence, we know from Propos. tion 3 and equation (37) that i - 2 S (z) + c s2 + fz K Combining the above equations with Propositions 2 and 3, there exists a constant c such that -- ×m E fz - f X 2 c in(s- , maxxX p-1 (x)) + s . E +2 + S (z) s + s If we choose = s2 and s = m- 3(d+2)+4 yields the first assertion. If X is the uniform distribution over X , then we have that min(s- , min p(x)) = 1. Hence, choosing = s and xX 1 1 f |2 ip s2 . L Putting this into Equations (35) and (36) and arranging the terms involved yields the desired result. - From Property (1) of Lemma 9, for any f HK we have that - - Er( f ) - Er(f ) -cs2 | f |2 ip . (37) L We now turn our attention to the second step of the error - - analysis: the estimation of the excess error Er(fz ) - Er(f ) + - 2 . To do this, let fz K E B - - f = arg inf r( f ) + - 2 fK - f HK y the error decomposition technique in [6], we get the following estimation. - Proposition 3 If f is defined above then there exists a constant c such that - - Er(fz ) - Er(f ) + - 2 S (z) f s2 f 2z K , +c Lip + A(, s) where - - - - S (z) = Er(fz ) - Erz (fz ) + Erz (f ) - Er(f ) is referred to the sample error and Q i -- A(, s) = inf f2 s ( f , f ) + - K - f HK s called the approximation error. s = m- 3(d+2)+5 we have the desired second assertion. This completes the theorem.