Sparse Recovery in Large Ensembles of Kernel Machines

Vladimir Koltchinskii
School of Mathematics, Georgia Institute of Technology
vlad@math.gatech.edu

Ming Yuan
School of Industrial and Systems Engineering, Georgia Institute of Technology
myuan@iyse.gatech.edu

(Both authors were partially supported by NSF grant DMS-0624841.)

Abstract

A problem of learning a prediction rule that is approximated in the linear span of a large number of reproducing kernel Hilbert spaces is considered. The method is based on penalized empirical risk minimization with an $\ell_1$-type complexity penalty. Oracle inequalities on the excess risk of such estimators are proved, showing that the method is adaptive to the unknown degree of "sparsity" of the target function.

1 Introduction

Let $(X,Y)$ be a random couple in $S\times T$, where $(S,\mathcal S)$, $(T,\mathcal T)$ are measurable spaces. Usually, $T$ is either a finite set, or a subset of $\mathbb R$ (in the first case, $T$ can also be identified with a finite subset of $\mathbb R$). Most often, $S$ is a compact domain in a finite dimensional Euclidean space, or a compact manifold. Let $P$ denote the distribution of $(X,Y)$ and $\Pi$ the distribution of $X$. In a general framework of prediction, $X$ is an observable instance and $Y$ is an unobservable label which is to be predicted based on an observation of $X$.

Let $\ell: T\times\mathbb R\to\mathbb R_+$ be a loss function. It will be assumed in what follows that, for all $y\in T$, the function $\ell(y,\cdot)$ is convex. Given $f:S\to\mathbb R$, denote $(\ell\bullet f)(x,y):=\ell(y,f(x))$ and define the (true) risk of $f$ as $\mathbb E\,\ell(Y,f(X))=P(\ell\bullet f)$. The prediction problem can then be formulated as a convex risk minimization problem with the optimal prediction rule $f_*$ defined as
\[
f_* := \operatorname{argmin}_{f:S\to\mathbb R} P(\ell\bullet f),
\]
where the minimum is taken over all measurable functions $f:S\to\mathbb R$. It will be assumed in what follows that $f_*$ exists and is uniformly bounded. We shall also assume the uniqueness of $f_*$ in the following discussion.

In the case when the distribution $P$ of $(X,Y)$ is unknown, it has to be estimated based on the training data, which (in the simplest case) consists of $n$ independent copies $(X_1,Y_1),\dots,(X_n,Y_n)$ of $(X,Y)$. Let $P_n$ denote the empirical distribution based on the training data. Then the risk $P(\ell\bullet f)$ can be estimated by the empirical risk
\[
n^{-1}\sum_{j=1}^n \ell(Y_j,f(X_j)) = P_n(\ell\bullet f).
\]
The direct minimization of the empirical risk over a large enough family of functions $f:S\to\mathbb R$ almost inevitably leads to overfitting. To avoid it, a proper complexity regularization is needed.

In this paper, we study a problem in which the unknown target function $f_*$ is approximated in the linear span $H$ of a large dictionary consisting of $N$ reproducing kernel Hilbert spaces (RKHS) $H_1,\dots,H_N$. It will be assumed that we are given $N$ symmetric nonnegatively definite kernels $K_j:S\times S\to\mathbb R$, $j=1,\dots,N$, and that $H_j$ is the RKHS generated by $K_j$: $H_j=H_{K_j}$. Suppose, for simplicity, that $K_j(x,x)\le 1$, $x\in S$, $j=1,\dots,N$. The space
\[
H := \mathrm{l.s.}\Bigl(\bigcup_{j=1}^N H_j\Bigr)
\]
consists of all functions $f:S\to\mathbb R$ that have the following (possibly, non-unique) additive representation
\[
f = f_1+\dots+f_N,\qquad f_j\in H_j,\ j=1,\dots,N,
\]
and it can be naturally equipped with the 1-norm
\[
\|f\|_1 := \|f\|_{1}(H) := \inf\Bigl\{\sum_{j=1}^N \|f_j\|_{H_j} :\ f=\sum_{j=1}^N f_j,\ f_j\in H_j,\ j=1,\dots,N\Bigr\}.
\]
Additive models are a well-known special case of this formulation. In additive models, $S$ is a subset of $\mathbb R^N$, i.e., $X=(x^{1},\dots,x^{N})'$, and $H_j$ represents a functional space defined over $x^{j}$.
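As a concrete illustration of this setup, the following minimal sketch (hypothetical data and Gaussian coordinate kernels, as in the additive-model special case; all helper names are ours and not part of the paper) builds the Gram matrices of a dictionary on a sample and evaluates $\sum_{j}\|f_j\|_{H_j}$ for one particular additive representation with $f_j$ in the span of $\{K_j(X_i,\cdot)\}$. Since $\|f\|_1$ is an infimum over representations, the computed sum is only an upper bound on it.

import numpy as np

def gaussian_kernel_1d(u, v, bandwidth=1.0):
    # Gaussian kernel on a single coordinate; K(x, x) <= 1 as assumed in the text.
    d = u[:, None] - v[None, :]
    return np.exp(-d ** 2 / (2 * bandwidth ** 2))

rng = np.random.default_rng(0)
n, N = 50, 10                      # n observations, N coordinates / dictionary members
X = rng.uniform(-1, 1, size=(n, N))

# Gram matrices K_j of the dictionary kernels on the sample (additive-model case).
grams = [gaussian_kernel_1d(X[:, j], X[:, j]) for j in range(N)]

# One function in the linear span: f_j(.) = sum_i c_{ij} K_j(X_i, .), random coefficients.
coefs = [rng.normal(scale=0.1, size=n) for _ in range(N)]

# Component RKHS norms ||f_j||_{H_j} = sqrt(c_j' K_j c_j) for this representation.
norms = [np.sqrt(c @ Kj @ c) for c, Kj in zip(coefs, grams)]
one_norm_upper_bound = sum(norms)   # an admissible value of the sum; ||f||_1 is the infimum
print(one_norm_upper_bound)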
Several approaches have been proposed recently to exploit sparsity in additive models (Lin and Zhang, 2006; Ravikumar et al., 2007; Yuan, 2007). In this paper, we consider an extension of the $\ell_1$-penalization technique to a more general class of problems. In particular, we study the following penalized empirical risk minimization problem:
\[
\hat f := \operatorname{argmin}_{f\in H}\Bigl[P_n(\ell\bullet f) + \varepsilon\|f\|_1\Bigr],\tag{1.1}
\]
where $\varepsilon>0$ is a small regularization parameter. Equivalently, this problem can be written as
\[
(\hat f_1,\dots,\hat f_N) := \operatorname{argmin}_{f_j\in H_j,\,j=1,\dots,N}\Bigl[P_n\bigl(\ell\bullet(f_1+\dots+f_N)\bigr) + \varepsilon\sum_{j=1}^N\|f_j\|_{H_j}\Bigr].\tag{1.2}
\]
The choice of the 1-norm for complexity penalization is related to our interest in the case when the total number $N$ of spaces $H_j$ in the dictionary is very large, but the target function $f_*$ can be approximated reasonably well by functions from a relatively small number $d$ of such spaces. The $\ell_1$-penalization technique has been commonly used to recover sparse solutions in the case of simple dictionaries that consist of one-dimensional spaces $H_j$ (see, e.g., Koltchinskii (2008) and references therein). The goal is to extend this methodology to a more general class of problems that includes aggregation of large ensembles of kernel machines and sparse additive models. In the case of additive models with the quadratic loss, (1.1) becomes the so-called COSSO estimate recently introduced by Lin and Zhang (2006).

For $f\in H$, define the excess risk of $f$ as
\[
\mathcal E(f) = P(\ell\bullet f) - P(\ell\bullet f_*) = P(\ell\bullet f) - \inf_{g:S\to\mathbb R} P(\ell\bullet g).
\]
Our main goal is to control the excess risk $\mathcal E(\hat f)$ of $\hat f$. Throughout the paper, we shall also make the assumption that $n^{\gamma}\le N\le e^{n}$ for some $\gamma>0$.

It will also be assumed that the loss function $\ell$ satisfies the following properties: for all $y\in T$, $\ell(y,\cdot)$ is twice differentiable, $\ell''_u$ is a uniformly bounded function on $T\times\mathbb R$,
\[
\sup_{y\in T}\ell(y;0)<+\infty,\qquad \sup_{y\in T}|\ell'_u(y;0)|<+\infty,
\]
and
\[
\tau(R) := \frac12\inf_{y\in T}\inf_{|u|\le R}\ell''_u(y,u) > 0,\qquad R>0.\tag{1.4}
\]
We also assume without loss of generality that, for all $R$, $\tau(R)\le 1$. These assumptions imply that
\[
|\ell'_u(y,u)|\le L_1 + L|u|,\qquad y\in T,\ u\in\mathbb R,
\]
with some constants $L_1,L\ge 0$ (if $\ell'_u$ is uniformly bounded, one can take $L=0$). The following bound on the excess risk holds under the assumptions on the loss:
\[
\tau\bigl(\|f\|_\infty\vee\|f_*\|_\infty\bigr)\,\|f-f_*\|_{L_2(\Pi)}^2 \le \mathcal E(f) \le C\,\|f-f_*\|_{L_2(\Pi)}^2\tag{1.5}
\]
with a constant $C>0$ depending only on $\ell$. This bound easily follows from a simple argument based on Taylor expansion, and it will be used later in the paper.

The quadratic loss $\ell(y,u):=(y-u)^2$ in the case when $T\subset\mathbb R$ is a bounded set is one of the main examples of such loss functions. In this case, $\tau(R)=1$ for all $R$. In regression problems with a bounded response variable, more general loss functions of the form $\ell(y,u):=\phi(y-u)$ can also be used, where $\phi$ is an even nonnegative convex twice continuously differentiable function with $\phi''$ uniformly bounded in $\mathbb R$, $\phi(0)=0$ and $\phi''(u)>0$, $u\in\mathbb R$. In classification problems, the loss function of the form $\ell(y,u)=\phi(yu)$ is commonly used, with $\phi$ being a nonnegative decreasing convex twice continuously differentiable function such that, again, $\phi''$ is uniformly bounded in $\mathbb R$ and $\phi''(u)>0$, $u\in\mathbb R$. The loss function $\phi(u)=\log_2(1+e^{-u})$ (often referred to as the logit loss) is a specific example.

We will assume in what follows that $H$ is dense in $L_2(\Pi)$, which, together with (1.5), implies that
\[
\inf_{f\in H}P(\ell\bullet f) = \inf_{f\in L_2(\Pi)}P(\ell\bullet f) = P(\ell\bullet f_*).
\]

According to the representer theorem (Wahba, 1990), the components of the minimizer have the representation
\[
\hat f_j(x) = \sum_{i=1}^n \hat c_{ij} K_j(X_i,x)
\]
for some real vector $\hat c_j=(\hat c_{ij}: i=1,\dots,n)$. In other words, (1.2) can be rewritten as a finite dimensional convex minimization problem over $(c_{ij}: i=1,\dots,n;\ j=1,\dots,N)$.
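Since $\|f_j\|_{H_j}=\sqrt{c_j'K_jc_j}$ with $K_j=(K_j(X_i,X_l))_{i,l}$, for the quadratic loss this finite dimensional problem is of group-lasso type in the coefficients. The sketch below is one possible way to solve it, using a plain proximal-gradient iteration after the reparametrization $b_j=K_j^{1/2}c_j$; it is our own illustration under these assumptions, not the algorithm used in the paper, and the function names are hypothetical.

import numpy as np

def kernel_sqrt(K, tol=1e-10):
    # Symmetric square root R of a PSD Gram matrix, K = R @ R (so R = R').
    w, V = np.linalg.eigh(K)
    w = np.clip(w, 0.0, None)
    return (V * np.sqrt(w)) @ V.T

def sparse_kernel_ensemble_ls(grams, y, eps, n_iter=2000):
    # Proximal-gradient sketch of (1.2) for the quadratic loss.
    # Reparametrize f_j = K_j c_j = R_j b_j with R_j = K_j^{1/2}, so that
    # ||f_j||_{H_j} = sqrt(c_j' K_j c_j) = ||b_j||_2 and (1.2) becomes a group lasso:
    #     min_b (1/2n) || y - sum_j R_j b_j ||^2 + eps * sum_j ||b_j||_2 .
    n, N = len(y), len(grams)
    R = [kernel_sqrt(K) for K in grams]
    Z = np.hstack(R)                          # n x (nN) design
    L = np.linalg.norm(Z, 2) ** 2 / n         # Lipschitz constant of the smooth part
    step = 1.0 / L
    b = np.zeros(n * N)
    for _ in range(n_iter):
        grad = Z.T @ (Z @ b - y) / n          # gradient of the quadratic term
        b = b - step * grad
        for j in range(N):                    # group soft-thresholding, one block per kernel
            blk = slice(j * n, (j + 1) * n)
            nrm = np.linalg.norm(b[blk])
            b[blk] *= max(0.0, 1.0 - step * eps / nrm) if nrm > 0 else 0.0
        # (a stopping criterion on the objective would normally go here)
    # Fitted components evaluated at the data points and their RKHS norms.
    fitted = [R[j] @ b[j * n:(j + 1) * n] for j in range(N)]
    norms = [np.linalg.norm(b[j * n:(j + 1) * n]) for j in range(N)]
    return fitted, norms

The reparametrization turns the penalty into a plain Euclidean group norm, so the standard group soft-thresholding step applies; any other group-lasso solver could be substituted here.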
It is known (see, e.g., Micchelli and Pontil, 2005) that, for $f\in H$,
\[
\|f\|_1(H) = \inf\bigl\{\|f\|_K : K\in\operatorname{conv}\{K_j: j=1,\dots,N\}\bigr\},
\]
where $\|\cdot\|_K$ denotes the RKHS-norm generated by a symmetric nonnegatively definite kernel $K$ and
\[
\operatorname{conv}\{K_j: j=1,\dots,N\} := \Bigl\{\sum_{j=1}^N c_j K_j:\ c_j\ge 0,\ \sum_{j=1}^N c_j = 1\Bigr\}.
\]
Therefore (1.2) can also be written as
\[
(\hat f,\hat K) := \operatorname{argmin}_{K\in\operatorname{conv}\{K_j:\,j=1,\dots,N\}}\ \operatorname{argmin}_{f\in H_K}\Bigl[P_n(\ell\bullet f) + \varepsilon\|f\|_K\Bigr],\tag{1.3}
\]
leading to an interpretation of the problem as one of learning not only the target function $f_*$, but also the kernel $K$ in the convex hull of a given dictionary of kernels (which can be viewed as "aggregation" of kernel machines). Similar problems have been studied recently by Bousquet and Herrmann (2003), Crammer et al. (2003), Lanckriet et al. (2004), Micchelli and Pontil (2005) and Srebro and Ben-David (2006), among others.

2 Bounding the $\ell_1$-norm

Our first goal is to derive upper bounds on $\|\hat f\|_1$ that hold with high probability. In what follows we use the notation $(\ell'\bullet f)(x,y):=\ell'_u(y,f(x))$, where $\ell'_u(y,u)$ is the derivative of $\ell$ with respect to the second variable.

We also need several basic facts about RKHS, which can be found in, for example, Wahba (1990). Let $K$ be a symmetric nonnegatively definite kernel on $S\times S$ with $\sup_{x\in S}K(x,x)\le 1$ and let $H_K$ be the corresponding RKHS. Given a probability measure $\Pi$ on $S$, let $\phi_k$, $k\ge 1$, be the orthonormal system of functions in $L_2(\Pi)$ such that the following spectral representation (as in Mercer's theorem) holds:
\[
K(x,y) = \sum_{k\ge 1}\lambda_k\phi_k(x)\phi_k(y),\qquad x,y\in S,
\]
which is true under mild regularity conditions. Without loss of generality we can and do assume that $\{\lambda_k\}$ is a decreasing sequence, $\lambda_k\ge 0$. It is well known that, for $f,g\in H_K$,
\[
\langle f,g\rangle_{H_K} = \sum_{k\ge 1}\frac{\langle f,\phi_k\rangle_{L_2(\Pi)}\langle g,\phi_k\rangle_{L_2(\Pi)}}{\lambda_k}.
\]
Denote by $H_D\subset H_K$ the linear span of functions $f\in H_K$ such that
\[
\sum_{k\ge 1}\frac{\langle f,\phi_k\rangle_{L_2(\Pi)}^2}{\lambda_k^2} < \infty,
\]
and let $D:H_D\to L_2(\Pi)$ be the linear operator defined as follows:
\[
Df := \sum_{k\ge 1}\frac{\langle f,\phi_k\rangle_{L_2(\Pi)}}{\lambda_k}\,\phi_k,\qquad f\in H_D.
\]
Then we obviously have
\[
\langle f,g\rangle_{H_K} = \langle Df,g\rangle_{L_2(\Pi)},\qquad f\in H_D,\ g\in H_K.
\]
Given a dictionary $\{H_1,\dots,H_N\}$ of RKHS, one can quite similarly define spectral representations of the kernels $K_j$ with nonincreasing sequences of eigenvalues $\{\lambda_k^{(j)}: k\ge 1\}$ and $L_2(\Pi)$-orthonormal eigenfunctions $\{\phi_k^{(j)}: k\ge 1\}$. This also defines spaces $H_{D_j}$ and linear operators $D_j: H_{D_j}\to L_2(\Pi)$ such that
\[
\langle f,g\rangle_{H_j} = \langle D_j f, g\rangle_{L_2(\Pi)},\qquad f\in H_{D_j},\ g\in H_{K_j}.
\]
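The spectral quantities $\lambda_k^{(j)}$, $\phi_k^{(j)}$ and the operators $D_j$ are defined through the integral operator of $K_j$ with respect to $\Pi$ and are not directly observable. A standard Nystrom-type heuristic approximates them by the eigendecomposition of the empirical Gram matrix; the sketch below is an illustration under that assumption only (the helper names are ours), not part of the analysis in the paper.

import numpy as np

def empirical_spectrum(gram):
    # Approximate the eigenvalues lambda_k of the integral operator
    # f -> int K(., x) f(x) dPi(x) by the eigenvalues of (1/n) * Gram,
    # sorted in decreasing order (a standard Nystrom-type approximation).
    n = gram.shape[0]
    evals = np.linalg.eigvalsh(gram / n)
    return np.clip(evals[::-1], 0.0, None)

def polynomial_decay_constant(evals, beta):
    # Smallest M2 with lambda_k <= M2 * k^{-2 beta}: a crude empirical check of the
    # polynomial eigenvalue decay required by condition (O2) below, for a candidate beta > 1/2.
    k = np.arange(1, len(evals) + 1)
    return float(np.max(evals * k ** (2 * beta)))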
Theorem 1 There exists a constant $D>0$ depending only on $\ell$ such that, for all $A\ge 1$ and for all $\varepsilon>0$ and $f\in H$ satisfying the condition
\[
\varepsilon \ge D\Bigl[\max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}\bigl|P\bigl((\ell'\bullet f)h_k\bigr)\bigr|\ \vee\ \|\ell'\bullet f\|_\infty\sqrt{\frac{A\log N}{n}}\Bigr],
\]
the following bound holds:
\[
\mathbb P\bigl\{\|\hat f\|_1\ge 3\|f\|_1\bigr\} \le N^{-A}.\tag{2.1}
\]
In particular, denote $f^{\varepsilon}:=\operatorname{argmin}_{f\in H}\bigl[P(\ell\bullet f)+\varepsilon\|f\|_1\bigr]$, the penalized minimizer of the true risk. If $\varepsilon\ge D\|\ell'\bullet f^{\varepsilon/4}\|_\infty\sqrt{\frac{A\log N}{n}}$, then
\[
\mathbb P\bigl\{\|\hat f\|_1\ge 3\|f^{\varepsilon/4}\|_1\bigr\}\le N^{-A}.\tag{2.2}
\]

Proof. By the definition of $\hat f$, for all $f\in H$,
\[
P_n(\ell\bullet\hat f) + \varepsilon\|\hat f\|_1 \le P_n(\ell\bullet f) + \varepsilon\|f\|_1.
\]
The convexity of the functional $f\mapsto P_n(\ell\bullet f)$ implies that
\[
P_n(\ell\bullet\hat f) - P_n(\ell\bullet f) \ge P_n\bigl((\ell'\bullet f)(\hat f - f)\bigr).
\]
As a result,
\[
\varepsilon\|\hat f\|_1 \le \varepsilon\|f\|_1 + P_n\bigl((\ell'\bullet f)(f-\hat f)\bigr) \le \varepsilon\|f\|_1 + \max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}\bigl|P_n\bigl((\ell'\bullet f)h_k\bigr)\bigr|\,\|f-\hat f\|_1.
\]
It follows that
\[
\Bigl(\varepsilon - \max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}\bigl|P_n((\ell'\bullet f)h_k)\bigr|\Bigr)\|\hat f\|_1 \le \Bigl(\varepsilon + \max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}\bigl|P_n((\ell'\bullet f)h_k)\bigr|\Bigr)\|f\|_1.
\]
Under the assumption $\varepsilon > \max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}|P_n((\ell'\bullet f)h_k)|$, this yields
\[
\|\hat f\|_1 \le \frac{\varepsilon + \max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}|P_n((\ell'\bullet f)h_k)|}{\varepsilon - \max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}|P_n((\ell'\bullet f)h_k)|}\,\|f\|_1.\tag{2.3}
\]
Note that
\[
\max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}\bigl|P_n((\ell'\bullet f)h_k)\bigr| \le \max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}\bigl|(P_n-P)((\ell'\bullet f)h_k)\bigr| + \max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}\bigl|P((\ell'\bullet f)h_k)\bigr|.
\]
Also, for any $i=1,\dots,N$,
\[
\sup_{\|h_i\|_{H_i}\le 1}\bigl|(P_n-P)((\ell'\bullet f)h_i)\bigr|
= \sup_{\|h_i\|_{H_i}\le 1}\Bigl|n^{-1}\sum_{j=1}^n\Bigl[(\ell'\bullet f)(X_j,Y_j)\langle h_i,K_i(X_j,\cdot)\rangle_{H_i} - \mathbb E(\ell'\bullet f)(X_j,Y_j)\langle h_i,K_i(X_j,\cdot)\rangle_{H_i}\Bigr]\Bigr|
= \Bigl\|n^{-1}\sum_{j=1}^n\Bigl[(\ell'\bullet f)(X_j,Y_j)K_i(X_j,\cdot) - \mathbb E(\ell'\bullet f)(X_j,Y_j)K_i(X_j,\cdot)\Bigr]\Bigr\|_{H_i}.
\]
Using Bernstein's type inequality in Hilbert spaces, we easily get the bound
\[
\max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}\bigl|(P_n-P)((\ell'\bullet f)h_k)\bigr| \le C\|\ell'\bullet f\|_\infty\Bigl[\sqrt{\frac{A\log N}{n}} + \frac{A\log N}{n}\Bigr]
\]
with probability at least $1-N^{-A}$. As soon as
\[
\varepsilon \ge 4C\|\ell'\bullet f\|_\infty\Bigl[\sqrt{\frac{A\log N}{n}} + \frac{A\log N}{n}\Bigr]\quad\text{and}\quad \varepsilon\ge 4\max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}\bigl|P((\ell'\bullet f)h_k)\bigr|,
\]
we get
\[
\max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}\bigl|P_n((\ell'\bullet f)h_k)\bigr| \le \varepsilon/2,
\]
and it follows from (2.3) that, with probability at least $1-N^{-A}$,
\[
\|\hat f\|_1 \le \frac{\varepsilon+\varepsilon/2}{\varepsilon-\varepsilon/2}\|f\|_1 = 3\|f\|_1,
\]
implying (2.1). In particular, we can use in (2.1) $f:=f^{\varepsilon/4}$. Then, by the necessary conditions of extremum in the definition of $f^{\varepsilon/4}$,
\[
\max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}\bigl|P\bigl((\ell'\bullet f^{\varepsilon/4})h_k\bigr)\bigr| \le \frac{\varepsilon}{4},
\]
and the second bound follows.

We now provide an alternative set of conditions on $\varepsilon$ so that (2.1) holds. By the conditions on the loss,
\[
\|\ell'\bullet f\|_\infty \le C(1+L\|f\|_\infty) \le C(1+L\|f\|_1)
\]
with constants $C,L$ depending only on $\ell$ (if $\ell'_u$ is uniformly bounded, $L=0$). Since, by the necessary conditions of minimum at $f_*$, $P((\ell'\bullet f_*)h_k)=0$, $h_k\in H_k$, $k=1,\dots,N$, we also have
\[
\max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}\bigl|P((\ell'\bullet f)h_k)\bigr| = \max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}\bigl|P\bigl((\ell'\bullet f - \ell'\bullet f_*)h_k\bigr)\bigr| \le C\|f-f_*\|_{L_2(\Pi)},
\]
where we used the fact that $\ell'_u(y,u)$ is Lipschitz with respect to $u$. Therefore, the condition on $\varepsilon$ in (2.1) is satisfied if
\[
\varepsilon \ge D(1+L\|f\|_1)\sqrt{\frac{A\log N}{n}}\quad\text{and}\quad \|f-f_*\|_{L_2(\Pi)} \le \varepsilon/D
\]
with a properly chosen $D$ (depending only on $\ell$).

3 Oracle inequalities

In what follows we will assume that $R>0$ is such that
\[
\|\hat f\|_1 \le R
\]
with probability at least $1-N^{-A}$. In particular, if $\bar f\in H$ satisfies the assumption of Theorem 1, i.e.,
\[
\varepsilon \ge D\Bigl[\max_{1\le k\le N}\sup_{\|h_k\|_{H_k}\le 1}\bigl|P\bigl((\ell'\bullet\bar f)h_k\bigr)\bigr|\ \vee\ \|\ell'\bullet\bar f\|_\infty\sqrt{\frac{A\log N}{n}}\Bigr],
\]
then one can take $R=3\|\bar f\|_1$.

We need some measures of dependence (in a probabilistic sense) between the spaces $H_j$, $j=1,\dots,N$. In the case of a simple dictionary $\{h_1,\dots,h_N\}$ consisting of $N$ functions (equivalently, $N$ one-dimensional spaces), the error of sparse recovery depends on the Gram matrix of the dictionary in the space $L_2(\Pi)$ (see, e.g., Koltchinskii (2008)). A similar approach is taken here. Given $h_j\in H_j$, $j=1,\dots,N$, and $J\subset\{1,\dots,N\}$, denote by $\kappa(\{h_j: j\in J\})$ the minimal eigenvalue of the Gram matrix $\bigl(\langle h_i,h_j\rangle_{L_2(\Pi)}\bigr)_{i,j\in J}$ and by $\bar\kappa(\{h_j: j\in J\})$ its maximal eigenvalue. Let
\[
\kappa(J) := \inf\bigl\{\kappa(\{h_j: j\in J\}): h_j\in H_j,\ \|h_j\|_{L_2(\Pi)}=1\bigr\}
\]
and
\[
\bar\kappa(J) := \sup\bigl\{\bar\kappa(\{h_j: j\in J\}): h_j\in H_j,\ \|h_j\|_{L_2(\Pi)}=1\bigr\}.
\]
Also, denote by $L_J$ the linear span of the subspaces $H_j$, $j\in J$. Let
\[
\rho(J) := \sup\Bigl\{\frac{\langle f,g\rangle_{L_2(\Pi)}}{\|f\|_{L_2(\Pi)}\|g\|_{L_2(\Pi)}}:\ f\in L_J,\ g\in L_{J^c},\ f\ne 0,\ g\ne 0\Bigr\}.
\]
In what follows, we will consider a set $\mathcal O=\mathcal O(M_1,M_2)$ of functions (more precisely, of their additive representations) $f=f_1+\dots+f_N\in H$, $f_j\in H_j$, $j=1,\dots,N$, that will be called "admissible oracles". Let $J_f:=\{j: f_j\ne 0\}$ and suppose the following assumptions hold:

O1. The "relevant" part $J_f$ of the dictionary satisfies the condition
\[
\frac{\bar\kappa(J_f)}{\kappa(J_f)(1-\rho^2(J_f))} \le M_1.
\]

O2. For some $\beta>1/2$ and for all $j\in J_f$,
\[
\lambda_k^{(j)} \le M_2\,k^{-2\beta},\qquad k=1,2,\dots
\]
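The quantities $\kappa(\{h_j\})$ and $\bar\kappa(\{h_j\})$ are just the extreme eigenvalues of the $L_2(\Pi)$ Gram matrix of a fixed tuple of unit-norm functions, and $\rho(J)$ is the largest canonical correlation between $L_J$ and $L_{J^c}$; the infimum/supremum over whole subspaces in $\kappa(J)$, $\bar\kappa(J)$ is not directly computable. The sketch below only illustrates the fixed-tuple case, with $L_2(\Pi)$ inner products replaced by averages over a hypothetical sample from $\Pi$ (all names are ours, not quantities used in the proofs).

import numpy as np

def gram_extreme_eigenvalues(H):
    # H is an (n x |J|) array whose columns are h_j(X_1..X_n) for a fixed tuple {h_j : j in J}.
    # Returns (kappa, kappa_bar): smallest and largest eigenvalues of the empirical L2(Pi)
    # Gram matrix after normalizing each h_j to unit empirical L2 norm.  kappa(J) / kappa_bar(J)
    # in the text are the inf / sup of these over all such tuples.
    n = H.shape[0]
    Hn = H / np.sqrt((H ** 2).mean(axis=0))      # columns with unit empirical L2(Pi) norm
    G = Hn.T @ Hn / n                            # empirical Gram matrix <h_i, h_j>_{L2}
    evals = np.linalg.eigvalsh(G)
    return float(evals[0]), float(evals[-1])

def max_canonical_correlation(F, G):
    # Empirical stand-in for rho(J): the largest canonical correlation between the span of
    # the columns of F (functions in L_J) and of G (functions in L_{J^c}), both evaluated on
    # the same sample; columns are assumed linearly independent, and only these finite spans
    # (not the full subspaces) are covered.
    Qf, _ = np.linalg.qr(F)
    Qg, _ = np.linalg.qr(G)
    s = np.linalg.svd(Qf.T @ Qg, compute_uv=False)
    return float(s[0]) if s.size else 0.0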
Recall that $D_j$ is the linear operator defined in the previous section. Denote
\[
\gamma(f) := \frac{1}{\operatorname{card}(J_f)}\sum_{j\in J_f}\frac{\|D_j f_j\|_{L_2(\Pi)}^2}{\|f_j\|_{H_j}^2}.
\]
We are now in a position to state the main result of this paper.

Theorem 2 There exist constants $D,L$ depending only on $\ell$ ($L=0$ if $\ell'_u$ is uniformly bounded) such that, for all $A\ge 1$, for all $f\in\mathcal O$ with $\operatorname{card}(J_f)=d$ and for all
\[
\varepsilon \ge D(1+LR)\sqrt{\frac{\log N}{n}},
\]
with probability at least $1-N^{-A}$,
\[
\mathcal E(\hat f) + \frac{\varepsilon}{2}\sum_{j\notin J_f}\|\hat f_j\|_{H_j} \le 7\mathcal E(f) + K\Bigl(\frac{d^{(2\beta-1)/(2\beta+1)}}{n^{2\beta/(2\beta+1)}} + \gamma(f)\,d\,\varepsilon^2 + \frac{A\log N}{n}\Bigr),
\]
where $K$ is a constant depending on $\ell$, $R$, $M_1$, $M_2$, $\|f\|_\infty$ and $\|f_*\|_\infty$.

The meaning of this result can be described as follows. Suppose there exists an oracle $f$ such that: the excess risk of $f$ is small (i.e., $f$ provides a good approximation of $f_*$); the set $J_f$ is small (i.e., $f$ has a sparse representation in the dictionary); condition (O1) is satisfied, i.e., the relevant part of the dictionary is "well posed" in the sense that the spaces $H_j$, $j\in J_f$, are not "too dependent" among themselves and with the rest of the spaces in the dictionary; condition (O2) is satisfied, which means "sufficient smoothness" of the functions in the spaces $H_j$, $j\in J_f$; and, finally, the components $f_j$, $j\in J_f$, of the oracle $f$ are even smoother in the sense that the quantities $\|D_j f_j\|_{L_2(\Pi)}^2/\|f_j\|_{H_j}^2$, $j\in J_f$, are properly bounded. Then the excess risk of the empirical solution $\hat f$ is controlled by the excess risk of the oracle as well as by its degree of sparsity $d$ and, at the same time, $\hat f$ is approximately sparse in the sense that $\sum_{j\notin J_f}\|\hat f_j\|_{H_j}$ is small. In other words, the solution obtained via $\ell_1$-penalized empirical risk minimization is adaptive to sparsity (at least, subject to the constraints described above).
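The ratio $\|D_jf_j\|_{L_2(\Pi)}^2/\|f_j\|_{H_j}^2$ entering $\gamma(f)$ can be approximated, very roughly, by the empirical Mercer system discussed earlier: expand $f_j$ in the empirical eigenfunctions and divide its Fourier coefficients by the empirical eigenvalues. The sketch below is such a sample-based proxy only; the function name, the optional truncation and the numerical threshold are ours, and this is not a quantity used in the proofs.

import numpy as np

def empirical_gamma_component(gram, coef, n_components=None):
    # Rough empirical proxy for ||D_j f_j||^2_{L2} / ||f_j||^2_{H_j} for one component
    # f_j = sum_i coef_i K_j(X_i, .), using the eigendecomposition of Gram/n in place of
    # the Mercer eigensystem (all quantities are sample-based approximations).
    n = gram.shape[0]
    mu, V = np.linalg.eigh(gram / n)            # empirical eigenvalues / eigenvectors
    order = np.argsort(mu)[::-1]
    mu, V = mu[order], V[:, order]
    if n_components is not None:                # optionally truncate to the leading spectrum
        mu, V = mu[:n_components], V[:, :n_components]
    f_vals = gram @ coef                        # f_j evaluated at the sample points
    phi_vals = np.sqrt(n) * V                   # empirical eigenfunctions, unit empirical L2 norm
    fourier = (phi_vals * f_vals[:, None]).mean(axis=0)   # <f_j, phi_k>_{L2(Pi_n)}
    keep = mu > 1e-12
    Df_sq = np.sum(fourier[keep] ** 2 / mu[keep] ** 2)    # proxy for ||D_j f_j||^2_{L2}
    hnorm_sq = coef @ gram @ coef                         # ||f_j||^2_{H_j}
    return Df_sq / hnorm_sq
    # gamma(f) would then be approximated by averaging these ratios over j in J_f.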
Proof. Throughout the proof we fix representations $f=f_1+\dots+f_N$ and $\hat f=\hat f_1+\dots+\hat f_N$ (and we use (1.2) to define $\hat f_j$). The definition of $\hat f$ implies that, for all $f\in H$,
\[
P_n(\ell\bullet\hat f)+\varepsilon\|\hat f\|_1 \le P_n(\ell\bullet f)+\varepsilon\|f\|_1.
\]
Therefore,
\[
\mathcal E(\hat f) + \varepsilon\sum_{j\notin J_f}\|\hat f_j\|_{H_j} \le \mathcal E(f) + \varepsilon\sum_{j\in J_f}\bigl(\|f_j\|_{H_j}-\|\hat f_j\|_{H_j}\bigr) + (P-P_n)(\ell\bullet\hat f-\ell\bullet f).
\]
We first show that
\[
\mathcal E(\hat f) + 2\varepsilon\sum_{j\notin J_f}\|\hat f_j\|_{H_j} \le 3\mathcal E(f) + \frac{2\varepsilon^2\gamma(f)d}{\tau\,\kappa(J_f)(1-\rho^2(J_f))} + 2(P-P_n)(\ell\bullet\hat f-\ell\bullet f),
\]
where $\tau=\tau(\|\hat f\|_\infty\vee\|f\|_\infty\vee\|f_*\|_\infty)$.

Let $s_j(f_j)$ be a subgradient of $f_j\mapsto\|f_j\|_{H_j}$ at $f_j$, i.e., $s_j(f_j)=f_j/\|f_j\|_{H_j}$ if $f_j\ne 0$, and $s_j(f_j)$ is an arbitrary vector with $\|s_j(f_j)\|_{H_j}\le 1$ otherwise. Then we have
\[
\|\hat f_j\|_{H_j}-\|f_j\|_{H_j} \ge \langle s_j(f_j),\hat f_j-f_j\rangle_{H_j} = \langle D_j s_j(f_j),\hat f_j-f_j\rangle_{L_2(\Pi)} \ge -\|D_j s_j(f_j)\|_{L_2(\Pi)}\|\hat f_j-f_j\|_{L_2(\Pi)}.
\]
It follows that
\[
\mathcal E(\hat f)+\varepsilon\sum_{j\notin J_f}\|\hat f_j\|_{H_j} \le \mathcal E(f) + \varepsilon\Bigl(\sum_{j\in J_f}\|D_j s_j(f_j)\|_{L_2(\Pi)}^2\Bigr)^{1/2}\Bigl(\sum_{j\in J_f}\|\hat f_j-f_j\|_{L_2(\Pi)}^2\Bigr)^{1/2} + (P-P_n)(\ell\bullet\hat f-\ell\bullet f).
\]
It can also be shown (see Koltchinskii, 2008, Proposition 1) that
\[
\Bigl(\sum_{j\in J_f}\|\hat f_j-f_j\|_{L_2(\Pi)}^2\Bigr)^{1/2} \le \frac{\|\hat f-f\|_{L_2(\Pi)}}{\sqrt{\kappa(J_f)(1-\rho^2(J_f))}}.
\]
This allows us to write
\[
\mathcal E(\hat f)+\varepsilon\sum_{j\notin J_f}\|\hat f_j\|_{H_j} \le \mathcal E(f) + \varepsilon\sqrt{\frac{\gamma(f)d}{\kappa(J_f)(1-\rho^2(J_f))}}\,\|\hat f-f\|_{L_2(\Pi)} + (P-P_n)(\ell\bullet\hat f-\ell\bullet f).
\]
Then, using the bounds
\[
\|\hat f-f\|_{L_2(\Pi)} \le \|\hat f-f_*\|_{L_2(\Pi)}+\|f-f_*\|_{L_2(\Pi)},\qquad \tau\|\hat f-f_*\|_{L_2(\Pi)}^2\le\mathcal E(\hat f),\qquad \tau\|f-f_*\|_{L_2(\Pi)}^2\le\mathcal E(f),
\]
we get
\[
\mathcal E(\hat f)+\varepsilon\sum_{j\notin J_f}\|\hat f_j\|_{H_j} \le \mathcal E(f) + \varepsilon\sqrt{\frac{\gamma(f)d}{\kappa(J_f)(1-\rho^2(J_f))}}\Bigl(\sqrt{\frac{\mathcal E(\hat f)}{\tau}}+\sqrt{\frac{\mathcal E(f)}{\tau}}\Bigr) + (P-P_n)(\ell\bullet\hat f-\ell\bullet f).
\]
Applying the inequality $ab\le a^2/2+b^2/2$, we show that
\[
\varepsilon\sqrt{\frac{\gamma(f)d}{\kappa(J_f)(1-\rho^2(J_f))}}\sqrt{\frac{\mathcal E(\hat f)}{\tau}} \le \frac{\mathcal E(\hat f)}{2} + \frac{\varepsilon^2\gamma(f)d}{2\tau\,\kappa(J_f)(1-\rho^2(J_f))}.
\]
Similarly,
\[
\varepsilon\sqrt{\frac{\gamma(f)d}{\kappa(J_f)(1-\rho^2(J_f))}}\sqrt{\frac{\mathcal E(f)}{\tau}} \le \frac{\mathcal E(f)}{2} + \frac{\varepsilon^2\gamma(f)d}{2\tau\,\kappa(J_f)(1-\rho^2(J_f))}.
\]
This leads to the following bound:
\[
\mathcal E(\hat f)+\varepsilon\sum_{j\notin J_f}\|\hat f_j\|_{H_j} \le \mathcal E(f) + \frac{\mathcal E(\hat f)}{2} + \frac{\mathcal E(f)}{2} + \frac{\varepsilon^2\gamma(f)d}{\tau\,\kappa(J_f)(1-\rho^2(J_f))} + (P-P_n)(\ell\bullet\hat f-\ell\bullet f).
\]
It easily follows that
\[
\mathcal E(\hat f)+2\varepsilon\sum_{j\notin J_f}\|\hat f_j\|_{H_j} \le 3\mathcal E(f) + \frac{2\varepsilon^2\gamma(f)d}{\tau\,\kappa(J_f)(1-\rho^2(J_f))} + 2(P-P_n)(\ell\bullet\hat f-\ell\bullet f).
\]
Denote
\[
\alpha_n(\delta,\Delta,R) := \sup\Bigl\{\bigl|(P_n-P)(\ell\bullet g-\ell\bullet f)\bigr|:\ \|g-f\|_{L_2(\Pi)}\le\delta,\ \sum_{j\notin J_f}\|g_j\|_{H_j}\le\Delta,\ \|g\|_1\le R\Bigr\}.
\]
If $\|\hat f\|_1\le R$ (which holds with probability at least $1-N^{-A}$), then we have
\[
\mathcal E(\hat f)+2\varepsilon\sum_{j\notin J_f}\|\hat f_j\|_{H_j} \le 3\mathcal E(f) + \frac{2\varepsilon^2\gamma(f)d}{\tau\,\kappa(J_f)(1-\rho^2(J_f))} + 2\alpha_n\Bigl(\|\hat f-f\|_{L_2(\Pi)},\ \sum_{j\notin J_f}\|\hat f_j\|_{H_j},\ R\Bigr),
\]
with $\tau=\tau(R\vee\|f\|_\infty\vee\|f_*\|_\infty)$. We use Lemma 8 to get, with probability at least $1-N^{-A}$,
\[
\mathcal E(\hat f)+2\varepsilon\sum_{j\notin J_f}\|\hat f_j\|_{H_j} \le 3\mathcal E(f) + \frac{2\varepsilon^2\gamma(f)d}{\tau\,\kappa(J_f)(1-\rho^2(J_f))}
+ C(1+LR)\Bigl[\sqrt{\frac{\bar\kappa(J_f)}{\kappa(J_f)(1-\rho^2(J_f))}}\sqrt{\frac{dm}{n}}\,\|\hat f-f\|_{L_2(\Pi)}
+ R\sqrt{\frac{\max_{j\in J_f}\sum_{k>m}\lambda_k^{(j)}}{n}}
+ R\max_{j\in J_f}\sqrt{\lambda_m^{(j)}}\sqrt{\frac{m\log d}{n}}
+ \sqrt{\frac{\log(N-d)+1}{n}}\sum_{j\notin J_f}\|\hat f_j\|_{H_j}\Bigr]
+ C(1+LR)\sqrt{\frac{A\log N}{n}}\,\|\hat f-f\|_{L_2(\Pi)}
+ CR(1+LR)\frac{A\log N}{n}\tag{3.1}
\]
(Lemma 8 can be used only under the assumption $R\le e^N$; however, for very large $R>e^N$, the proof of the inequality of the theorem is very simple). Recall that
\[
\|\hat f-f\|_{L_2(\Pi)} \le \sqrt{\frac{\mathcal E(\hat f)}{\tau}}+\sqrt{\frac{\mathcal E(f)}{\tau}}.
\]
Under the assumption
\[
\varepsilon \ge C(1+LR)\sqrt{\frac{\log N}{n}},
\]
the term involving $\sum_{j\notin J_f}\|\hat f_j\|_{H_j}$ on the right hand side of (3.1) can be absorbed into the left hand side, and we get the following bound:
\[
\mathcal E(\hat f)+\frac{\varepsilon}{2}\sum_{j\notin J_f}\|\hat f_j\|_{H_j} \le 3\mathcal E(f) + \frac{2\varepsilon^2\gamma(f)d}{\tau\,\kappa(J_f)(1-\rho^2(J_f))}
+ C(1+LR)\sqrt{\frac{\bar\kappa(J_f)}{\kappa(J_f)(1-\rho^2(J_f))}}\Bigl(\sqrt{\frac{\mathcal E(\hat f)}{\tau}}+\sqrt{\frac{\mathcal E(f)}{\tau}}\Bigr)\sqrt{\frac{dm}{n}}
+ C(1+LR)R\Bigl[\sqrt{\frac{\max_{j\in J_f}\sum_{k>m}\lambda_k^{(j)}}{n}}+\max_{j\in J_f}\sqrt{\lambda_m^{(j)}}\sqrt{\frac{m\log d}{n}}\Bigr]
+ C(1+LR)\Bigl(\sqrt{\frac{\mathcal E(\hat f)}{\tau}}+\sqrt{\frac{\mathcal E(f)}{\tau}}\Bigr)\sqrt{\frac{A\log N}{n}}
+ CR(1+LR)\frac{A\log N}{n}.\tag{3.2}
\]
Then we have
\[
C(1+LR)\sqrt{\frac{\bar\kappa(J_f)}{\kappa(J_f)(1-\rho^2(J_f))}}\sqrt{\frac{\mathcal E(\hat f)}{\tau}}\sqrt{\frac{dm}{n}} \le \frac{\mathcal E(\hat f)}{4} + \frac{2C^2(1+LR)^2}{\tau}\,\frac{\bar\kappa(J_f)}{\kappa(J_f)(1-\rho^2(J_f))}\,\frac{dm}{n}
\]
and
\[
C(1+LR)\sqrt{\frac{\bar\kappa(J_f)}{\kappa(J_f)(1-\rho^2(J_f))}}\sqrt{\frac{\mathcal E(f)}{\tau}}\sqrt{\frac{dm}{n}} \le \frac{\mathcal E(f)}{4} + \frac{2C^2(1+LR)^2}{\tau}\,\frac{\bar\kappa(J_f)}{\kappa(J_f)(1-\rho^2(J_f))}\,\frac{dm}{n}.
\]
Similarly,
\[
C(1+LR)\sqrt{\frac{\mathcal E(\hat f)}{\tau}}\sqrt{\frac{A\log N}{n}} \le \frac{\mathcal E(\hat f)}{4} + \frac{2C^2(1+LR)^2}{\tau}\,\frac{A\log N}{n}
\]
and
\[
C(1+LR)\sqrt{\frac{\mathcal E(f)}{\tau}}\sqrt{\frac{A\log N}{n}} \le \frac{\mathcal E(f)}{4} + \frac{2C^2(1+LR)^2}{\tau}\,\frac{A\log N}{n}.
\]
This yields the following bound:
\[
\frac{\mathcal E(\hat f)}{2}+\frac{\varepsilon}{2}\sum_{j\notin J_f}\|\hat f_j\|_{H_j} \le \frac{7}{2}\mathcal E(f) + \frac{2\varepsilon^2\gamma(f)d}{\tau\,\kappa(J_f)(1-\rho^2(J_f))}
+ \frac{4C^2(1+LR)^2}{\tau}\,\frac{\bar\kappa(J_f)}{\kappa(J_f)(1-\rho^2(J_f))}\,\frac{dm}{n}
+ C(1+LR)R\Bigl[\sqrt{\frac{\max_{j\in J_f}\sum_{k>m}\lambda_k^{(j)}}{n}}+\max_{j\in J_f}\sqrt{\lambda_m^{(j)}}\sqrt{\frac{m\log d}{n}}\Bigr]
+ \frac{4C^2(1+LR)^2}{\tau}\,\frac{A\log N}{n}
+ CR(1+LR)\frac{A\log N}{n}.\tag{3.3}
\]
It remains to take
\[
m := \Bigl\lceil\Bigl(\frac{R\sqrt n\,\kappa(J_f)(1-\rho^2(J_f))}{(1+LR)\,d\,\bar\kappa(J_f)}\Bigr)^{2/(2\beta+1)}\Bigr\rceil
\]
and to use condition (O2) (so that $\sum_{k>m}\lambda_k^{(j)}\le M_2\,m^{1-2\beta}/(2\beta-1)$ and $\lambda_m^{(j)}\le M_2\,m^{-2\beta}$) to get the following bound (with some constant $C>0$ depending also on $M_2$ and $\beta$):
\[
\mathcal E(\hat f)+\varepsilon\sum_{j\notin J_f}\|\hat f_j\|_{H_j} \le 7\mathcal E(f) + \frac{4\varepsilon^2\gamma(f)d}{\tau\,\kappa(J_f)(1-\rho^2(J_f))}
+ \frac{C(1+LR)^{4\beta/(2\beta+1)}R^{2/(2\beta+1)}}{\tau}\Bigl(\frac{\bar\kappa(J_f)}{\kappa(J_f)(1-\rho^2(J_f))}\Bigr)^{(2\beta-1)/(2\beta+1)}\frac{d^{(2\beta-1)/(2\beta+1)}}{n^{2\beta/(2\beta+1)}}
+ \frac{C^2(1+LR)^2}{\tau}\,\frac{A\log N}{n} + CR(1+LR)\frac{A\log N}{n},\tag{3.4}
\]
which implies the result.

A Appendix

The Rademacher process is defined as
\[
R_n(g) := n^{-1}\sum_{j=1}^n \sigma_j g(X_j),
\]
where $\{\sigma_j\}$ are i.i.d. Rademacher random variables independent of $\{X_j\}$. We will need several bounds for Rademacher processes indexed by functions from RKHS (some of them are well known; see, e.g., Mendelson (2002) and Blanchard, Bousquet and Massart (2008)). We state them without proofs for brevity.

First we consider a single RKHS $H_K$, where $K$ is a kernel with eigenvalues $\lambda_k$ and eigenfunctions $\phi_k$ (in $L_2(\Pi)$).

Lemma 3 The following bound holds:
\[
\mathbb E\sup_{\|h\|_{H_K}\le 1}|R_n(h)| \le \sqrt{\frac{\sum_{k\ge 1}\lambda_k}{n}}.
\]

Let $m\ge 1$. Denote by $L$ the linear span of the functions $\{\phi_k: k=1,\dots,m\}$ and by $L^{\perp}$ the closed linear span (in $L_2(\Pi)$) of the functions $\{\phi_k: k\ge m+1\}$. $P_L$, $P_{L^{\perp}}$ will denote the orthogonal projectors in $L_2(\Pi)$ onto the corresponding subspaces.

Lemma 4 For all $m\ge 1$,
\[
\mathbb E\sup_{\|h\|_{H_K}\le 1}\bigl|R_n(P_{L^{\perp}}h)\bigr| \le \sqrt{\frac{\sum_{k\ge m+1}\lambda_k}{n}}.
\]
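By the reproducing property, the supremum in Lemma 3 is available in closed form for any fixed sample: $\sup_{\|h\|_{H_K}\le 1}|R_n(h)| = n^{-1}\sqrt{\sigma' K_n\sigma}$, where $K_n$ is the Gram matrix and $\sigma$ the vector of Rademacher signs. The lemma can therefore be checked numerically by Monte Carlo over $\sigma$; the small sketch below does this for a hypothetical Gaussian-kernel sample (names are ours).

import numpy as np

def rademacher_sup_unit_ball(gram, rng, n_rep=200):
    # Monte Carlo estimate of E sup_{||h||_{H_K} <= 1} |R_n(h)| for a fixed sample.
    # sup_{||h||<=1} R_n(h) = || n^{-1} sum_j sigma_j K(X_j, .) ||_{H_K} = n^{-1} sqrt(sigma' Gram sigma).
    n = gram.shape[0]
    vals = []
    for _ in range(n_rep):
        sigma = rng.choice([-1.0, 1.0], size=n)
        vals.append(np.sqrt(sigma @ gram @ sigma) / n)
    return float(np.mean(vals))

rng = np.random.default_rng(1)
n = 200
X = rng.uniform(-1, 1, size=n)
gram = np.exp(-(X[:, None] - X[None, :]) ** 2 / 2.0)
lhs = rademacher_sup_unit_ball(gram, rng)
rhs = np.sqrt(np.trace(gram / n) / n)   # empirical proxy for sqrt(sum_k lambda_k / n)
print(lhs, "<=", rhs)

The same device, applied to the Gram matrix of the kernel projected onto the span of the trailing eigenfunctions, would give an analogous illustration of Lemma 4.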
We now turn to the case of a dictionary $\{H_j: j=1,\dots,N\}$ of RKHS with kernels $\{K_j: j=1,\dots,N\}$. As before, denote by $\{\lambda_k^{(j)}: k\ge 1\}$ the eigenvalues (arranged in decreasing order) and by $\{\phi_k^{(j)}: k\ge 1\}$ the $L_2(\Pi)$-orthonormal eigenfunctions of $K_j$. The following bounds will be needed in this case.

Lemma 5 With some numerical constant $C$,
\[
\mathbb E\max_{1\le j\le N}\sup_{\|h_j\|_{H_j}\le 1}|R_n(h_j)| \le \sqrt{\frac{\max_{1\le j\le N}\sum_{k\ge 1}\lambda_k^{(j)}}{n}} + C\sqrt{\frac{\log N}{n}}.
\]

Proof. We use the bounded difference inequality to get, for each $j=1,\dots,N$, with probability at least $1-e^{-t-\log N}$,
\[
\sup_{\|h_j\|_{H_j}\le 1}|R_n(h_j)| \le \mathbb E\sup_{\|h_j\|_{H_j}\le 1}|R_n(h_j)| + C\sqrt{\frac{t+\log N}{n}}.
\]
By the union bound, this yields, with probability at least $1-Ne^{-t-\log N}=1-e^{-t}$,
\[
\max_{1\le j\le N}\sup_{\|h_j\|_{H_j}\le 1}|R_n(h_j)| \le \max_{1\le j\le N}\mathbb E\sup_{\|h_j\|_{H_j}\le 1}|R_n(h_j)| + C\sqrt{\frac{t}{n}} + C\sqrt{\frac{\log N}{n}},
\]
which holds for all $t>0$ and implies that
\[
\mathbb E\max_{1\le j\le N}\sup_{\|h_j\|_{H_j}\le 1}|R_n(h_j)| \le \max_{1\le j\le N}\mathbb E\sup_{\|h_j\|_{H_j}\le 1}|R_n(h_j)| + C\sqrt{\frac{\log N}{n}}
\]
with a properly chosen constant $C>0$. Note that, by Lemma 3,
\[
\mathbb E\sup_{\|h_j\|_{H_j}\le 1}|R_n(h_j)| \le \sqrt{\frac{\sum_{k\ge 1}\lambda_k^{(j)}}{n}},
\]
and the result follows.

As before, denote by $L_j$, $L_j^{\perp}$ the subspaces of $L_2(\Pi)$ spanned by $\{\phi_k^{(j)}: k\le m\}$ and $\{\phi_k^{(j)}: k>m\}$, respectively, and by $P_{L_j}$, $P_{L_j^{\perp}}$ the corresponding orthogonal projections. Recall that the sequence $\{\lambda_k^{(j)}\}$ is nonincreasing. The following statement is a uniform version of Lemma 4.

Lemma 6 With some numerical constant $C$,
\[
\mathbb E\max_{1\le j\le N}\sup_{\|h_j\|_{H_j}\le 1}\bigl|R_n(P_{L_j^{\perp}}h_j)\bigr| \le 2\sqrt{\frac{\max_{1\le j\le N}\sum_{k>m}\lambda_k^{(j)}}{n}} + 2\max_{1\le j\le N}\sqrt{\lambda_m^{(j)}}\sqrt{\frac{m\log N}{n}} + C\sqrt{\frac{\log N}{n}} + C\frac{\log N}{n}.
\]

Lemma 7 The following bound holds:
\[
\mathbb E\sup\Bigl\{|R_n(g-f)|:\ \|g-f\|_{L_2(\Pi)}\le\delta,\ \|g\|_1\le R,\ \sum_{j\notin J_f}\|g_j\|_{H_j}\le\Delta\Bigr\}
\le C\sqrt{\frac{\bar\kappa(J_f)}{\kappa(J_f)(1-\rho^2(J_f))}}\,\delta\sqrt{\frac{dm}{n}}
+ 2R\sqrt{\frac{\max_{j\in J_f}\sum_{k>m}\lambda_k^{(j)}}{n}}
+ CR\max_{j\in J_f}\sqrt{\lambda_m^{(j)}}\sqrt{\frac{m\log d}{n}}
+ C\Delta\sqrt{\frac{\log(N-d)+1}{n}}.
\]

Proof. First note that (since $f_j=0$ for $j\notin J_f$)
\[
\mathbb E\sup\Bigl\{\Bigl|R_n\Bigl(\sum_{j=1}^N(g_j-f_j)\Bigr)\Bigr|:\dots\Bigr\}
\le \mathbb E\sup\Bigl\{\Bigl|R_n\Bigl(\sum_{j\in J_f}(g_j-f_j)\Bigr)\Bigr|:\dots\Bigr\}
+ \mathbb E\sup\Bigl\{\Bigl|R_n\Bigl(\sum_{j\notin J_f}g_j\Bigr)\Bigr|:\ \sum_{j\notin J_f}\|g_j\|_{H_j}\le\Delta\Bigr\}.
\]
The second term can be bounded as follows:
\[
\mathbb E\sup\Bigl\{\Bigl|R_n\Bigl(\sum_{j\notin J_f}g_j\Bigr)\Bigr|:\ \sum_{j\notin J_f}\|g_j\|_{H_j}\le\Delta\Bigr\}
\le \Delta\,\mathbb E\max_{j\notin J_f}\sup_{\|h_j\|_{H_j}\le 1}|R_n(h_j)|
\le C\Delta\sqrt{\frac{\log(N-d)+1}{n}},
\]
where we used Lemma 5 (recall that $\sum_{k\ge 1}\lambda_k^{(j)}\le 1$). As to the first term, we use the bound
\[
\mathbb E\sup\Bigl\{\Bigl|R_n\Bigl(\sum_{j\in J_f}(g_j-f_j)\Bigr)\Bigr|:\dots\Bigr\}
\le \mathbb E\sup\Bigl\{\Bigl|R_n\Bigl(\sum_{j\in J_f}P_{L_j}(g_j-f_j)\Bigr)\Bigr|:\dots\Bigr\}
+ \mathbb E\sup\Bigl\{\Bigl|R_n\Bigl(\sum_{j\in J_f}P_{L_j^{\perp}}(g_j-f_j)\Bigr)\Bigr|:\dots\Bigr\}.
\]
Note that
\[
\Bigl\|\sum_{j\in J_f}P_{L_j}(g_j-f_j)\Bigr\|_{L_2(\Pi)}^2 \le \bar\kappa(J_f)\sum_{j\in J_f}\|P_{L_j}(g_j-f_j)\|_{L_2(\Pi)}^2 \le \bar\kappa(J_f)\sum_{j\in J_f}\|g_j-f_j\|_{L_2(\Pi)}^2
\le \frac{\bar\kappa(J_f)}{\kappa(J_f)(1-\rho^2(J_f))}\,\|g-f\|_{L_2(\Pi)}^2 \le \frac{\bar\kappa(J_f)}{\kappa(J_f)(1-\rho^2(J_f))}\,\delta^2.
\]
Also, $\sum_{j\in J_f}P_{L_j}(g_j-f_j)$ takes values in the linear span of $\bigcup_{j\in J_f}L_j$, whose dimension is at most $dm$. This yields the following bound:
\[
\mathbb E\sup\Bigl\{\Bigl|R_n\Bigl(\sum_{j\in J_f}P_{L_j}(g_j-f_j)\Bigr)\Bigr|:\ \|g-f\|_{L_2(\Pi)}\le\delta\Bigr\} \le C\sqrt{\frac{\bar\kappa(J_f)}{\kappa(J_f)(1-\rho^2(J_f))}}\,\delta\sqrt{\frac{dm}{n}}.
\]
Finally, since $\sum_{j\in J_f}\|g_j-f_j\|_{H_j}\le \|g\|_1+\|f\|_1\le 2R$, we use Lemma 6 (applied to the sub-dictionary $\{H_j: j\in J_f\}$ of cardinality $d$) to get
\[
\mathbb E\sup\Bigl\{\Bigl|R_n\Bigl(\sum_{j\in J_f}P_{L_j^{\perp}}(g_j-f_j)\Bigr)\Bigr|:\ \|g\|_1\le R\Bigr\}
\le 2R\,\mathbb E\max_{j\in J_f}\sup_{\|h_j\|_{H_j}\le 1}\bigl|R_n(P_{L_j^{\perp}}h_j)\bigr|
\le 2R\sqrt{\frac{\max_{j\in J_f}\sum_{k>m}\lambda_k^{(j)}}{n}} + CR\max_{j\in J_f}\sqrt{\lambda_m^{(j)}}\sqrt{\frac{m\log d}{n}}.
\]
Combining the above bounds, we get the statement of the lemma.

Recall that
\[
\alpha_n(\delta,\Delta,R) := \sup\bigl\{|(P_n-P)(\ell\bullet g-\ell\bullet f)|:\ g\in\mathcal G(\delta,\Delta,R)\bigr\},
\]
where
\[
\mathcal G(\delta,\Delta,R) := \Bigl\{g:\ \|g-f\|_{L_2(\Pi)}\le\delta,\ \sum_{j\notin J_f}\|g_j\|_{H_j}\le\Delta,\ \|g\|_1\le R\Bigr\}.
\]
We will assume that $R\le e^N$ (recall also the assumption $N\ge n^{\gamma}$).

Lemma 8 There exist constants $C,L$ depending only on the loss $\ell$ ($L=0$ if $\ell'_u$ is bounded) such that, for all
\[
n^{-1/2}\le\delta\le 2R,\qquad n^{-1/2}\le\Delta\le R,\tag{4.1}
\]
and for all $A\ge 1$, the following bound holds with probability at least $1-N^{-A}$:
\[
\alpha_n(\delta,\Delta,R) \le C(1+LR)\Bigl[\sqrt{\frac{\bar\kappa(J_f)}{\kappa(J_f)(1-\rho^2(J_f))}}\,\delta\sqrt{\frac{dm}{n}}
+ R\sqrt{\frac{\max_{j\in J_f}\sum_{k>m}\lambda_k^{(j)}}{n}}
+ R\max_{j\in J_f}\sqrt{\lambda_m^{(j)}}\sqrt{\frac{m\log d}{n}}
+ \Delta\sqrt{\frac{\log(N-d)+1}{n}}\Bigr]
+ C(1+LR)\,\delta\sqrt{\frac{A\log N}{n}} + CR(1+LR)\frac{A\log N}{n}.\tag{4.2}
\]
Proof. First note that, by Talagrand's concentration inequality, with probability at least $1-e^{-t}$,
\[
\alpha_n(\delta,\Delta,R) \le 2\,\mathbb E\,\alpha_n(\delta,\Delta,R) + C(1+LR)\,\delta\sqrt{\frac{t}{n}} + \frac{CR(1+LR)\,t}{n}.
\]
To apply Talagrand's inequality we used the assumptions on the loss function. It follows from these assumptions that, for all $g\in\mathcal G(\delta,\Delta,R)$,
\[
\|\ell\bullet g-\ell\bullet f\|_{L_2(P)} \le C(1+LR)\|g-f\|_{L_2(\Pi)} \le C(1+LR)\,\delta
\]
and also
\[
\|\ell\bullet g-\ell\bullet f\|_\infty \le CR(1+LR).
\]
Next, by the symmetrization inequality,
\[
\mathbb E\,\alpha_n(\delta,\Delta,R) \le 2\,\mathbb E\sup\bigl\{|R_n(\ell\bullet g-\ell\bullet f)|:\ g\in\mathcal G(\delta,\Delta,R)\bigr\}.
\]
We write $u=g-f$ and
\[
\ell\bullet g-\ell\bullet f = \ell\bullet(f+u)-\ell\bullet f,
\]
and observe that the function
\[
[-R,R]\ni u\mapsto \ell\bullet(f+u)-\ell\bullet f
\]
is Lipschitz with constant $C(1+LR)$. This allows us to use the Rademacher contraction inequality (Ledoux and Talagrand, 1991) to get
\[
\mathbb E\,\alpha_n(\delta,\Delta,R) \le C(1+LR)\,\mathbb E\sup\bigl\{|R_n(g-f)|:\ g\in\mathcal G(\delta,\Delta,R)\bigr\}.
\]
The last expectation can be further bounded by Lemma 7. As a result, we get the following bound that holds with probability at least $1-e^{-t}$:
\[
\alpha_n(\delta,\Delta,R) \le C(1+LR)\Bigl[\sqrt{\frac{\bar\kappa(J_f)}{\kappa(J_f)(1-\rho^2(J_f))}}\,\delta\sqrt{\frac{dm}{n}}
+ R\sqrt{\frac{\max_{j\in J_f}\sum_{k>m}\lambda_k^{(j)}}{n}}
+ R\max_{j\in J_f}\sqrt{\lambda_m^{(j)}}\sqrt{\frac{m\log d}{n}}
+ \Delta\sqrt{\frac{\log(N-d)+1}{n}}\Bigr]
+ C(1+LR)\,\delta\sqrt{\frac{t}{n}} + \frac{CR(1+LR)\,t}{n} =: \bar\alpha_n(\delta,\Delta,R;t).\tag{4.3}
\]
The next goal is to make the bound uniform in
\[
n^{-1/2}\le\delta\le 2R,\qquad n^{-1/2}\le\Delta\le R.\tag{4.4}
\]
To this end, consider
\[
\delta_j := 2R\,2^{-j},\qquad \Delta_j := R\,2^{-j}.
\]
We will replace $t$ by $t+2\log\log(2R\sqrt n)$ and use bound (4.3) for all $\delta=\delta_j$ and $\Delta=\Delta_k$ satisfying the conditions (4.4). By the union bound, with probability at least
\[
1-\log(R\sqrt n)\,\log(2R\sqrt n)\,\exp\bigl(-t-2\log\log(2R\sqrt n)\bigr) \ge 1-e^{-t},
\]
the following bound holds for all $j,k$ satisfying (4.4):
\[
\alpha_n(\delta_j,\Delta_k,R) \le \bar\alpha_n\bigl(\delta_j,\Delta_k,R;\ t+2\log\log(2R\sqrt n)\bigr).
\]
It is enough now to substitute $t:=A\log N$ in the above bound and to use the fact that the functions $\alpha_n(\delta,\Delta,R)$ and $\bar\alpha_n(\delta,\Delta,R;t)$ are nondecreasing with respect to $\delta$ and $\Delta$. Together with the conditions $R\le e^N$ and $N\ge n^{\gamma}$, this implies the claim.

References

[1] Bousquet, O. and Herrmann, D. (2003), On the complexity of learning the kernel matrix, in: Advances in Neural Information Processing Systems 15.
[2] Blanchard, G., Bousquet, O. and Massart, P. (2008), Statistical performance of support vector machines, Annals of Statistics, 36, 489-531.
[3] Crammer, K., Keshet, J. and Singer, Y. (2003), Kernel design using boosting, in: Advances in Neural Information Processing Systems 15.
[4] Koltchinskii, V. (2008), Sparsity in penalized empirical risk minimization, Ann. Inst. H. Poincaré, to appear.
[5] Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L. and Jordan, M. (2004), Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research, 5, 27-72.
[6] Ledoux, M. and Talagrand, M. (1991), Probability in Banach Spaces, Springer, New York.
[7] Lin, Y. and Zhang, H. (2006), Component selection and smoothing in multivariate nonparametric regression, Annals of Statistics, 34, 2272-2297.
[8] Micchelli, C. and Pontil, M. (2005), Learning the kernel function via regularization, Journal of Machine Learning Research, 6, 1099-1125.
[9] Mendelson, S. (2002), Geometric parameters of kernel machines, in: COLT 2002, Lecture Notes in Artificial Intelligence, 2375, Springer, 29-43.
[10] Ravikumar, P., Liu, H., Lafferty, J. and Wasserman, L. (2007), SpAM: sparse additive models, to appear in: Advances in Neural Information Processing Systems (NIPS 07).
[11] Srebro, N. and Ben-David, S. (2006), Learning bounds for support vector machines with learned kernels, in: Proceedings of the 19th Annual Conference on Learning Theory (COLT 2006), 169-183.
[12] Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: SIAM.
[13] Yuan, M. (2007), Nonnegative garrote component selection in functional ANOVA models, in: Proceedings of AI and Statistics (AISTAT 07), 656-662.