Generalization Bounds for Learning Kernels

Corinna Cortes, Google Research, 76 Ninth Avenue, New York, NY 10011. CORINNA@GOOGLE.COM
Mehryar Mohri, Courant Institute of Mathematical Sciences and Google Research, 251 Mercer Street, New York, NY 10012. MOHRI@CIMS.NYU.EDU
Afshin Rostamizadeh, Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012. ROSTAMI@CS.NYU.EDU

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

This paper presents several novel generalization bounds for the problem of learning kernels based on a combinatorial analysis of the Rademacher complexity of the corresponding hypothesis sets. Our bound for learning kernels with a convex combination of p base kernels using L1 regularization admits only a $\sqrt{\log p}$ dependency on the number of kernels, which is tight and considerably more favorable than the previous best bound given for the same problem. We also give a novel bound for learning with a non-negative combination of p base kernels with an L2 regularization whose dependency on p is also tight and only in $p^{1/4}$. We present similar results for Lq regularization with other values of q, and outline the relevance of our proof techniques to the analysis of the complexity of the class of linear functions. Experiments with a large number of kernels further validate the behavior of the generalization error as a function of p predicted by our bounds.

1. Introduction

Kernel methods are widely used in statistical learning (Schölkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004). Positive definite symmetric (PDS) kernels implicitly specify an inner product in a Hilbert space where large-margin techniques are used for learning and estimation. They can be combined with algorithms such as support vector machines (SVMs) (Boser et al., 1992; Cortes & Vapnik, 1995; Vapnik, 1998) or other kernel-based algorithms to form powerful learning techniques.

But the choice of the kernel, which is critical to the success of these algorithms, is typically left to the user. Rather than requesting the user to commit to a specific kernel, which may not be optimal, especially if the user's prior knowledge about the task is poor, learning kernel methods require the user only to supply a family of kernels. The learning algorithm then selects both the specific kernel out of that family and the hypothesis defined based on that kernel.

There is a large body of literature dealing with various aspects of the problem of learning kernels, including theoretical questions, optimization problems related to this problem, and experimental results (Lanckriet et al., 2004; Argyriou et al., 2005; 2006; Srebro & Ben-David, 2006; Lewis et al., 2006; Zien & Ong, 2007; Bach, 2008; Cortes et al., 2009a; Ying & Campbell, 2009). Some of this previous work considers families of Gaussian kernels (Micchelli & Pontil, 2005) or hyperkernels (Ong et al., 2005). Non-linear combinations of kernels have also been recently considered by Bach (2008) and Cortes et al. (2009b). But the most common family of kernels examined is that of non-negative or convex combinations of some fixed kernels constrained by a trace condition, which can be viewed as an L1 regularization (Lanckriet et al., 2004), or by an L2 regularization (Cortes et al., 2009a).

This paper presents several novel generalization bounds for the problem of learning kernels with the family of non-negative combinations of base kernels with an L1 or L2 constraint, or Lq constraints with some other values of q.

One of the first learning bounds given by Lanckriet et al. (2004) for the family of convex combinations of p base kernels with an L1 constraint has the following form:
\[
R(h) \le \widehat R_\rho(h) + O\Bigg(\frac{1}{\sqrt{m}}\sqrt{\max_{k=1}^p \operatorname{Tr}(\mathbf K_k)\,\max_{k=1}^p\big(\|\mathbf K_k\|/\operatorname{Tr}(\mathbf K_k)\big)\big/\rho^2}\Bigg),
\]
where $R(h)$ is the generalization error of a hypothesis $h$, $\widehat R_\rho(h)$ is the fraction of training points with margin less than $\rho$, and $\mathbf K_k$ is the kernel matrix associated to the $k$-th base kernel.
This bound and a similar one by Bousquet & Herrmann (2002) were both shown by Srebro & Ben-David (2006) to be always larger than one. Another bound by Lanckriet et al. (2004) for the family of linear or non-convex combinations of kernels was also shown, by the same authors, to be always larger than one. But Lanckriet et al. (2004) also presented a multiplicative bound for convex combinations of base kernels with an L1 constraint that is of the form
\[
R(h) \le \widehat R_\rho(h) + O\Bigg(\sqrt{\frac{p/\rho^2}{m}}\Bigg).
\]
This bound converges and can perhaps be viewed as the first informative generalization bound for this family of kernels. However, the dependence of this bound on the number of kernels p is multiplicative, which therefore does not encourage the use of too many base kernels.

Srebro & Ben-David (2006) presented a generalization bound based on the pseudo-dimension of the family of kernels that significantly improved on this bound. Their bound has the form
\[
R(h) \le \widehat R_\rho(h) + \widetilde O\Bigg(\sqrt{\frac{p + R^2/\rho^2}{m}}\Bigg),
\]
where the notation $\widetilde O(\cdot)$ hides logarithmic terms and where $R^2$ is an upper bound on $K_k(x, x)$ for all points $x$ and base kernels $K_k$, $k \in [1, p]$. Thus, disregarding logarithmic terms, their bound is only additive in p. Their analysis also applies to other families of kernels.

Ying & Campbell (2009) also gave generalization bounds for learning kernels based on the notion of Rademacher chaos complexity and the pseudo-dimension of the family of kernels used. For a pseudo-dimension of p, as in the case of a convex combination of p base kernels, their bound is in $O\big(\sqrt{p\,(R^2/\rho^2)\log(m)/m}\big)$ and is thus multiplicative in p. It seems to be weaker than the bound of Lanckriet et al. (2004) and that of Srebro & Ben-David (2006) for such kernel families.

We present new generalization bounds for the family of convex combinations of base kernels with an L1 constraint that have only a logarithmic dependency on p. Our learning bounds are based on a combinatorial analysis of the Rademacher complexity of the hypothesis set considered and have the form
\[
R(h) \le \widehat R_\rho(h) + O\Bigg(\sqrt{\frac{(\log p)\, R^2/\rho^2}{m}}\Bigg).
\]
Our bound is simpler, contains no other extra logarithmic term, and its $\sqrt{\log p}$ dependency is tight. Thus, this represents a substantial improvement over the previous best bounds for this problem. Our bound is also valid for a very large number of kernels, in particular for $p \gg m$, while the previous bounds were not informative in that case.

We note that Koltchinskii & Yuan (2008) also presented a bound with logarithmic dependence on p in the context of the study of large ensembles of kernel machines. However, their analysis is specific to the family of kernel-based regularization algorithms and requires the loss function to be strongly convex, which rules out for example the binary classification loss function. Also, both the statement of the result and the proof seem to be considerably more complicated than ours.

We also give a novel bound for learning with a non-negative combination of p base kernels with an L2 regularization whose dependency on p is also tight and only in $p^{1/4}$. We present similar results for Lq regularization with other values of q.

The next section (Section 2) defines the family of kernels and hypothesis sets we examine. Section 3 presents a bound on the Rademacher complexity of the class of convex combinations of base kernels with an L1 constraint and a generalization bound for binary classification directly derived from that result. Similarly, Section 4 presents first a bound on the Rademacher complexity, then a generalization bound for Lq regularization for some other values of q > 1. We make a number of comparisons with existing bounds and conclude by discussing the relevance of our proof techniques to the analysis of the complexity of the class of linear functions (Section 5).
2. Preliminaries

Let X denote the input space. For any kernel function K, we denote by $\Phi_K \colon X \to \mathbb H_K$ the feature mapping from X to the reproducing kernel Hilbert space $\mathbb H_K$ induced by K. Most learning kernel algorithms are based on a hypothesis set $H_p^q$ derived from non-negative combinations of a fixed set of $p \ge 1$ kernels $K_1, \ldots, K_p$, with the mixture weights obeying an Lq constraint:
\[
H_p^q = \Big\{ x \mapsto w \cdot \Phi_K(x) \,:\, K = \sum_{k=1}^p \mu_k K_k,\ \mu \in \Delta_q,\ \|w\| \le 1 \Big\},
\quad \text{with } \Delta_q = \Big\{ \mu \,:\, \mu \ge 0,\ \sum_{k=1}^p \mu_k^q = 1 \Big\}.
\]
Linear combinations with possibly negative mixture weights have also been considered in the literature, e.g., (Lanckriet et al., 2004), with the additional requirement that the combined kernel be PDS.

We bound, for different values of q, including q = 1 and q = 2, the empirical Rademacher complexity $\widehat{\mathfrak R}_S(H_p^q)$ of these families for an arbitrary sample S of size m, which immediately yields a generalization bound for learning kernels based on these families of hypotheses. For a fixed sample $S = (x_1, \ldots, x_m)$, the empirical Rademacher complexity of a hypothesis set H is defined by
\[
\widehat{\mathfrak R}_S(H) = \frac{1}{m}\, \mathbb E_{\boldsymbol\sigma}\Big[ \sup_{h \in H} \sum_{i=1}^m \sigma_i h(x_i) \Big],
\]
where the expectation is taken over $\boldsymbol\sigma = (\sigma_1, \ldots, \sigma_m)$, with the $\sigma_i \in \{-1, +1\}$, $i \in [1, m]$, independent uniform random variables.

For any kernel function K, we denote by $\mathbf K = [K(x_i, x_j)] \in \mathbb R^{m \times m}$ the kernel matrix associated to K and the sample S. Let $w_S = \sum_{i=1}^m \alpha_i \Phi_K(x_i)$ be the orthogonal projection of w on $\mathbb H_S = \operatorname{span}(\Phi_K(x_1), \ldots, \Phi_K(x_m))$. Then, w can be written as $w = w_S + w^\perp$, with $w_S \cdot w^\perp = 0$. Thus, $\|w\|^2 = \|w_S\|^2 + \|w^\perp\|^2$, which, in view of $\|w\| \le 1$, implies $\|w_S\|^2 \le 1$. Since $\|w_S\|^2 = \alpha^\top \mathbf K \alpha$, this gives
\[
\alpha^\top \mathbf K \alpha \le 1. \tag{1}
\]
Observe also that for all $x \in S$,
\[
h(x) = w \cdot \Phi_K(x) = w_S \cdot \Phi_K(x) = \sum_{i=1}^m \alpha_i K(x_i, x). \tag{2}
\]
Conversely, any function $x \mapsto \sum_{i=1}^m \alpha_i K(x_i, x)$ with $\alpha^\top \mathbf K \alpha \le 1$ is clearly an element of $H_p^1$.

3. Rademacher complexity bound for $H_p^1$

Our bounds on the empirical Rademacher complexity of the families $H_p^1$ or $H_p^q$, for q = 2 or other values of q, rely on the following result, which we prove using a combinatorial argument (see appendix).

Lemma 1. Let $\mathbf K$ be the kernel matrix of a kernel function K associated to a sample S. Then, for any integer r, the following inequality holds:
\[
\mathbb E_{\boldsymbol\sigma}\big[ (\boldsymbol\sigma^\top \mathbf K \boldsymbol\sigma)^r \big] \le \big( \eta_0\, r\, \operatorname{Tr}[\mathbf K] \big)^r,
\qquad \text{where } \eta_0 = \tfrac{23}{22}.
\]

This result can be viewed as a Khintchine-Kahane type inequality. In fact, it might be possible to benefit from the best constants for the vectorial version of this inequality to further improve the constant of the lemma. We will discuss this connection and its benefits in a longer version of this paper. For r = 1, the result holds with $\eta_0$ replaced with 1, as seen in classical derivations for the estimation of the Rademacher complexity of linear classes.
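As a quick numerical sanity check of Lemma 1 (not part of the original paper), the following sketch estimates $\mathbb E_{\boldsymbol\sigma}[(\boldsymbol\sigma^\top \mathbf K \boldsymbol\sigma)^r]$ by Monte Carlo for a random PSD Gram matrix and compares it with $(\eta_0\, r\, \operatorname{Tr}[\mathbf K])^r$. The sample size, power r, and number of draws are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

m, r = 50, 3                      # sample size and integer power r
X = rng.normal(size=(m, 5))
K = X @ X.T                       # a random PSD kernel (Gram) matrix

# Monte Carlo estimate of E_sigma[(sigma^T K sigma)^r]
n_draws = 100_000
sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
quad = np.einsum('ni,ij,nj->n', sigma, K, sigma)   # sigma^T K sigma for each draw
lhs = np.mean(quad ** r)

eta0 = 23.0 / 22.0
rhs = (eta0 * r * np.trace(K)) ** r                # right-hand side of Lemma 1

print(f"E[(s^T K s)^r] ~ {lhs:.3e}  <=  (eta0 r Tr K)^r = {rhs:.3e}")
```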
Proposition 1. Let $q, r \ge 1$ with $\frac1q + \frac1r = 1$. For any sample S of size m, the empirical Rademacher complexity of the hypothesis set $H_p^q$ can be expressed as
\[
\widehat{\mathfrak R}_S(H_p^q) = \frac{1}{m}\, \mathbb E_{\boldsymbol\sigma}\big[ \|\mathbf u_\sigma\|_r^{1/2} \big],
\qquad \text{with } \mathbf u_\sigma = \big( \boldsymbol\sigma^\top \mathbf K_1 \boldsymbol\sigma, \ldots, \boldsymbol\sigma^\top \mathbf K_p \boldsymbol\sigma \big)^\top.
\]

Proof. Fix a sample $S = (x_1, \ldots, x_m)$, and denote by $M_q = \{\mu \ge 0 : \|\mu\|_q = 1\}$ and by $A = \{\alpha : \alpha^\top \mathbf K \alpha \le 1\}$. Then, in view of (1) and (2), the Rademacher complexity $\widehat{\mathfrak R}_S(H_p^q)$ can be expressed as follows:
\[
\widehat{\mathfrak R}_S(H_p^q)
= \frac1m\, \mathbb E_{\boldsymbol\sigma}\Big[ \sup_{h \in H_p^q} \sum_{i=1}^m \sigma_i h(x_i) \Big]
= \frac1m\, \mathbb E_{\boldsymbol\sigma}\Big[ \sup_{\mu \in M_q,\, \alpha \in A} \sum_{i,j=1}^m \alpha_i \sigma_j K(x_i, x_j) \Big]
= \frac1m\, \mathbb E_{\boldsymbol\sigma}\Big[ \sup_{\mu \in M_q,\, \alpha \in A} \alpha^\top \mathbf K \boldsymbol\sigma \Big].
\]
Now, by the Cauchy-Schwarz inequality, the supremum $\sup_{\alpha \in A} \alpha^\top \mathbf K \boldsymbol\sigma$ is reached for $\mathbf K^{1/2}\alpha$ collinear with $\mathbf K^{1/2}\boldsymbol\sigma$, which gives $\sup_{\alpha \in A} \alpha^\top \mathbf K \boldsymbol\sigma = \sqrt{\boldsymbol\sigma^\top \mathbf K \boldsymbol\sigma}$. Thus,
\[
\widehat{\mathfrak R}_S(H_p^q)
= \frac1m\, \mathbb E_{\boldsymbol\sigma}\Big[ \sup_{\mu \in M_q} \sqrt{\boldsymbol\sigma^\top \mathbf K \boldsymbol\sigma} \Big]
= \frac1m\, \mathbb E_{\boldsymbol\sigma}\Big[ \big( \sup_{\mu \in M_q} \mu \cdot \mathbf u_\sigma \big)^{1/2} \Big].
\]
By the definition of the dual norm, $\sup_{\mu \in M_q} \mu \cdot \mathbf u_\sigma = \|\mathbf u_\sigma\|_r$, which gives $\widehat{\mathfrak R}_S(H_p^q) = \frac1m\, \mathbb E_{\boldsymbol\sigma}[\|\mathbf u_\sigma\|_r^{1/2}]$ and concludes the proof.

Theorem 1. For any sample S of size m, the empirical Rademacher complexity of the hypothesis set $H_p^1$ can be bounded as follows:
\[
\forall r \in \mathbb N,\ r \ge 1, \qquad
\widehat{\mathfrak R}_S(H_p^1) \le \frac{\sqrt{\eta_0\, r\, \|\mathbf u\|_r}}{m},
\]
where $\mathbf u = (\operatorname{Tr}[\mathbf K_1], \ldots, \operatorname{Tr}[\mathbf K_p])^\top$ and $\eta_0 = \frac{23}{22}$.

Proof. By Proposition 1, $\widehat{\mathfrak R}_S(H_p^1) = \frac1m \mathbb E_{\boldsymbol\sigma}[\|\mathbf u_\sigma\|_\infty^{1/2}]$. Since for any $r \ge 1$, $\|\mathbf u_\sigma\|_\infty \le \|\mathbf u_\sigma\|_r$, we can upper bound the Rademacher complexity as follows:
\[
\widehat{\mathfrak R}_S(H_p^1)
\le \frac1m\, \mathbb E_{\boldsymbol\sigma}\big[ \|\mathbf u_\sigma\|_r^{1/2} \big]
= \frac1m\, \mathbb E_{\boldsymbol\sigma}\Big[ \Big( \sum_{k=1}^p (\boldsymbol\sigma^\top \mathbf K_k \boldsymbol\sigma)^r \Big)^{\frac{1}{2r}} \Big]
\le \frac1m\, \Big( \sum_{k=1}^p \mathbb E_{\boldsymbol\sigma}\big[ (\boldsymbol\sigma^\top \mathbf K_k \boldsymbol\sigma)^r \big] \Big)^{\frac{1}{2r}}
\quad \text{(Jensen's inequality)}.
\]
Assume that $r \ge 1$ is an integer. Then, by Lemma 1, for any $k \in [1, p]$, $\mathbb E_{\boldsymbol\sigma}[(\boldsymbol\sigma^\top \mathbf K_k \boldsymbol\sigma)^r] \le (\eta_0\, r\, \operatorname{Tr}[\mathbf K_k])^r$. Using these inequalities gives
\[
\widehat{\mathfrak R}_S(H_p^1) \le \frac1m \Big( \sum_{k=1}^p \big(\eta_0\, r\, \operatorname{Tr}[\mathbf K_k]\big)^r \Big)^{\frac{1}{2r}}
= \frac{\sqrt{\eta_0\, r\, \|\mathbf u\|_r}}{m},
\]
which concludes the proof.

Theorem 2. Let $p > 1$ and assume that $K_k(x, x) \le R^2$ for all $x \in X$ and $k \in [1, p]$. Then, for any sample S of size m, the Rademacher complexity of the hypothesis set $H_p^1$ can be bounded as follows:
\[
\widehat{\mathfrak R}_S(H_p^1) \le \sqrt{\frac{\eta_0\, e\, \lceil \log p \rceil\, R^2}{m}}.
\]

Proof. Since $K_k(x, x) \le R^2$ for all $x \in X$ and $k \in [1, p]$, $\operatorname{Tr}[\mathbf K_k] \le m R^2$ for all $k \in [1, p]$. Thus, by Theorem 1, for any integer $r > 1$, the Rademacher complexity can be bounded as follows:
\[
\widehat{\mathfrak R}_S(H_p^1) \le \frac1m \big( p\, (\eta_0\, r\, m R^2)^r \big)^{\frac{1}{2r}}
= \sqrt{\frac{\eta_0\, r\, p^{1/r} R^2}{m}}.
\]
For $p > 1$, the function $r \mapsto p^{1/r}\, r$ reaches its minimum at $r_0 = \lceil \log p \rceil$, which gives $\widehat{\mathfrak R}_S(H_p^1) \le \sqrt{\eta_0\, e\, \lceil \log p \rceil\, R^2 / m}$.

Note that more generally, without assuming $K_k(x, x) \le R^2$ for all k and all x, the same proof yields the following result:
\[
\widehat{\mathfrak R}_S(H_p^1) \le \frac{\sqrt{\eta_0\, e\, \lceil \log p \rceil\, \|\mathbf u\|_\infty}}{m}.
\]
Remarkably, the bound of the theorem has a very mild dependence on p.
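To make Theorems 1 and 2 concrete, the sketch below estimates the empirical Rademacher complexity of $H_p^1$ by Monte Carlo, using the exact expression of Proposition 1 (for q = 1 the conjugate exponent is $r = \infty$), and compares it with the bound of Theorem 2. The Gaussian base kernels and all numerical settings are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 100, 20
X = rng.normal(size=(m, 10))

# p illustrative Gaussian base kernels with different widths
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
gammas = np.logspace(-2, 1, p)
Ks = np.stack([np.exp(-g * sq_dists) for g in gammas])   # shape (p, m, m)

# Proposition 1 with q = 1 (r = infinity):
#   R_S(H_p^1) = (1/m) E_sigma[ max_k sqrt(sigma^T K_k sigma) ]
n_draws = 5000
sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
quad = np.einsum('ni,kij,nj->nk', sigma, Ks, sigma)      # sigma^T K_k sigma
emp_rad = np.mean(np.sqrt(quad.max(axis=1))) / m

# Theorem 2 bound: sqrt(eta0 * e * ceil(log p) * R^2 / m), with K_k(x,x) <= R^2
eta0 = 23.0 / 22.0
R2 = Ks.diagonal(axis1=1, axis2=2).max()                 # here R^2 = 1 for Gaussian kernels
bound = np.sqrt(eta0 * np.e * np.ceil(np.log(p)) * R2 / m)

print(f"Monte Carlo estimate of R_S(H_p^1): {emp_rad:.4f}")
print(f"Theorem 2 upper bound:              {bound:.4f}")
```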
The theorem can be used to derive generalization bounds for learning kernels in classification, regression, and other tasks. We briefly illustrate its application to binary classification, where the labels y are in $\{-1, +1\}$. Let $R(h)$ denote the generalization error of $h \in H_p^1$, that is $R(h) = \Pr[y h(x) < 0]$. For a training sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$ and any $\rho > 0$, define the $\rho$-empirical margin loss $\widehat R_\rho(h)$ as follows:
\[
\widehat R_\rho(h) = \frac1m \sum_{i=1}^m \min\big( 1, [1 - y_i h(x_i)/\rho]_+ \big).
\]
Note that $\widehat R_\rho(h)$ is always upper bounded by the fraction of the training points with margin less than $\rho$:
\[
\widehat R_\rho(h) \le \frac1m \sum_{i=1}^m 1_{y_i h(x_i) < \rho}.
\]
The following gives a margin-based generalization bound for the hypothesis set $H_p^1$.

Corollary 1. Fix $\rho > 0$. Then, for any integer $r > 1$ and any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_p^1$,
\[
R(h) \le \widehat R_\rho(h) + \frac2\rho\, \frac{\sqrt{\eta_0\, r\, \|\mathbf u\|_r}}{m} + 3 \sqrt{\frac{\log\frac2\delta}{2m}},
\]
with $\mathbf u = (\operatorname{Tr}[\mathbf K_1], \ldots, \operatorname{Tr}[\mathbf K_p])^\top$ and $\eta_0 = \frac{23}{22}$. If additionally $K_k(x, x) \le R^2$ for all $x \in X$ and $k \in [1, p]$, then, for $p > 1$,
\[
R(h) \le \widehat R_\rho(h) + 2 \sqrt{\frac{\eta_0\, e\, \lceil \log p \rceil\, R^2/\rho^2}{m}} + 3 \sqrt{\frac{\log\frac2\delta}{2m}}.
\]

Proof. With our definition of the Rademacher complexity, for any $\delta > 0$, with probability at least $1 - \delta$, the following bound holds for any $h \in H_p^1$ (Koltchinskii & Panchenko, 2002; Bartlett & Mendelson, 2002):
\[
R(h) \le \widehat R_\rho(h) + \frac2\rho\, \widehat{\mathfrak R}_S(H_p^1) + 3 \sqrt{\frac{\log\frac2\delta}{2m}}.
\]
Plugging in the bound on the empirical Rademacher complexity given by Theorem 1 and Theorem 2 yields the statement of the corollary.

The bound of the corollary can be straightforwardly extended to hold uniformly over all choices of $\rho$, using standard techniques introduced by Koltchinskii & Panchenko (2002), at the price of the additional term $\sqrt{\frac{\log\log_2(4R/\rho)}{m}}$ on the right-hand side. The corollary gives a generalization bound for learning kernels with $H_p^1$ that is in $O\big(\sqrt{(\log p)\, R^2/\rho^2\, /\, m}\big)$.

In comparison, the best previous bound for learning kernels with convex combinations, given by Srebro & Ben-David (2006) and derived using the pseudo-dimension, has a stronger dependency with respect to p and is more complex:
\[
\sqrt{\frac{8 \Big( 2 + p \log \frac{128\, e\, m^3 R^2}{\rho^2 p} + 256\, \frac{R^2}{\rho^2} \log \frac{e m \rho}{8 R}\, \log \frac{128\, m R^2}{\rho^2} \Big)}{m}}.
\]
This bound is also not informative for $p > m$. Figure 1 compares the bound on $R(h) - \widehat R_\rho(h)$ obtained using this expression by Srebro and Ben-David with the new bound of Corollary 1, as a function of the sample size m. The comparison is made for different values of p, a normalized margin of $\rho/R = .2$, and the confidence parameter set to $\delta = .01$. Plots for different values of $\rho/R$ are quite similar. As shown by the figure, larger values of p can significantly affect the bound of Srebro and Ben-David, leading to quasi-flat plots for $p > m^{4/5}$.

Figure 1. Plots of the bound of Srebro & Ben-David (2006) (dashed lines) and our new bounds (solid lines) as a function of the sample size m (in millions), for $\delta = .01$, $\rho/R = .2$, and $p \in \{m, m^{5/6}, m^{4/5}, m^{3/5}, m^{1/3}, 10\}$. For these values and $m \le 15 \times 10^6$, the bound of Srebro and Ben-David is always above 1; it is of course converging for sufficiently large m. The plots for p = 10 and $p = m^{1/3}$ roughly coincide in the case of the bound of Srebro & Ben-David (2006), which makes the first one not visible.
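A Figure-1-style comparison can be reproduced numerically with the short sketch below. It is illustrative only: the Srebro & Ben-David expression is implemented exactly as reconstructed above (an assumption about the original formula), and the settings $\delta = .01$, $\rho/R = .2$ follow the figure.

```python
import numpy as np

eta0 = 23.0 / 22.0
delta, rho_over_R = 0.01, 0.2
R2_over_rho2 = 1.0 / rho_over_R ** 2

def our_bound(m, p):
    # Corollary 1 with the Theorem 2 complexity term (assumes K_k(x,x) <= R^2)
    return (2 * np.sqrt(eta0 * np.e * np.ceil(np.log(p)) * R2_over_rho2 / m)
            + 3 * np.sqrt(np.log(2 / delta) / (2 * m)))

def srebro_ben_david_bound(m, p):
    # Pseudo-dimension-based bound, as reconstructed in the text above
    inner = 8 * (2 + p * np.log(128 * np.e * m**3 * R2_over_rho2 / p)
                 + 256 * R2_over_rho2 * np.log(np.e * m * rho_over_R / 8)
                 * np.log(128 * m * R2_over_rho2))
    return np.sqrt(inner / m)

for m in [10**5, 10**6, 10**7]:
    for p in [10, int(m ** (1 / 3)), m]:
        print(f"m={m:>9d} p={p:>9d}  ours={our_bound(m, p):7.3f}  "
              f"SBD={srebro_ben_david_bound(m, p):7.3f}")
```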
In comparison, the plots for our new bound show only a mild variation with p, even for relatively large values such as p = m. Note also that, while the bound of Srebro and Ben-David does converge and eventually becomes informative, its values, even for p = 10, are still above 1 for fairly large values of m. The new bound, in contrast, strongly encourages considering large numbers of base kernels in learning kernels. It was brought to our attention by an ICML reviewer that a bound similar to that of Theorem 2, with somewhat less favorable constants and for the expected value, was recently derived by Kakade et al. (2010) using a strong-convexity/smoothness argument.

Lower bound. The $\sqrt{\log p}$ dependency of our generalization bound with respect to p cannot be improved upon. This can be seen by arguments in connection with the VC dimension lower bounds. Consider the case where the input space is $X = \{-1, +1\}^p$ and where the feature mapping of each base kernel $K_k$, $k \in [1, p]$, is simply the canonical projection $x \mapsto +x_k$ or $x \mapsto -x_k$, where $x_k$ is the k-th component of $x \in X$. Thus, $H_p^1$ then contains the hypothesis set $J_p = \{x \mapsto s\, x_k : k \in [1, p],\ s \in \{-1, +1\}\}$, whose VC dimension is in $\Theta(\log p)$. For $\rho = 1$ and $h \in J_p$, for any $x_i \in X$, $y_i h(x_i) < \rho$ is equivalent to $y_i h(x_i) < 0$. Thus, the empirical margin loss $\widehat R_\rho(h)$ coincides with the standard empirical error $\widehat R(h)$ for $h \in J_p$, and a margin bound with $\rho = 1$ implies a standard generalization bound with the same complexity term. By the classical VC dimension lower bounds (Devroye & Lugosi, 1995; Anthony & Bartlett, 1999), that complexity term must be at least in $\sqrt{\operatorname{VCDim}(J_p)/m} = \Omega(\sqrt{\log p / m})$. A related simple example showing this lower bound was also suggested to us by N. Srebro.

We have also tested experimentally the behavior of the test error as a function of p and compared it to that of the theoretical bound given by Corollary 1, by learning with a large number of kernels, $p \in [200, 800]$, a sample size of $m = 36{,}000$, and a normalized margin of $\rho/R = .2$. These results are for rank-1 base kernels generated from individual features of the MNIST dataset (http://yann.lecun.com/exdb/mnist/). The magnitude of each kernel weight is chosen proportionally to the correlation of the corresponding feature with the training labels. The results show that the behavior of the test error as a function of p matches the one predicted by our bound, see Figure 2(a).

Figure 2. Variation of the empirical test error and R(h) as a function of the number of kernels p, for R(h) given by (a) Corollary 1 for L1 regularization; (b) Corollary 2 for L2 regularization. For these experiments, m = 36,000, $\rho/R = .2$, and $\delta = .01$.

4. Rademacher complexity bound for $H_p^q$

This section presents bounds on the Rademacher complexity of the hypothesis sets $H_p^q$ for various values of q > 1, including q = 2.

Theorem 3. Let $q, r \ge 1$ with $\frac1q + \frac1r = 1$ and assume that r is an integer. Then, for any sample S of size m, the empirical Rademacher complexity of the hypothesis set $H_p^q$ can be bounded as follows:
\[
\widehat{\mathfrak R}_S(H_p^q) \le \frac{\sqrt{\eta_0\, r\, \|\mathbf u\|_r}}{m},
\]
where $\mathbf u = (\operatorname{Tr}[\mathbf K_1], \ldots, \operatorname{Tr}[\mathbf K_p])^\top$ and $\eta_0 = \frac{23}{22}$.

Proof. By Proposition 1, $\widehat{\mathfrak R}_S(H_p^q) = \frac1m \mathbb E_{\boldsymbol\sigma}[\|\mathbf u_\sigma\|_r^{1/2}]$ with $\mathbf u_\sigma = (\boldsymbol\sigma^\top \mathbf K_1 \boldsymbol\sigma, \ldots, \boldsymbol\sigma^\top \mathbf K_p \boldsymbol\sigma)^\top$. The rest of the proof is identical to that of Theorem 1: using Jensen's inequality and Lemma 1, which applies because r is an integer, we obtain similarly
\[
\widehat{\mathfrak R}_S(H_p^q) \le \frac1m \Big( \sum_{k=1}^p \big(\eta_0\, r\, \operatorname{Tr}[\mathbf K_k]\big)^r \Big)^{\frac{1}{2r}} = \frac{\sqrt{\eta_0\, r\, \|\mathbf u\|_r}}{m}.
\]

In particular, for q = r = 2, the theorem implies
\[
\widehat{\mathfrak R}_S(H_p^2) \le \frac{\sqrt{2\, \eta_0\, \|\mathbf u\|_2}}{m}.
\]

Theorem 4. Let $q, r \ge 1$ with $\frac1q + \frac1r = 1$ and assume that r is an integer. Let $p > 1$ and assume that $K_k(x, x) \le R^2$ for all $x \in X$ and $k \in [1, p]$. Then, for any sample S of size m, the Rademacher complexity of the hypothesis set $H_p^q$ can be bounded as follows:
\[
\widehat{\mathfrak R}_S(H_p^q) \le \sqrt{\frac{\eta_0\, r\, p^{1/r} R^2}{m}}.
\]

Proof. Since $K_k(x, x) \le R^2$ for all $x \in X$ and $k \in [1, p]$, $\operatorname{Tr}[\mathbf K_k] \le m R^2$ for all $k \in [1, p]$. Thus, by Theorem 3, the Rademacher complexity can be bounded as follows:
\[
\widehat{\mathfrak R}_S(H_p^q) \le \frac1m \big( p\, (\eta_0\, r\, m R^2)^r \big)^{\frac{1}{2r}} = \sqrt{\frac{\eta_0\, r\, p^{1/r} R^2}{m}}.
\]

The bound of the theorem has only a mild dependence ($p^{1/(2r)}$) on the number of kernels p. In particular, for q = r = 2, under the assumptions of the theorem,
\[
\widehat{\mathfrak R}_S(H_p^2) \le \sqrt{\frac{2\, \eta_0\, \sqrt{p}\, R^2}{m}},
\]
and the dependence is in $O(p^{1/4})$. Proceeding as in the L1 case leads to the following margin bound in binary classification.

Corollary 2. Let $q, r \ge 1$ with $\frac1q + \frac1r = 1$ and assume that r is an integer. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_p^q$,
\[
R(h) \le \widehat R_\rho(h) + \frac2\rho\, \frac{\sqrt{\eta_0\, r\, \|\mathbf u\|_r}}{m} + 3 \sqrt{\frac{\log\frac2\delta}{2m}},
\]
with $\mathbf u = (\operatorname{Tr}[\mathbf K_1], \ldots, \operatorname{Tr}[\mathbf K_p])^\top$ and $\eta_0 = \frac{23}{22}$. If additionally $K_k(x, x) \le R^2$ for all $x \in X$ and $k \in [1, p]$, then, for $p > 1$,
\[
R(h) \le \widehat R_\rho(h) + 2\, p^{\frac{1}{2r}} \sqrt{\frac{\eta_0\, r\, R^2/\rho^2}{m}} + 3 \sqrt{\frac{\log\frac2\delta}{2m}}.
\]
In particular, for q = r = 2, the generalization bound of the corollary becomes
\[
R(h) \le \widehat R_\rho(h) + 2\, p^{\frac14} \sqrt{\frac{2\, \eta_0\, R^2/\rho^2}{m}} + 3 \sqrt{\frac{\log\frac2\delta}{2m}}.
\]

Figure 3 shows a comparison of the L2 regularization bound of this corollary with the L1 regularization bound of Corollary 1. As can be seen from the plots, the two bounds are very close for smaller values of p. For larger values (p = m), the difference becomes significant. The bound for L2 regularization is converging for these values, but at a slower rate of $O\big(\frac{R/\rho}{m^{1/4}}\big)$.

Figure 3. Comparison of the L1 regularization bound of Corollary 1 and the L2 regularization bound of Corollary 2 (dotted lines) as a function of the sample size m (in millions), for $\delta = .01$, $\rho/R = .2$, and $p \in \{m, m^{1/3}, 20\}$. For p = 20, the L1 and L2 bounds roughly coincide.

As with the L1 bound, we also tested experimentally the behavior of the test error as a function of p and compared it to that of the theoretical bound given by Corollary 2, by learning with a large number of kernels. Again, our results show that the behavior of the test error as a function of p matches the one predicted by our bound, see Figure 2(b).
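The rank-1 base kernels used in the experiments above can be formed directly from individual input features, with weights proportional to the correlation of each feature with the labels. The sketch below shows one such construction; the synthetic data standing in for MNIST, the L1 normalization of the weights, and all names are illustrative assumptions rather than the exact experimental protocol of the paper.

```python
import numpy as np

def rank_one_base_kernels(X):
    """One rank-1 base kernel per feature: K_k[i, j] = X[i, k] * X[j, k]."""
    return np.einsum('ik,jk->kij', X, X)          # shape (p, m, m)

def correlation_weights(X, y):
    """Weight each base kernel by |corr(feature_k, labels)|."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.abs(corr)

# Illustrative synthetic data standing in for (a subset of) MNIST features.
rng = np.random.default_rng(2)
m, p = 500, 200
X = rng.uniform(size=(m, p))
y = np.sign(X[:, 0] - 0.5)

Ks = rank_one_base_kernels(X)                     # p base kernel matrices
mu = correlation_weights(X, y)
mu /= mu.sum()                                    # L1-normalized mixture weights
K = np.tensordot(mu, Ks, axes=1)                  # combined kernel sum_k mu_k K_k
print(K.shape, np.allclose(K, K.T))
```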
Lower bound. The $p^{1/(2r)}$ dependency of the generalization bound of Corollary 2 cannot be improved. In particular, the $p^{1/4}$ dependency is tight for the hypothesis set $H_p^2$. This is clear since, in particular, when all p kernel functions are equal, $\sum_{k=1}^p \mu_k K_k = \big( \sum_{k=1}^p \mu_k \big) K_1 \le p^{1/r} K_1$. $H_p^q$ then coincides with the set of functions in $H_1^q$, each multiplied by $p^{1/(2r)}$.

5. Proof techniques

Our proof techniques are somewhat general and apply similarly to other problems. In particular, they can be used as alternative methods to derive bounds on the Rademacher complexity of linear function classes, such as those given by Kakade et al. (2009) using strong convexity. In fact, in some cases, they can lead to similar bounds but with tighter constants. The following theorem illustrates that in the case of linear functions constrained by the norm $\|\cdot\|_q$.

Theorem 5. Let $q, r \ge 1$ with $\frac1q + \frac1r = 1$, with r an even integer such that $r \ge 2$. Let $\mathcal X = \{x : \|x\|_r \le X\}$, and let F be the class of linear functions over $\mathcal X$ defined by $F = \{x \mapsto w \cdot x : \|w\|_q \le W\}$. Then, for any sample $S = (x_1, \ldots, x_m)$, the following bound holds for the empirical Rademacher complexity of this class:
\[
\widehat{\mathfrak R}_S(F) \le X W \sqrt{\frac{\eta_0\, r}{2m}}.
\]

Clearly, this immediately yields the same bound on the Rademacher complexity $\mathfrak R_m(F) = \mathbb E_S[\widehat{\mathfrak R}_S(F)]$. The bound given by Kakade et al. (2009, Section 3.1) in this case is $\mathfrak R_m(F) \le X W \sqrt{\frac{r-1}{m}}$. Since $\eta_0\, r/2 \le r - 1$ for an even integer $r > 2$, our bound is always tighter.

Proof. The proof is similar to and uses that of Theorem 1. By the definition of the dual norms, the following holds:
\[
\widehat{\mathfrak R}_S(F) = \frac1m\, \mathbb E_{\boldsymbol\sigma}\Big[ \sup_{\|w\|_q \le W} \sum_{i=1}^m \sigma_i\, w \cdot x_i \Big]
= \frac{W}{m}\, \mathbb E_{\boldsymbol\sigma}\Big[ \Big\| \sum_{i=1}^m \sigma_i x_i \Big\|_r \Big].
\]
By Jensen's inequality,
\[
\mathbb E_{\boldsymbol\sigma}\Big[ \Big\| \sum_{i=1}^m \sigma_i x_i \Big\|_r \Big]
\le \Big( \mathbb E_{\boldsymbol\sigma}\Big[ \Big\| \sum_{i=1}^m \sigma_i x_i \Big\|_r^r \Big] \Big)^{\frac1r}
= \Big( \sum_{j=1}^N \mathbb E_{\boldsymbol\sigma}\Big[ \Big( \sum_{i=1}^m \sigma_i x_{ij} \Big)^r \Big] \Big)^{\frac1r},
\]
where we denote by N the dimension of the space and by $x_{ij}$ the j-th coordinate of $x_i$. Now, we can bound the term $\mathbb E_{\boldsymbol\sigma}[(\sum_{i=1}^m \sigma_i x_{ij})^r]$ using Lemma 1, applied with the rank-one matrix $[x_{ij} x_{lj}]_{i,l}$ and the integer exponent $r/2$, and obtain:
\[
\mathbb E_{\boldsymbol\sigma}\Big[ \Big( \sum_{i=1}^m \sigma_i x_{ij} \Big)^r \Big]
= \mathbb E_{\boldsymbol\sigma}\Big[ \Big( \sum_{i,l=1}^m \sigma_i \sigma_l\, x_{ij} x_{lj} \Big)^{r/2} \Big]
\le \Big( \eta_0\, \tfrac r2 \sum_{i=1}^m x_{ij}^2 \Big)^{r/2}.
\]
Thus,
\[
\widehat{\mathfrak R}_S(F)
\le \frac{W}{m} \sqrt{\eta_0\, \tfrac r2}\, \Big( \sum_{j=1}^N \Big( \sum_{i=1}^m x_{ij}^2 \Big)^{r/2} \Big)^{\frac1r}
= W \sqrt{\frac{\eta_0\, r}{2m}}\, \Big( \sum_{j=1}^N \Big( \frac1m \sum_{i=1}^m x_{ij}^2 \Big)^{r/2} \Big)^{\frac1r}.
\]
Since $r \ge 2$, by Jensen's inequality, $\big( \frac1m \sum_{i=1}^m x_{ij}^2 \big)^{r/2} \le \frac1m \sum_{i=1}^m x_{ij}^r$. Thus,
\[
\widehat{\mathfrak R}_S(F)
\le W \sqrt{\frac{\eta_0\, r}{2m}}\, \Big( \frac1m \sum_{j=1}^N \sum_{i=1}^m x_{ij}^r \Big)^{\frac1r}
= W \sqrt{\frac{\eta_0\, r}{2m}}\, \Big( \frac1m \sum_{i=1}^m \|x_i\|_r^r \Big)^{\frac1r}
\le W \sqrt{\frac{\eta_0\, r}{2m}}\, X,
\]
which concludes the proof.
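For concreteness, the comparison of constants made above can be checked numerically: for even r > 2, the factor $\sqrt{\eta_0 r/2}$ of Theorem 5 is smaller than the factor $\sqrt{r-1}$ in the bound of Kakade et al. (2009). A minimal check (illustrative code, not from the paper):

```python
import math

eta0 = 23.0 / 22.0
for r in [4, 6, 8, 16, 64]:            # even integers r > 2
    ours = math.sqrt(eta0 * r / 2)     # constant in Theorem 5 (times XW / sqrt(m))
    kakade = math.sqrt(r - 1)          # constant in Kakade et al. (2009)
    print(f"r={r:>3d}  sqrt(eta0*r/2)={ours:.3f}  sqrt(r-1)={kakade:.3f}")
```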
6. Conclusion

We presented several new generalization bounds for the problem of learning kernels with non-negative combinations of base kernels and outlined the relevance of our proof techniques to the analysis of the complexity of the class of linear functions. Our bounds are simpler and significantly improve over previous bounds. Their behavior matches empirical observations with a large number of base kernels. Their very mild dependency on the number of kernels suggests the use of a large number of kernels for this problem. Recent experiments by Cortes et al. (2009a; 2010) in regression using a large number of kernels seem to corroborate this idea. Much needs to be done, however, to combine these theoretical findings with the somewhat disappointing performance observed in practice in most learning kernel experiments.

Acknowledgments

We thank the ICML reviewers for insightful comments on an earlier draft of this paper and N. Srebro for discussions.

References

Anthony, Martin and Bartlett, Peter L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
Argyriou, Andreas, Micchelli, Charles, and Pontil, Massimiliano. Learning convex combinations of continuously parameterized basic kernels. In COLT, 2005.
Argyriou, Andreas, Hauser, Raphael, Micchelli, Charles, and Pontil, Massimiliano. A DC-programming algorithm for kernel selection. In ICML, 2006.
Bach, Francis. Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS, 2008.
Bartlett, Peter L. and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 2002.
Boser, Bernhard, Guyon, Isabelle, and Vapnik, Vladimir. A training algorithm for optimal margin classifiers. In COLT, 1992.
Bousquet, Olivier and Herrmann, Daniel J. L. On the complexity of learning the kernel matrix. In NIPS, 2002.
Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine Learning, 20(3), 1995.
Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. L2 regularization for learning kernels. In UAI, 2009a.
Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Learning non-linear combinations of kernels. In NIPS. MIT Press, 2009b.
Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Two-stage learning kernel methods. In ICML, 2010.
Devroye, Luc and Lugosi, Gábor. Lower bounds in pattern recognition and learning. Pattern Recognition, 28(7), 1995.
Kakade, Sham M., Sridharan, Karthik, and Tewari, Ambuj. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, 2009.
Kakade, Sham M., Shalev-Shwartz, Shai, and Tewari, Ambuj. Applications of strong convexity-strong smoothness duality to learning with matrices, 2010. arXiv:0910.0610v1.
Koltchinskii, V. and Panchenko, D. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.
Koltchinskii, Vladimir and Yuan, Ming. Sparse recovery in large ensembles of kernel machines. In COLT, 2008.
Lanckriet, Gert, Cristianini, Nello, Bartlett, Peter, Ghaoui, Laurent El, and Jordan, Michael. Learning the kernel matrix with semidefinite programming. JMLR, 5, 2004.
Lewis, Darrin P., Jebara, Tony, and Noble, William Stafford. Nonstationary kernel combination. In ICML, 2006.
Micchelli, Charles and Pontil, Massimiliano. Learning the kernel function via regularization. JMLR, 6, 2005.
Ong, Cheng Soon, Smola, Alex, and Williamson, Robert. Learning the kernel with hyperkernels. JMLR, 6, 2005.
Schölkopf, Bernhard and Smola, Alex. Learning with Kernels. MIT Press: Cambridge, MA, 2002.
Shawe-Taylor, John and Cristianini, Nello. Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.
Srebro, Nathan and Ben-David, Shai. Learning bounds for support vector machines with learned kernels. In COLT, 2006.
Vapnik, Vladimir N. Statistical Learning Theory. John Wiley & Sons, 1998.
Ying, Yiming and Campbell, Colin. Generalization bounds for learning the kernel problem. In COLT, 2009.
Zien, Alexander and Ong, Cheng Soon. Multiclass multiple kernel learning. In ICML, 2007.
A. Bound on Multinomial Coefficients

In the proof of Lemma 1, we need to upper bound the ratio $\binom{2r}{2t_1, \ldots, 2t_m} / \binom{r}{t_1, \ldots, t_m}$. The following rough but straightforward inequality is sufficient to derive a bound on the Rademacher complexity with somewhat less favorable constants:
\[
\binom{2r}{2t_1, \ldots, 2t_m}
= \frac{(2r)!}{(2t_1)! \cdots (2t_m)!}
\le \frac{(2r)!}{(t_1)! \cdots (t_m)!}
\le \frac{(2r)^r\, r!}{(t_1)! \cdots (t_m)!}
= (2r)^r \binom{r}{t_1, \ldots, t_m}.
\]
To further improve this result, the next lemma uses Stirling's approximation, valid for all $n \ge 1$:
\[
n! = \sqrt{2\pi n}\, \Big( \frac n e \Big)^n e^{\theta_n},
\qquad \text{with } \frac{1}{12n+1} < \theta_n < \frac{1}{12n}.
\]

Lemma 2. For all $r > 0$ and all $t_1, \ldots, t_m \ge 0$ with $t_1 + \cdots + t_m = r$, it holds that:
\[
\binom{2r}{2t_1, \ldots, 2t_m} \le \Big( 1 + \frac{1}{22} \Big)^r r^r \binom{r}{t_1, \ldots, t_m}.
\]

Proof. By Stirling's formula,
\[
\frac{(2r)!}{r!}
= \sqrt{2}\, \frac{(2r)^{2r}}{r^r\, e^r}\, e^{\theta_{2r} - \theta_r}
= \sqrt{2}\, \Big( \frac{4r}{e} \Big)^r e^{\theta_{2r} - \theta_r}. \tag{3}
\]
Similarly, for any $t_i \ge 1$, we can write
\[
\frac{t_i!}{(2t_i)!}
= \frac{1}{\sqrt 2}\, \Big( \frac{e}{4 t_i} \Big)^{t_i} e^{\theta_{t_i} - \theta_{2t_i}}.
\]
Using $\sum_{t_i \ge 1} t_i = r$, we obtain:
\[
\prod_{t_i \ge 1} \frac{t_i!}{(2t_i)!}
\le \frac{1}{\sqrt 2}\, \Big( \frac e4 \Big)^r \frac{1}{\prod_{t_i \ge 1} t_i^{t_i}}\, e^{\sum_{t_i \ge 1} (\theta_{t_i} - \theta_{2t_i})}. \tag{4}
\]
In view of Eqns (3) and (4), the following inequality holds:
\[
\frac{\binom{2r}{2t_1, \ldots, 2t_m}}{\binom{r}{t_1, \ldots, t_m}}
= \frac{(2r)!}{r!} \prod_{t_i \ge 1} \frac{t_i!}{(2t_i)!}
\le \frac{r^r}{\prod_{t_i \ge 1} t_i^{t_i}}\, e^{\theta_{2r} - \theta_r + \sum_{t_i \ge 1} (\theta_{t_i} - \theta_{2t_i})}
\le r^r\, e^{\theta_{2r} - \theta_r + \sum_{t_i \ge 1} (\theta_{t_i} - \theta_{2t_i})}.
\]
We now derive an upper bound on the terms appearing in the exponent. Using the inequalities imposed on the $\theta_{t_i}$ and $\theta_{2t_i}$ and the fact that the sum of the $t_i$'s is r (so that at most r of them are non-zero) leads to:
\[
\sum_{t_i \ge 1} (\theta_{t_i} - \theta_{2t_i})
\le \sum_{t_i \ge 1} \Big( \frac{1}{12 t_i} - \frac{1}{24 t_i + 1} \Big)
= \sum_{t_i \ge 1} \frac{12 t_i + 1}{12 t_i\, [24 t_i + 1]}
\le \sum_{t_i \ge 1} \frac{13 t_i}{12 t_i \cdot 25\, t_i}
= \sum_{t_i \ge 1} \frac{13}{300\, t_i}
\le \frac{13\, r}{300},
\]
and $\theta_{2r} - \theta_r \le \frac{1}{24 r} - \frac{1}{12 r + 1} \le 0$. The inequality $e^{13/300} < 1 + \frac{1}{22}$ then yields the statement of the lemma.

B. Proof of Lemma 1

Proof. Since r is an integer, we can write:
\[
\mathbb E_{\boldsymbol\sigma}\big[ (\boldsymbol\sigma^\top \mathbf K \boldsymbol\sigma)^r \big]
= \mathbb E_{\boldsymbol\sigma}\Big[ \Big( \sum_{i,j=1}^m \sigma_i \sigma_j K(x_i, x_j) \Big)^r \Big]
= \sum_{\substack{1 \le i_1, \ldots, i_r \le m \\ 1 \le j_1, \ldots, j_r \le m}} \mathbb E_{\boldsymbol\sigma}\Big[ \prod_{s=1}^r \sigma_{i_s} \sigma_{j_s} \Big] \prod_{s=1}^r K(x_{i_s}, x_{j_s})
\]
\[
\le \sum_{\substack{1 \le i_1, \ldots, i_r \le m \\ 1 \le j_1, \ldots, j_r \le m}} \mathbb E_{\boldsymbol\sigma}\Big[ \prod_{s=1}^r \sigma_{i_s} \sigma_{j_s} \Big] \prod_{s=1}^r |K(x_{i_s}, x_{j_s})|
\le \sum_{\substack{1 \le i_1, \ldots, i_r \le m \\ 1 \le j_1, \ldots, j_r \le m}} \mathbb E_{\boldsymbol\sigma}\Big[ \prod_{s=1}^r \sigma_{i_s} \sigma_{j_s} \Big] \prod_{s=1}^r \sqrt{K(x_{i_s}, x_{i_s})\, K(x_{j_s}, x_{j_s})} \quad \text{(Cauchy-Schwarz)}
\]
\[
= \sum_{\substack{s_1, \ldots, s_m \ge 0 \\ s_1 + \cdots + s_m = 2r}} \binom{2r}{s_1, \ldots, s_m}\, \mathbb E\big[ \sigma_1^{s_1} \cdots \sigma_m^{s_m} \big] \prod_{i=1}^m \sqrt{K(x_i, x_i)}^{\, s_i}.
\]
Since $\mathbb E[\sigma_i] = 0$ for all i and since the Rademacher variables are independent, we can write $\mathbb E[\sigma_{i_1} \cdots \sigma_{i_l}] = \mathbb E[\sigma_{i_1}] \cdots \mathbb E[\sigma_{i_l}] = 0$ for any l distinct variables $\sigma_{i_1}, \ldots, \sigma_{i_l}$. Thus, $\mathbb E[\sigma_1^{s_1} \cdots \sigma_m^{s_m}] = 0$ unless all the $s_i$'s are even, in which case $\mathbb E[\sigma_1^{s_1} \cdots \sigma_m^{s_m}] = 1$. It follows that:
\[
\mathbb E_{\boldsymbol\sigma}\big[ (\boldsymbol\sigma^\top \mathbf K \boldsymbol\sigma)^r \big]
\le \sum_{\substack{t_1, \ldots, t_m \ge 0 \\ 2t_1 + \cdots + 2t_m = 2r}} \binom{2r}{2t_1, \ldots, 2t_m} \prod_{i=1}^m K(x_i, x_i)^{t_i}.
\]
By Lemma 2, each multinomial coefficient $\binom{2r}{2t_1, \ldots, 2t_m}$ can be bounded by $(\eta_0 r)^r \binom{r}{t_1, \ldots, t_m}$, where $\eta_0 = \frac{23}{22}$. This gives
\[
\mathbb E_{\boldsymbol\sigma}\big[ (\boldsymbol\sigma^\top \mathbf K \boldsymbol\sigma)^r \big]
\le (\eta_0 r)^r \sum_{\substack{t_1, \ldots, t_m \ge 0 \\ t_1 + \cdots + t_m = r}} \binom{r}{t_1, \ldots, t_m} \prod_{i=1}^m K(x_i, x_i)^{t_i}
= (\eta_0 r)^r\, \big( \operatorname{Tr}[\mathbf K] \big)^r
= \big( \eta_0\, r\, \operatorname{Tr}[\mathbf K] \big)^r,
\]
which is the statement of the lemma.
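As a sanity check of Lemma 2 (illustrative code, not part of the paper), the following sketch compares $\binom{2r}{2t_1, \ldots, 2t_m}$ with $(\eta_0 r)^r \binom{r}{t_1, \ldots, t_m}$ on random compositions of r.

```python
import math
import random

def multinomial(n, parts):
    """Multinomial coefficient n! / (t_1! ... t_m!)."""
    out = math.factorial(n)
    for t in parts:
        out //= math.factorial(t)
    return out

eta0 = 23.0 / 22.0
random.seed(0)
for _ in range(5):
    r = random.randint(2, 12)
    m = random.randint(1, r)
    # random composition t_1 + ... + t_m = r with t_i >= 0
    cuts = sorted(random.choices(range(r + 1), k=m - 1))
    ts = [b - a for a, b in zip([0] + cuts, cuts + [r])]
    lhs = multinomial(2 * r, [2 * t for t in ts])
    rhs = (eta0 * r) ** r * multinomial(r, ts)
    print(f"r={r:2d} ts={ts}  lhs={lhs:.3e}  rhs={rhs:.3e}  ok={lhs <= rhs}")
```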