Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

Francis R. Bach
INRIA - Willow Project, École Normale Supérieure
45, rue d'Ulm, 75230 Paris, France
francis.bach@mines.org

Abstract

For supervised and unsupervised learning, positive definite kernels allow the use of large and potentially infinite-dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the $\ell_1$-norm or the block $\ell_1$-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to nonlinear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

1 Introduction

In the last two decades, kernel methods have been a prolific theoretical and algorithmic machine learning framework. By using appropriate regularization by Hilbertian norms, representer theorems make it possible to consider large and potentially infinite-dimensional feature spaces while working within an implicit feature space no larger than the number of observations. This has led to numerous works on kernel design adapted to specific data types and generic kernel-based algorithms for many learning tasks (see, e.g., [1, 2]). Regularization by sparsity-inducing norms, such as the $\ell_1$-norm, has also attracted a lot of interest in recent years. While early work focused on efficient algorithms to solve the convex optimization problems, recent research has looked at the model selection properties and predictive performance of such methods, in the linear case [3] or within the multiple kernel learning framework [4]. In this paper, we aim to bridge the gap between these two lines of research by trying to use $\ell_1$-norms inside the feature space. Indeed, feature spaces are large and we expect the estimated predictor function to require only a small number of features, which is exactly the situation where $\ell_1$-norms have proven advantageous. This leads to two natural questions that we try to answer in this paper: (1) Is it feasible to perform optimization in this very large feature space with a cost which is polynomial in the size of the input space? (2) Does it lead to better predictive performance and feature selection?

More precisely, we consider a positive definite kernel that can be expressed as a large sum of positive definite basis or local kernels. This exactly corresponds to the situation where a large feature space is the concatenation of smaller feature spaces, and we aim to do selection among these many kernels, which may be done through multiple kernel learning [5]. One major difficulty, however, is that the number of these smaller kernels is usually exponential in the dimension of the input space, and applying multiple kernel learning directly to this decomposition would be intractable. In order to perform selection efficiently, we make the extra assumption that these small kernels can be embedded in a directed acyclic graph (DAG).
Following [6], we consider in Section 2 a specific combination of $\ell_2$-norms that is adapted to the DAG and that restricts the authorized sparsity patterns; in our specific kernel framework, we are able to use the DAG to design an optimization algorithm which has polynomial complexity in the number of selected kernels (Section 3). In simulations (Section 5), we focus on directed grids, where our framework allows us to perform nonlinear variable selection. We provide extensive experimental validation of our novel regularization framework; in particular, we compare it to the regular $\ell_2$-regularization and show that it is always competitive and often leads to better performance, both on synthetic examples and on standard regression and classification datasets from the UCI repository.

Finally, we extend in Section 4 some of the known consistency results of the Lasso and multiple kernel learning [3, 4], and give a partial answer to the model selection capabilities of our regularization framework by giving necessary and sufficient conditions for model consistency. In particular, we show that our framework is adapted to estimating consistently only the hull of the relevant variables. Hence, by restricting the statistical power of our method, we gain computational efficiency.

2 Hierarchical multiple kernel learning (HKL)

We consider the problem of predicting a random variable $Y \in \mathcal{Y} \subset \mathbb{R}$ from a random variable $X \in \mathcal{X}$, where $\mathcal{X}$ and $\mathcal{Y}$ may be quite general spaces. We assume that we are given $n$ i.i.d. observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$. We define the empirical risk of a function $f$ from $\mathcal{X}$ to $\mathbb{R}$ as $\frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i))$, where $\ell : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$ is a loss function. We only assume that $\ell$ is convex with respect to the second parameter (but not necessarily differentiable). Typical examples of loss functions are the square loss for regression, i.e., $\ell(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$ for $\hat{y} \in \mathbb{R}$, and the logistic loss $\ell(y, \hat{y}) = \log(1 + e^{-y\hat{y}})$ or the hinge loss $\ell(y, \hat{y}) = \max\{0, 1 - y\hat{y}\}$ for binary classification, where $y \in \{-1, 1\}$, leading respectively to logistic regression and support vector machines.

2.1 Graph-structured positive definite kernels

We assume that we are given a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, and that this kernel can be expressed as the sum, over an index set $V$, of basis kernels $k_v$, $v \in V$, i.e., for all $x, x' \in \mathcal{X}$, $k(x, x') = \sum_{v \in V} k_v(x, x')$. For each $v \in V$, we denote by $\mathcal{F}_v$ and $\Phi_v$ the feature space and feature map of $k_v$, i.e., for all $x, x' \in \mathcal{X}$, $k_v(x, x') = \langle \Phi_v(x), \Phi_v(x') \rangle$. Throughout the paper, we denote by $\|u\|$ the Hilbertian norm of $u$ and by $\langle u, v \rangle$ the associated dot product, where the precise space is omitted and can always be inferred from the context.

Our sum assumption corresponds to a situation where the feature map $\Phi(x)$ and feature space $\mathcal{F}$ for $k$ are the concatenation of the feature maps $\Phi_v(x)$ for each kernel $k_v$, i.e., $\mathcal{F} = \prod_{v \in V} \mathcal{F}_v$ and $\Phi(x) = (\Phi_v(x))_{v \in V}$. Thus, looking for a certain $\beta \in \mathcal{F}$ and a predictor function $f(x) = \langle \beta, \Phi(x) \rangle$ is equivalent to looking jointly for $\beta_v \in \mathcal{F}_v$, for all $v \in V$, and $f(x) = \sum_{v \in V} \langle \beta_v, \Phi_v(x) \rangle$.

As mentioned earlier, we make the assumption that the set $V$ can be embedded into a directed acyclic graph. Directed acyclic graphs (referred to as DAGs) allow us to naturally define the notions of parents, children, descendants and ancestors. Given a node $w \in V$, we denote by $A(w) \subset V$ the set of its ancestors, and by $D(w) \subset V$ the set of its descendants. We use the convention that any $w$ is a descendant and an ancestor of itself, i.e., $w \in A(w)$ and $w \in D(w)$.
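To make these graph notions concrete, here is a minimal Python sketch (ours, not part of the paper's implementation) that stores a toy DAG as a child list and computes ancestor and descendant sets with the convention $w \in A(w)$ and $w \in D(w)$; the node names and dictionary representation are purely illustrative choices.

```python
# Toy DAG given as a dictionary mapping each node to its list of children.
# Node names are purely illustrative.
children = {
    "root": ["a", "b"],
    "a": ["ab"],
    "b": ["ab"],
    "ab": [],
}
parents = {v: [u for u, cs in children.items() if v in cs] for v in children}

def descendants(v):
    """D(v): v together with every node reachable from v (so v is in D(v))."""
    seen, stack = set(), [v]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(children[u])
    return seen

def ancestors(v):
    """A(v): v together with every node from which v is reachable (so v is in A(v))."""
    seen, stack = set(), [v]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(parents[u])
    return seen

print(sorted(descendants("a")))  # ['a', 'ab']
print(sorted(ancestors("ab")))   # ['a', 'ab', 'b', 'root']
```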
Moreover, for $W \subset V$, we let $\mathrm{sources}(W)$ denote the set of sources of the graph $G$ restricted to $W$ (i.e., nodes in $W$ with no parents belonging to $W$). Given a subset of nodes $W \subset V$, we define the hull of $W$ as the union of all ancestors of $w \in W$, i.e., $\mathrm{hull}(W) = \bigcup_{w \in W} A(w)$. Given a set $W$, we define the set of extreme points of $W$ as the smallest subset $T \subset W$ such that $\mathrm{hull}(T) = \mathrm{hull}(W)$ (note that it is always well defined, as $\bigcap_{T \subset V,\ \mathrm{hull}(T) = \mathrm{hull}(W)} T$). See Figure 1 for examples of these notions.

The goal of this paper is to perform kernel selection among the kernels $k_v$, $v \in V$. We essentially use the graph to limit the search to specific subsets of $V$. Namely, instead of considering all possible subsets of active (relevant) vertices, we are only interested in estimating correctly the hull of these relevant vertices; in Section 2.2, we design a specific sparsity-inducing norm adapted to hulls.

Figure 1: Example of graph and associated notions. (Left) Example of a 2D-grid. (Middle) Example of sparsity pattern ($\times$ in light blue) and the complement of its hull ($+$ in light red). (Right) Dark blue points ($\times$) are extreme points of the set of all active points (blue $\times$); dark red points ($+$) are the sources of the set of all red points ($+$).

In this paper, we primarily focus on kernels that can be expressed as "products of sums", and on the associated $p$-dimensional directed grids, while noting that our framework is applicable to many other kernels. Namely, we assume that the input space $\mathcal{X}$ factorizes into $p$ components $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_p$, and that we are given $p$ sequences of length $q+1$ of kernels $k_{ij}(x_i, x_i')$, $i \in \{1, \dots, p\}$, $j \in \{0, \dots, q\}$, such that
$$k(x, x') = \sum_{j_1, \dots, j_p = 0}^{q} \ \prod_{i=1}^{p} k_{i j_i}(x_i, x_i') = \prod_{i=1}^{p} \Big( \sum_{j_i = 0}^{q} k_{i j_i}(x_i, x_i') \Big).$$
We thus have a sum of $(q+1)^p$ kernels that can be computed efficiently as a product of $p$ sums. A natural DAG on $V = \prod_{i=1}^{p} \{0, \dots, q\}$ is defined by connecting each $(j_1, \dots, j_p)$ to $(j_1 + 1, j_2, \dots, j_p), \dots, (j_1, \dots, j_{p-1}, j_p + 1)$. As shown in Section 2.2, this DAG will correspond to the constraint of selecting a given product of kernels only after all the subproducts are selected. Those DAGs are especially suited to nonlinear variable selection, in particular with the polynomial and Gaussian kernels. In this context, products of kernels correspond to interactions between certain variables, and our DAG implies that we select an interaction only after all sub-interactions were already selected.

Polynomial kernels. We consider $\mathcal{X}_i = \mathbb{R}$ and $k_{ij}(x_i, x_i') = \binom{q}{j}(x_i x_i')^j$; the full kernel is then equal to $k(x, x') = \prod_{i=1}^{p} \sum_{j=0}^{q} \binom{q}{j} (x_i x_i')^j = \prod_{i=1}^{p} (1 + x_i x_i')^q$. Note that this is not exactly the usual polynomial kernel (whose feature space is the space of multivariate polynomials of total degree less than $q$), since our kernel considers polynomials of maximal degree $q$.

Gaussian kernels. We also consider $\mathcal{X}_i = \mathbb{R}$ and the Gaussian-RBF kernel $e^{-b(x - x')^2}$. The following decomposition is the eigendecomposition of the non-centered covariance operator for a normal distribution with variance $1/4a$ (see, e.g., [7]):
$$e^{-b(x - x')^2} = \sum_{k=0}^{\infty} \frac{(b/A)^k}{2^k k!} \Big[e^{-\frac{b}{A}(a+c)x^2} H_k(\sqrt{2c}\, x)\Big]\Big[e^{-\frac{b}{A}(a+c)(x')^2} H_k(\sqrt{2c}\, x')\Big],$$
where $c^2 = a^2 + 2ab$, $A = a + b + c$, and $H_k$ is the $k$-th Hermite polynomial.
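As a small sanity check on the "product of sums" construction (our own illustration, not code from the paper), the sketch below verifies, for the polynomial decomposition $k_{ij}(x_i, x_i') = \binom{q}{j}(x_i x_i')^j$, that summing all $(q+1)^p$ basis kernels equals the product $\prod_i (1 + x_i x_i')^q$, which costs only $O(p)$ per pair of points; the brute-force enumeration is only feasible for tiny $p$ and $q$.

```python
import itertools
import math

import numpy as np

def basis_kernel(x, xp, multi_index, q):
    """k_{j1,...,jp}(x, x') = prod_i C(q, j_i) * (x_i x'_i)^{j_i}."""
    return np.prod([math.comb(q, j) * (xi * xpi) ** j
                    for j, xi, xpi in zip(multi_index, x, xp)])

def full_kernel(x, xp, q):
    """The same kernel computed as a product of p sums: prod_i (1 + x_i x'_i)^q."""
    return np.prod([(1.0 + xi * xpi) ** q for xi, xpi in zip(x, xp)])

rng = np.random.default_rng(0)
p, q = 3, 2
x, xp = rng.standard_normal(p), rng.standard_normal(p)

# Brute-force sum over all (q+1)^p basis kernels.
brute = sum(basis_kernel(x, xp, jj, q)
            for jj in itertools.product(range(q + 1), repeat=p))
print(np.isclose(brute, full_kernel(x, xp, q)))  # True
```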
By appropriately truncating the sum, i.e., by considering that the first $q$ basis kernels are obtained from the first $q$ single Hermite polynomials, and that the $(q+1)$-th kernel sums over all remaining ones, we obtain a decomposition of a uni-dimensional Gaussian kernel into $q+1$ components (the first $q$ are one-dimensional, the last one is infinite-dimensional, but can be computed by differencing). The decomposition ends up being close to a polynomial kernel of infinite degree, modulated by an exponential [2]. One may also use an adaptive decomposition using kernel PCA (see, e.g., [2, 1]), which is equivalent to using the eigenvectors of the empirical covariance operator associated with the data (and not the population one associated with the Gaussian distribution with the same variance). In simulations, we tried both with no significant differences. Finally, by taking the product over all variables, we obtain a decomposition of the $p$-dimensional Gaussian kernel into $(q+1)^p$ components, which are adapted to nonlinear variable selection. Note that for $q = 1$, we obtain ANOVA-like decompositions [2].

Kernels or features? In this paper, we emphasize the kernel view, i.e., we are given a kernel (and thus a feature space) and we explore it using $\ell_1$-norms. Alternatively, we could use the feature view, i.e., we have a large structured set of features that we try to select from; however, the techniques developed in this paper assume that (a) each feature might be infinite-dimensional and (b) we can sum all the local kernels efficiently (see in particular Section 3.2). Following the kernel view thus seems slightly more natural.

2.2 Graph-based structured regularization

Given $\beta \in \prod_{v \in V} \mathcal{F}_v$, the natural Hilbertian norm $\|\beta\|$ is defined through $\|\beta\|^2 = \sum_{v \in V} \|\beta_v\|^2$. Penalizing with this norm is efficient because summing all kernels $k_v$ is assumed feasible in polynomial time and we can bring to bear the usual kernel machinery; however, it does not lead to sparse solutions, where many $\beta_v$ are exactly equal to zero.

As said earlier, we are only interested in the hull of the selected elements $\beta_v \in \mathcal{F}_v$, $v \in V$; the hull of a set $I$ is characterized by the set of $v$ such that $D(v) \subset I^c$, i.e., such that all descendants of $v$ are in the complement $I^c$: $\mathrm{hull}(I) = \{v \in V,\ D(v) \subset I^c\}^c$. Thus, if we try to estimate $\mathrm{hull}(I)$, we need to determine which $v \in V$ are such that $D(v) \subset I^c$. In our context, we are hence looking at selecting the vertices $v \in V$ for which $\beta_{D(v)} = (\beta_w)_{w \in D(v)} = 0$. We thus consider the following structured block $\ell_1$-norm,
$$\Omega(\beta) = \sum_{v \in V} d_v \|\beta_{D(v)}\| = \sum_{v \in V} d_v \Big(\sum_{w \in D(v)} \|\beta_w\|^2\Big)^{1/2},$$
where $(d_v)_{v \in V}$ are positive weights. Penalizing by such a norm will indeed impose that some of the vectors $\beta_{D(v)} \in \prod_{w \in D(v)} \mathcal{F}_w$ are exactly zero. We thus consider the following minimization problem:
$$\min_{\beta \in \prod_{v \in V} \mathcal{F}_v}\ \frac{1}{n} \sum_{i=1}^{n} \ell\Big(y_i, \sum_{v \in V} \langle \beta_v, \Phi_v(x_i) \rangle\Big) + \frac{\lambda}{2} \Big(\sum_{v \in V} d_v \|\beta_{D(v)}\|\Big)^2. \qquad (1)$$

Our norm $\Omega$ is a Hilbert space instantiation of the hierarchical norms recently introduced by [6]. If all Hilbert spaces are finite-dimensional, our particular choice of norms corresponds to an "$\ell_1$-norm of $\ell_2$-norms". While with uni-dimensional groups/kernels the "$\ell_1$-norm of $\ell_\infty$-norms" allows an efficient path algorithm for the square loss when the DAG is a tree [6], this is not possible anymore with groups of size larger than one, or when the DAG is not a tree. In Section 3, we propose a novel algorithm to solve the associated optimization problem in time polynomial in the number of selected groups/kernels, for all group sizes, DAGs and losses.
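To illustrate the norm $\Omega(\beta) = \sum_{v \in V} d_v \|\beta_{D(v)}\|$ appearing in Eq. (1), here is a small sketch on the toy DAG used above, with finite-dimensional blocks and made-up weights $d_v$ (all values are illustrative): making a deep node nonzero is penalized through every one of its ancestors, which is what pushes the selected pattern towards a hull.

```python
import numpy as np

# D[v] lists the descendants of v, including v itself (same toy DAG as above).
D = {"root": {"root", "a", "b", "ab"}, "a": {"a", "ab"},
     "b": {"b", "ab"}, "ab": {"ab"}}
d = {"root": 1.0, "a": 1.0, "b": 1.0, "ab": 2.0}          # illustrative weights
beta = {"root": np.zeros(2), "a": np.zeros(2),
        "b": np.zeros(2), "ab": np.array([1.0, -1.0])}    # nonzero only at the leaf

def omega(beta):
    """Structured block l1-norm of Eq. (1): sum_v d_v * || (beta_w)_{w in D(v)} ||."""
    return sum(d[v] * np.linalg.norm(np.concatenate([beta[w] for w in sorted(D[v])]))
               for v in D)

# Every term is nonzero because "ab" belongs to D(v) for all four nodes, so the
# leaf pays (d_root + d_a + d_b + d_ab) * ||beta_ab||; conversely, setting
# beta_{D(v)} = 0 zeroes a whole descendant "cone", hence hull-shaped patterns.
print(omega(beta))  # (1 + 1 + 1 + 2) * sqrt(2), roughly 7.07
```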
Moreover, in Section 4, we show under which conditions a solution to the problem in Eq. (1) consistently estimates the hull of the sparsity pattern. Finally, note that in certain settings (finite-dimensional Hilbert spaces and distributions with absolutely continuous densities), these norms have the effect of selecting a given kernel only after all of its ancestors [6]. This is another explanation why hulls end up being selected, since to include a given vertex in the model, the entire set of its ancestors must also be selected.

3 Optimization problem

In this section, we give optimality conditions for the problem in Eq. (1), as well as optimization algorithms with polynomial time complexity in the number of selected kernels. In simulations we consider total numbers of kernels larger than $10^{30}$, and thus such efficient algorithms are essential to the success of hierarchical multiple kernel learning (HKL).

3.1 Reformulation in terms of multiple kernel learning

Following [8, 9], we can simply derive an equivalent formulation of Eq. (1). Note that, following [5], we consider the square of the norm, which does not change the regularization properties but allows simple links with multiple kernel learning. Using the Cauchy-Schwarz inequality, we have that for all $\eta \in \mathbb{R}^V$ such that $\eta \geq 0$ and $\sum_{v \in V} d_v^2 \eta_v \leq 1$,
$$\Big(\sum_{v \in V} d_v \|\beta_{D(v)}\|\Big)^2 \ \leq\ \sum_{v \in V} \eta_v^{-1} \|\beta_{D(v)}\|^2 \ =\ \sum_{w \in V} \Big(\sum_{v \in A(w)} \eta_v^{-1}\Big) \|\beta_w\|^2,$$
with equality if and only if $\eta_v = d_v^{-1} \|\beta_{D(v)}\| \big(\sum_{v' \in V} d_{v'} \|\beta_{D(v')}\|\big)^{-1}$.

We associate to the vector $\eta \in \mathbb{R}^V$ the vector $\zeta \in \mathbb{R}^V$ such that $\forall w \in V$, $\zeta_w^{-1} = \sum_{v \in A(w)} \eta_v^{-1}$. We use the natural convention that if $\eta_v$ is equal to zero, then $\zeta_w$ is equal to zero for all descendants $w$ of $v$. We let $H$ denote the set of allowed $\eta$ and $Z$ the set of all associated $\zeta$. The sets $H$ and $Z$ are in bijection, and we can interchangeably use $\eta \in H$ or the corresponding $\zeta(\eta) \in Z$. Note that $Z$ is in general not convex (unless the DAG is a tree, see the technical Appendix), and that if $\zeta \in Z$, then $\zeta_w \leq \zeta_v$ for all $w \in D(v)$, i.e., weights of descendant kernels are smaller, which is consistent with the known fact that kernels should always be selected after all their ancestors.

The problem in Eq. (1) is thus equivalent to
$$\min_{\eta \in H}\ \min_{\beta \in \prod_{v \in V} \mathcal{F}_v}\ \frac{1}{n} \sum_{i=1}^{n} \ell\Big(y_i, \sum_{v \in V} \langle \beta_v, \Phi_v(x_i) \rangle\Big) + \frac{\lambda}{2} \sum_{w \in V} \zeta_w(\eta)^{-1} \|\beta_w\|^2. \qquad (2)$$
Using the change of variable $\tilde{\beta}_v = \beta_v \zeta_v^{-1/2}$ and $\tilde{\Phi}(x) = (\zeta_v^{1/2} \Phi_v(x))_{v \in V}$, this implies that, given the optimal $\eta$ (and associated $\zeta$), $\beta$ corresponds to the solution of the regular supervised learning problem with kernel matrix $K = \sum_{w \in V} \zeta_w K_w$, where $K_w$ is the $n \times n$ kernel matrix associated with kernel $k_w$. Moreover, the solution is then $\beta_w = \zeta_w \sum_{i=1}^{n} \alpha_i \Phi_w(x_i)$, where $\alpha \in \mathbb{R}^n$ are the dual parameters associated with the single kernel learning problem. Thus, the solution is entirely determined by $\alpha \in \mathbb{R}^n$ and $\eta \in \mathbb{R}^V$ (and its corresponding $\zeta \in \mathbb{R}^V$). More precisely, we have (see proof in the technical Appendix):

Proposition 1 The pair $(\alpha, \eta)$ is optimal for Eq. (1), with $\forall w$, $\beta_w = \zeta_w \sum_{i=1}^{n} \alpha_i \Phi_w(x_i)$, if and only if (a) given $\eta$, $\alpha$ is optimal for the single kernel learning problem with kernel matrix $K = \sum_{w \in V} \zeta_w(\eta) K_w$, and (b) given $\alpha$, $\eta \in H$ maximizes $\sum_{w \in V} \big(\sum_{v \in A(w)} \eta_v^{-1}\big)^{-1} \alpha^\top K_w \alpha$.

Moreover, the total duality gap can be upper-bounded as the sum of the two separate duality gaps for the two optimization problems, which will be useful in Section 3.2 (see the technical Appendix for more details). Note that in the case of "flat" regular multiple kernel learning, where the DAG has no edges, we obtain back the usual optimality conditions [8, 9].
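The following sketch (ours; the DAG, weights and kernel matrices are stand-ins, not the paper's code) makes the reparameterization behind Proposition 1 concrete: given a feasible $\eta \in H$, we form $\zeta_w(\eta)^{-1} = \sum_{v \in A(w)} \eta_v^{-1}$ and the effective kernel matrix $\sum_w \zeta_w(\eta) K_w$, which can then be passed to any standard single-kernel solver.

```python
import numpy as np

# A[w] lists the ancestors of w, including w itself (same toy DAG as above).
A = {"root": {"root"}, "a": {"a", "root"}, "b": {"b", "root"},
     "ab": {"ab", "a", "b", "root"}}
d = {"root": 1.0, "a": 1.0, "b": 1.0, "ab": 2.0}

rng = np.random.default_rng(0)
n = 5
# Stand-in basis kernel matrices K_w (arbitrary PSD matrices, for illustration only).
K = {w: (lambda G: G @ G.T)(rng.standard_normal((n, n))) for w in A}

def zeta(eta):
    """zeta_w(eta)^{-1} = sum_{v in A(w)} eta_v^{-1}; zero if any ancestor has eta_v = 0."""
    return {w: 0.0 if any(eta[v] == 0.0 for v in anc)
            else 1.0 / sum(1.0 / eta[v] for v in anc)
            for w, anc in A.items()}

# A feasible eta must satisfy eta >= 0 and sum_v d_v^2 eta_v <= 1.
eta = {"root": 0.5, "a": 0.2, "b": 0.2, "ab": 0.025}
assert sum(d[v] ** 2 * eta[v] for v in d) <= 1.0 + 1e-12
z = zeta(eta)
K_eff = sum(z[w] * K[w] for w in A)   # kernel matrix handed to the inner solver
print({w: round(z[w], 3) for w in A}, K_eff.shape)
```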
Following a common practice for convex sparsity problems [10], we will try to solve a small problem where we assume we know the set of $v$ such that $\|\beta_{D(v)}\|$ is equal to zero (Section 3.3). We then "simply" need to check that the variables in that set may indeed be left out of the solution. In the next section, we show that this can be done in polynomial time, although the number of kernels to consider leaving out is exponential (Section 3.2).

3.2 Conditions for global optimality of reduced problem

We let $J$ denote the complement of the set of norms which are set to zero. We thus consider the optimal solution $\beta$ of the reduced problem (on $J$), namely,
$$\min_{\beta_J \in \prod_{v \in J} \mathcal{F}_v}\ \frac{1}{n} \sum_{i=1}^{n} \ell\Big(y_i, \sum_{v \in J} \langle \beta_v, \Phi_v(x_i) \rangle\Big) + \frac{\lambda}{2} \Big(\sum_{v \in J} d_v \|\beta_{D(v) \cap J}\|\Big)^2, \qquad (3)$$
with optimal primal variables $\beta_J$, dual variables $\alpha$ and optimal pair $(\eta_J, \zeta_J)$. We now consider necessary conditions and sufficient conditions for this solution (augmented with zeros for non-active variables, i.e., variables in $J^c$) to be optimal with respect to the full problem in Eq. (1). We denote by $\delta = \sum_{v \in J} d_v \|\beta_{D(v) \cap J}\|$ the optimal value of the norm for the reduced problem.

Proposition 2 ($N_J$) If the reduced solution is optimal for the full problem in Eq. (1) and all kernels in the extreme points of $J$ are active, then we have
$$\max_{t \in \mathrm{sources}(J^c)} \ \alpha^\top K_t \alpha / d_t^2 \ \leq\ \delta^2.$$

Proposition 3 ($S_{J,\varepsilon}$) If
$$\max_{t \in \mathrm{sources}(J^c)} \ \sum_{w \in D(t)} \frac{\alpha^\top K_w \alpha}{\big(\sum_{v \in A(w) \cap D(t)} d_v\big)^2} \ \leq\ \delta^2 + \varepsilon/\lambda,$$
then the total duality gap is less than $\varepsilon$.

The proof is fairly technical and can be found in the technical Appendix; this result constitutes the main technical contribution of the paper: it essentially allows us to solve a very large optimization problem over exponentially many dimensions in polynomial time.

The necessary condition ($N_J$) does not cause any computational problems. However, the sufficient condition ($S_{J,\varepsilon}$) requires summing over all descendants of the active kernels, which is impossible in practice (as shown in Section 5, we consider sets $V$ of cardinality often greater than $10^{30}$). Here, we need to bring to bear the specific structure of the kernel $k$. In the context of the directed grids we consider in this paper, if $d_v$ can also be decomposed as a product, then $\sum_{v \in A(w) \cap D(t)} d_v$ is also factorized, and we can compute the sum over all $w \in D(t)$ in linear time in $p$. Moreover, we can cache the sums $\sum_{w \in D(t)} K_w / \big(\sum_{v \in A(w) \cap D(t)} d_v\big)^2$ in order to save running time.

3.3 Dual optimization for reduced or small problems

When the kernels $k_v$, $v \in V$, have low-dimensional feature spaces, we may use a primal representation and solve the problem in Eq. (1) using generic optimization toolboxes adapted to conic constraints (see, e.g., [11]). However, in order to reuse existing optimized supervised learning code and use high-dimensional kernels, it is preferable to use a dual optimization. Namely, we use the same technique as [8]: we consider, for $\zeta \in Z$, the function
$$B(\zeta) = \min_{\beta \in \prod_{v \in V} \mathcal{F}_v}\ \frac{1}{n} \sum_{i=1}^{n} \ell\Big(y_i, \sum_{v \in V} \langle \beta_v, \Phi_v(x_i) \rangle\Big) + \frac{\lambda}{2} \sum_{w \in V} \zeta_w^{-1} \|\beta_w\|^2,$$
which is the optimal value of the single kernel learning problem with kernel matrix $\sum_{w \in V} \zeta_w K_w$. Solving Eq. (2) is equivalent to minimizing $B(\zeta(\eta))$ with respect to $\eta \in H$.

If a ridge (i.e., a positive diagonal) is added to the kernel matrices, the function $B$ is differentiable. Moreover, the function $\zeta(\eta)$ is differentiable on $(\mathbb{R}_+^*)^V$. Thus, the function $\eta \mapsto B\big[\zeta\big((1-\varepsilon)\eta + \varepsilon |V|^{-1} d^{-2}\big)\big]$, where $d^{-2}$ is the vector with elements $d_v^{-2}$, is differentiable if $\varepsilon > 0$. We can then use the same projected gradient descent strategy as [8] to minimize it.
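To spell out what the optimality checks of Section 3.2 evaluate, here is a schematic, brute-force sketch (ours; all argument names are assumptions) that tests the necessary condition ($N_J$) and the sufficient condition ($S_{J,\varepsilon}$) from the dual variable $\alpha$ of the reduced problem. On directed grids the inner sum over $D(t)$ would instead be factorized across the $p$ dimensions and cached, as explained above; enumerating $D(t)$ explicitly is only reasonable on small DAGs.

```python
def check_optimality(alpha, K, d, A, D, J, delta, lam, eps):
    """Brute-force evaluation of (N_J) and (S_J,eps) over sources(J^c).

    alpha : dual variables of the reduced problem, shape (n,)
    K     : dict of n x n basis kernel matrices K_w (numpy arrays)
    d     : dict of weights d_v
    A, D  : dicts of ancestor / descendant sets (each node contained in its own set)
    J     : current active set, assumed ancestor-closed (a hull)
    delta : optimal norm value sum_{v in J} d_v ||beta_{D(v) cap J}|| of Eq. (3)
    """
    Jc = set(K) - set(J)
    # sources(J^c): nodes of J^c all of whose strict ancestors lie in J.
    sources = [t for t in Jc if (A[t] - {t}) <= set(J)]

    quad = {w: float(alpha @ K[w] @ alpha) for w in K}   # alpha^T K_w alpha, cached

    necessary = all(quad[t] / d[t] ** 2 <= delta ** 2 for t in sources)
    sufficient = all(
        sum(quad[w] / sum(d[v] for v in A[w] & D[t]) ** 2 for w in D[t])
        <= delta ** 2 + eps / lam
        for t in sources)
    return necessary, sufficient
```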
The overall complexity of the algorithm is then proportional to $O(|V| n^2)$ (to form the kernel matrices), plus the complexity of solving a single kernel learning problem (typically between $O(n^2)$ and $O(n^3)$).

3.4 Kernel search algorithm

We are now ready to present the detailed algorithm, which extends the feature search algorithm of [10]. Note that the kernel matrices are never all needed explicitly, i.e., we only need them (a) explicitly to solve the small problems (but we need only a few of those) and (b) implicitly to compute the sufficient condition ($S_{J,\varepsilon}$), which requires summing over all kernels, as shown in Section 3.2.

- Input: kernel matrices $K_v \in \mathbb{R}^{n \times n}$, $v \in V$, maximal gap $\varepsilon$, maximal number of kernels $Q$
- Algorithm:
  1. Initialization: set $J = \mathrm{sources}(V)$, compute $(\alpha, \eta)$ solutions of Eq. (3), obtained using Section 3.3
  2. While ($N_J$) and ($S_{J,\varepsilon}$) are not satisfied and $\#(J) \leq Q$:
     - If ($N_J$) is not satisfied, add the variables in $\mathrm{sources}(J^c)$ violating ($N_J$) to $J$; else, add the variables in $\mathrm{sources}(J^c)$ violating ($S_{J,\varepsilon}$) to $J$
     - Recompute $(\alpha, \eta)$ optimal solutions of Eq. (3)
- Output: $J$, $\alpha$, $\eta$

The previous algorithm stops either when the duality gap is less than $\varepsilon$ or when the maximal number of kernels $Q$ has been reached. In practice, when the weights $d_v$ increase with the depth of $v$ in the DAG (which we use in simulations), the small duality gap generally occurs before we reach a problem larger than $Q$. Note that some of the iterations only increase the size of the active sets in order to check the sufficient condition for optimality; forgetting those does not change the solution, only the fact that we may actually know that we have an $\varepsilon$-optimal solution.

In the directed $p$-grid case, the total running time complexity is a function of the number of observations $n$ and the number $R$ of selected kernels; with proper caching, we obtain the following complexity, assuming $O(n^3)$ for the single kernel learning problem, which is conservative: $O(n^3 R + n^2 R p^2 + n^2 R^2 p)$, which decomposes into solving $O(R)$ single kernel learning problems, caching $O(Rp)$ kernels, and computing $O(R^2 p)$ quadratic forms for the sufficient conditions.

4 Consistency conditions

As said earlier, the sparsity pattern of the solution of Eq. (1) will be equal to its hull, and thus we can only hope to obtain consistency of the hull of the pattern, which we consider in this section. For simplicity, we consider the case of finite-dimensional Hilbert spaces (i.e., $\mathcal{F}_v = \mathbb{R}^{f_v}$) and the square loss. We also hold the vertex set $V$ fixed, i.e., we assume that the total number of features is fixed, and we let $n$ tend to infinity and $\lambda = \lambda_n$ decrease with $n$.

Following [4], we make the following assumptions on the underlying joint distribution of $(X, Y)$: (a) the joint covariance matrix $\Sigma$ of $(\Phi_v(x))_{v \in V}$ (defined with appropriate blocks of size $f_v \times f_w$) is invertible, (b) $\mathbb{E}(Y|X) = \sum_{w \in W} \langle \beta_w, \Phi_w(x) \rangle$ with $W \subset V$ and $\mathrm{var}(Y|X) = \sigma^2 > 0$ almost surely. With these simple assumptions, we obtain (see proof in the technical Appendix):

Proposition 4 (Sufficient condition) If
$$\max_{t \in \mathrm{sources}(W^c)} \ \sum_{w \in D(t)} \frac{\big\| \Sigma_{wW} \Sigma_{WW}^{-1} \mathrm{Diag}\big(d_v \|\beta_{D(v)}\|^{-1}\big)_{v \in W} \beta_W \big\|^2}{\big(\sum_{v \in A(w) \cap D(t)} d_v\big)^2} < 1,$$
then $\beta$ and the hull of $W$ are consistently estimated when $\lambda_n n^{1/2} \to \infty$ and $\lambda_n \to 0$.

Proposition 5 (Necessary condition) If $\beta$ and the hull of $W$ are consistently estimated for some sequence $\lambda_n$, then
$$\max_{t \in \mathrm{sources}(W^c)} \ \big\| \Sigma_{tW} \Sigma_{WW}^{-1} \mathrm{Diag}\big(d_v / \|\beta_{D(v)}\|\big)_{v \in W} \beta_W \big\|^2 / d_t^2 \ \leq\ 1.$$

Note that the last two propositions are not consequences of the similar results for flat MKL [4], because the groups that we consider are overlapping.
Moreover, the last propositions show that we can indeed estimate the correct hull of the sparsity pattern if the sufficient condition is satisfied. In particular, if we can make the groups such that the between-group correlation is as small as possible, we can ensure correct hull selection. Finally, it is worth noting that if the ratios $d_w / \max_{v \in A(w)} d_v$ tend to infinity slowly with $n$, then we always consistently estimate the depth of the hull, i.e., the optimal interaction complexity. We are currently investigating extensions to the non-parametric case [4], in terms of pattern selection and universal consistency.

Figure 2: Comparison on synthetic examples: mean squared error (test set error as a function of $\log_2(p)$, for HKL, greedy and L2) over 40 replications (with halved standard deviations). Left: non-rotated data, right: rotated data. See text for details.

dataset        n     p   k     #(V)    L2         greedy       lasso-α      MKL         HKL
abalone        4177  10  pol4  10^7    44.2±1.3   43.9±1.4     47.9±0.7     44.5±1.1    43.3±1.0
abalone        4177  10  rbf   10^10   43.0±0.9   45.0±1.7     49.0±1.7     43.7±1.0    43.0±1.1
bank-32fh      8192  32  pol4  10^22   40.1±0.7   39.2±0.8     41.3±0.7     38.7±0.7    38.9±0.7
bank-32fh      8192  32  rbf   10^31   39.0±0.7   39.7±0.7     66.1±6.9     38.4±0.7    38.4±0.7
bank-32fm      8192  32  pol4  10^22   6.0±0.1    5.0±0.2      7.0±0.2      6.1±0.3     5.1±0.1
bank-32fm      8192  32  rbf   10^31   5.7±0.2    5.8±0.4      36.3±4.1     5.9±0.2     4.6±0.2
bank-32nh      8192  32  pol4  10^22   44.3±1.2   46.3±1.4     45.8±0.8     46.0±1.2    43.6±1.1
bank-32nh      8192  32  rbf   10^31   44.3±1.2   49.4±1.6     93.0±2.8     46.1±1.1    43.5±1.0
bank-32nm      8192  32  pol4  10^22   17.2±0.6   18.2±0.8     19.5±0.4     21.0±0.7    16.8±0.6
bank-32nm      8192  32  rbf   10^31   16.9±0.6   21.0±0.6     62.3±2.5     20.9±0.7    16.4±0.6
boston         506   13  pol4  10^9    17.1±3.6   24.7±10.8    29.3±2.3     22.2±2.2    18.1±3.8
boston         506   13  rbf   10^12   16.4±4.0   32.4±8.2     29.4±1.6     20.7±2.1    17.1±4.7
pumadyn-32fh   8192  32  pol4  10^22   57.3±0.7   56.4±0.8     57.5±0.4     56.4±0.7    56.4±0.8
pumadyn-32fh   8192  32  rbf   10^31   57.7±0.6   72.2±22.5    89.3±2.0     56.5±0.8    55.7±0.7
pumadyn-32fm   8192  32  pol4  10^22   6.9±0.1    6.4±1.6      7.5±0.2      7.0±0.1     3.1±0.0
pumadyn-32fm   8192  32  rbf   10^31   5.0±0.1    46.2±51.6    44.7±5.7     7.1±0.1     3.4±0.0
pumadyn-32nh   8192  32  pol4  10^22   84.2±1.3   73.3±25.4    84.8±0.5     83.6±1.3    36.7±0.4
pumadyn-32nh   8192  32  rbf   10^31   56.5±1.1   81.3±25.0    98.1±0.7     83.7±1.3    35.5±0.5
pumadyn-32nm   8192  32  pol4  10^22   60.1±1.9   69.9±32.8    78.5±1.1     77.5±0.9    5.5±0.1
pumadyn-32nm   8192  32  rbf   10^31   15.7±0.4   67.3±42.4    95.9±1.9     77.6±0.9    7.2±0.1

Table 1: Mean squared errors (multiplied by 100) on UCI regression datasets, normalized so that the total variance to explain is 100. See text for details.

5 Simulations

Synthetic examples. We generated regression data as follows: $n = 1024$ samples of $p \in \{2^2, \dots, 2^7\}$ variables were generated from a random covariance matrix, and the label $y \in \mathbb{R}$ was sampled as a random sparse fourth-order polynomial of the input variables (with a constant number of monomials). We then compare the performance of our hierarchical multiple kernel learning method (HKL), with the polynomial kernel decomposition presented in Section 2, to other methods that use the same kernel and/or decomposition: (a) the greedy strategy of selecting basis kernels one after the other, a procedure similar to [12], and (b) the regular polynomial kernel regularization with the full kernel (i.e., the sum of all basis kernels).
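A minimal sketch of this kind of data-generating process (the specific constants below, such as the covariance scale, the number of monomials and the noise level, are our own illustrative choices and not taken from the paper):

```python
import numpy as np

def sparse_poly_data(n=1024, p=32, n_monomials=8, degree=4, noise=0.1, seed=0):
    """Gaussian inputs with a random covariance; y is a sparse degree-4 polynomial plus noise."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((p, p))
    cov = G @ G.T / p                                   # random covariance matrix
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = np.zeros(n)
    for _ in range(n_monomials):                        # constant number of monomials
        idx = rng.choice(p, size=degree, replace=True)  # variables entering one monomial
        y += rng.standard_normal() * np.prod(X[:, idx], axis=1)
    y += noise * np.std(y) * rng.standard_normal(n)
    return X, y

X, y = sparse_poly_data()
print(X.shape, y.shape)  # (1024, 32) (1024,)
```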
In Figure 2, we compare these methods over 40 replications in the following two situations: original data (left) and rotated data (right), i.e., after the input variables were transformed by a random rotation (in this situation, the generating polynomial is not sparse anymore). We can see that in the situation where the underlying predictor function is sparse (left), HKL outperforms the two other methods when the total number of variables $p$ increases, while in the other situation where the best predictor is not sparse (right), it performs only slightly better: i.e., in non-sparse problems, $\ell_1$-norms do not really help, but they do help a lot when sparsity is expected.

UCI datasets. For regression datasets, we compare HKL with polynomial (degree 4) and Gaussian-RBF kernels (each dimension decomposed into 9 kernels) to the following approaches with the same kernel: regular Hilbertian regularization (L2); the same greedy approach as earlier (greedy); regularization by the $\ell_1$-norm directly on the vector $\alpha$, a strategy which is sometimes used in the context of sparse kernel learning [13] but does not use the Hilbertian structure of the kernel (lasso-$\alpha$); and multiple kernel learning with the $p$ kernels obtained by summing all kernels associated with a single variable, a strategy suggested by [5] (MKL). For all methods, the kernels were held fixed, while in Table 1 we report the performance for the best regularization parameters obtained by 10 random half splits.

dataset    n     p    k     #(V)     L2         greedy      HKL
mushrooms  1024  117  pol4  10^82    0.4±0.4    0.1±0.1     0.1±0.2
mushrooms  1024  117  rbf   10^112   0.1±0.2    0.1±0.2     0.1±0.2
ringnorm   1024  20   pol4  10^14    3.8±1.1    5.9±1.3     2.0±0.3
ringnorm   1024  20   rbf   10^19    1.2±0.4    2.4±0.5     1.6±0.4
spambase   1024  57   pol4  10^40    8.3±1.0    9.7±1.8     8.1±0.7
spambase   1024  57   rbf   10^54    9.4±1.3    10.6±1.7    8.4±1.0
twonorm    1024  20   pol4  10^14    2.9±0.5    4.7±0.5     3.2±0.6
twonorm    1024  20   rbf   10^19    2.8±0.6    5.1±0.7     3.2±0.6
magic04    1024  10   pol4  10^7     15.9±1.0   16.0±1.6    15.6±0.8
magic04    1024  10   rbf   10^10    15.7±0.9   17.7±1.3    15.6±0.9

Table 2: Error rates (multiplied by 100) on UCI binary classification datasets. See text for details.

We can see from Table 1 that HKL outperforms the other methods, in particular for the datasets bank-32nm, bank-32nh, pumadyn-32nm and pumadyn-32nh, which are datasets dedicated to nonlinear regression. Note also that we efficiently explore DAGs with very large numbers of vertices $\#(V)$. For binary classification datasets, we compare HKL (with the logistic loss) to two other methods (L2, greedy) in Table 2. For some datasets (e.g., spambase), HKL works better, but for some others, in particular when the generating problem is known to be non-sparse (ringnorm, twonorm), it performs slightly worse than the other approaches.

6 Conclusion

We have shown how to perform hierarchical multiple kernel learning (HKL) in polynomial time in the number of selected kernels. This framework may be applied to many positive definite kernels, and we have focused on polynomial and Gaussian kernels used for nonlinear variable selection. In particular, this paper shows that trying to use $\ell_1$-type penalties may be advantageous inside the feature space. We are currently investigating applications to other kernels, such as the pyramid match kernel [14], string kernels, and graph kernels [2].

References

[1] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
[2] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[3] P. Zhao and B. Yu. On model selection consistency of Lasso. J. Mach. Learn. Res., 7:2541–2563, 2006.
[4] F. R. Bach. Consistency of the group Lasso and multiple kernel learning. Technical Report 00164735, HAL, 2008.
[5] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proc. ICML, 2004.
[6] P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, to appear, 2008.
[7] C. K. I. Williams and M. Seeger. The effect of the input density distribution on kernel-based classifiers. In Proc. ICML, 2000.
[8] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. More efficiency in multiple kernel learning. In Proc. ICML, 2007.
[9] M. Pontil and C. A. Micchelli. Learning the kernel function via regularization. J. Mach. Learn. Res., 6:1099–1125, 2005.
[10] H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse coding algorithms. In NIPS, 2007.
[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2003.
[12] K. Bennett, M. Momma, and J. Embrechts. MARK: A boosting algorithm for heterogeneous kernel models. In Proc. SIGKDD, 2002.
[13] V. Roth. The generalized Lasso. IEEE Trans. on Neural Networks, 15(1), 2004.
[14] K. Grauman and T. Darrell. The pyramid match kernel: Efficient learning with sets of features. J. Mach. Learn. Res., 8:725–760, 2007.