Composite Kernel Learning

Marie Szafranski (1)  MARIE.SZAFRANSKI@HDS.UTC.FR
Yves Grandvalet (1,2)  YVES.GRANDVALET@IDIAP.CH
Alain Rakotomamonjy (3)  ALAIN.RAKOTOMAMONJY@INSA-ROUEN.FR

(1) Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, 60205 Compiègne cedex, France
(2) IDIAP Research Institute, Centre du Parc, P.O. Box 592, 1920 Martigny, Switzerland
(3) LITIS EA 4051, UFR de Sciences, Université de Rouen, 76800 Saint Etienne du Rouvray, France

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

The Support Vector Machine (SVM) is an acknowledged powerful tool for building classifiers, but it lacks flexibility, in the sense that the kernel is chosen prior to learning. Multiple Kernel Learning (MKL) makes it possible to learn the kernel from an ensemble of basis kernels, whose combination is optimized in the learning process. Here, we propose Composite Kernel Learning to address the situation where distinct components give rise to a group structure among kernels. Our formulation of the learning problem encompasses several setups, putting more or less emphasis on the group structure. We characterize the convexity of the learning problem, and provide a general wrapper algorithm for computing solutions. Finally, we illustrate the behavior of our method on multi-channel data where groups correspond to channels.

1. Motivation

Kernel methods have been extensively used in learning problems (Schölkopf & Smola, 2001). In these models, the observations are implicitly mapped into a feature space via a mapping $\phi: \mathcal{X} \to \mathcal{H}$, where $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS) with reproducing kernel $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. We address the problem of learning the kernel in Support Vector Machines (SVM) and related methods. Indeed, the kernel is crucial in many respects, and its relevance is essential to the success of kernel methods.

Formally, the primary role of $K$ is to define the evaluation functional in $\mathcal{H}$: $\forall f \in \mathcal{H},\ f(x) = \langle f, K(x,\cdot)\rangle_{\mathcal{H}}$, but $K$ also defines (i) $\mathcal{H}$ itself, since $f \in \mathcal{H}$ is of the form $f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$; (ii) a metric, and hence a smoothness functional in $\mathcal{H}$: $\|f\|_{\mathcal{H}}^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j)$; (iii) a distance between observations: $\|\phi(x) - \phi(x')\|^2 = K(x,x) + K(x',x') - 2K(x,x')$.

In this paper, we devise Composite Kernel Learning (CKL), a framework where the kernel is learned in a way to favor the selection of variables or groups of variables. Section 2 motivates our approach while briefly reviewing the different means proposed to extend kernel methods beyond the predefined kernel setup. We then follow in Section 3 by considering some recent developments in variable selection that are relevant for our aims. Section 4 describes the CKL framework; the optimization algorithm is provided in Section 5, and experiments are reported in Section 6.

2. Flexible Kernel Methods

From now on, we restrict our discussion to classification, where, from a learning set $S = \{(x_i, y_i)\}_{i=1}^{n}$ of pairs of observations and labels $(x_i, y_i)$, one aims at building a decision rule that predicts the class label $y$ of any observation $x$. We furthermore focus on the binary case, where $(x_i, y_i) \in \mathcal{X} \times \{\pm 1\}$. However, it should be kept in mind that most of our observations carry over to other settings, such as multiclass classification, clustering or regression with kernel methods.

2.1. Support Vector Machines

An SVM builds the decision rule $\mathrm{sign}(f^\star(x) + b^\star)$, where $f^\star$ and $b^\star$ are defined as the solution of

    \min_{f, b, \xi} \ \frac{1}{2}\|f\|_{\mathcal{H}}^2 + C \sum_{i=1}^{n} \xi_i
    \text{s.t.} \ y_i \big( f(x_i) + b \big) \ge 1 - \xi_i , \quad \xi_i \ge 0 , \quad 1 \le i \le n .   (1)

The regularization parameter $C$ is the only adjustable parameter in this procedure. This is usually not flexible enough to provide good results when the kernel is chosen prior to seeing data. Hence, most applications of SVM incorporate a mechanism for learning the kernel.

2.2. Learning the Kernel

Cross-validation is the most rudimentary, but also the most common way to learn the kernel. It consists in (i) defining a family of kernels (e.g. Gaussian), indexed by one or more parameters (e.g. bandwidth), the so-called kernel hyper-parameters, (ii) running the SVM algorithm on each hyper-parameter setting, and (iii) finally choosing the hyper-parameters minimizing a cross-validation score. A thorough discussion of the pros and cons of cross-validation is out of the scope of this paper, but it is clear that this approach is inherently limited to one or two hyper-parameters and few trial values. This observation led to several proposals allowing for more flexibility.

2.2.1. Filters, Wrappers & Embedded Methods

Learning the kernel amounts to learning the feature mapping. It should thus be of no surprise that the approaches investigated bear some similarities with the ones developed for variable selection, where one encounters filters, wrappers and embedded methods (Guyon & Elisseeff, 2003). Some general frameworks do not belong to a single category, but the distinction is appropriate in most cases.

In filter approaches, the kernel is adjusted before building the SVM, with no explicit relationship to the objective value of Problem (1). For example, the kernel target alignment of Cristianini et al. (2002) adapts the kernel to the available data without training any classifier. In wrapper algorithms, the SVM solver is the inner loop of two nested optimizers, whose outer loop is dedicated to adjusting the kernel. This tuning may be guided by various generalization bounds (Cristianini et al., 1999; Weston et al., 2001; Chapelle et al., 2002). Kernel learning can also be embedded in Problem (1), with the SVM objective value minimized jointly with respect to the SVM parameters and the kernel hyper-parameters (Grandvalet & Canu, 2003). Our approach, which belongs to this family of methods, is based on the Multiple Kernel Learning (MKL) framework (Lanckriet et al., 2004).

2.2.2. Multiple Kernel Learning

MKL is a joint optimization problem of the coefficients of the SVM classifier and a convex combination of kernels that defines the actual SVM kernel

    K(x, x') = \sum_{m=1}^{M} \sigma_m K_m(x, x') ,   (2)

where each kernel $K_m$ is associated to a RKHS $\mathcal{H}_m$ whose elements will be denoted $f_m$, and $\{\sigma_m\}_{m=1}^{M}$ are coefficients to be learned under the convex combination constraints

    \sum_{m=1}^{M} \sigma_m = 1 , \quad \sigma_m \ge 0 , \quad 1 \le m \le M .   (3)

Bach et al. (2004) proposed the following formulation of MKL (to lighten notations, the range of indexes is often omitted in summations, in which case indexes $i$ and $j$ refer to examples and go from 1 to $n$, index $m$ refers to kernels and goes from 1 to $M$, and index $\ell$ refers to groups of kernels and goes from 1 to $L$):

    \min_{f_1,\dots,f_M, b, \xi} \ \frac{1}{2}\Big(\sum_m \|f_m\|_{\mathcal{H}_m}\Big)^2 + C \sum_i \xi_i
    \text{s.t.} \ y_i \Big( \sum_m f_m(x_i) + b \Big) \ge 1 - \xi_i , \quad \xi_i \ge 0 , \quad 1 \le i \le n ,   (4)

whose solution leads to a decision rule of the form $\mathrm{sign}(\sum_m f_m^\star(x) + b^\star)$. This expression of the learning problem is remarkable in that it only deviates slightly from the original SVM problem (1). The squared RKHS norm in $\mathcal{H}$ is simply replaced by a mixed-norm, with the standard RKHS norm within each feature space $\mathcal{H}_m$, and an $\ell_1$ norm in $\mathbb{R}^M$ on the vector built by concatenating these norms.
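To make the combination in Equations (2)-(3) concrete, the following sketch is our own illustration (the helper name combine_kernels is ours, not part of the original work): it builds the aggregated Gram matrix from a list of base Gram matrices and a weight vector on the simplex.

```python
import numpy as np

def combine_kernels(grams, sigma, tol=1e-8):
    """Weighted sum of base Gram matrices, as in Eq. (2).

    grams : list of (n, n) arrays, one Gram matrix per base kernel K_m
    sigma : (M,) array of combination weights, expected on the simplex of Eq. (3)
    """
    sigma = np.asarray(sigma, dtype=float)
    if np.any(sigma < -tol) or abs(sigma.sum() - 1.0) > tol:
        raise ValueError("weights must be non-negative and sum to one")
    return sum(s * G for s, G in zip(sigma, grams))

# Toy usage: two Gaussian kernels with different bandwidths on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
grams = [np.exp(-sq_dists / (2 * bw ** 2)) for bw in (0.5, 2.0)]
K = combine_kernels(grams, [0.3, 0.7])
```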
This $\ell_1$ norm encourages sparse solutions, that is, solutions where some functions $f_m$ have zero norm. In this respect, the MKL problem may be seen as the kernelization of the group-LASSO (Yuan & Lin, 2006).

2.2.3. Composite Kernel Learning

When the individual kernels $K_m$ represent a series, such as Gaussian kernels with different scale parameters, MKL may be used as an alternative to cross-validation. When the input data originates from $M$ different sources, and each kernel is affiliated to one input variable, MKL can be used to select relevant input variables. However, MKL is not meant to address problems where several kernels pertain to one input variable. In this situation, the sparseness mechanism of MKL does not favor solutions discarding all the kernels computed from an irrelevant input. Although most of the related coefficients should vanish in combination (2), spurious correlations may cause irrelevant input variables to participate in the solution. The flat combination of kernels in MKL does not include a mechanism to cluster the kernels related to one input variable.

In order to favor the selection of kernels within predefined groups, one has to define a group structure among kernels, which will guide the selection process through a structured kernel combination. This type of hierarchy among variables has been investigated in linear models (Szafranski et al., 2008; Zhao et al., to appear). We briefly recapitulate the general framework in the following section, before discussing its adaptation to kernel learning in Section 4.

3. Grouped and Hierarchical Selection

The introduction of $\ell_1$ penalties, with the seminal paper of Tibshirani (1996) on the LASSO, gave rise to many important theoretical and practical advances in the statistics and machine learning fields. As stated in Section 2.2.2, MKL itself belongs to the series of algorithms affiliated to the LASSO, through its relationship with the group-LASSO. In this lineage, Zhao et al. (to appear) defined the very general Composite Absolute Penalties (CAP) family.

3.1. Composite Absolute Penalties

Consider a linear model with $M$ parameters, $\beta = (\beta_1, \dots, \beta_M)^t$, and let $\mathcal{I} = \{1, \dots, M\}$ be the set of indexes on these parameters. A group structure on the parameters is defined by a series of $L$ subsets $\{\mathcal{G}_\ell\}_{\ell=1}^{L}$, where $\mathcal{G}_\ell \subseteq \mathcal{I}$. Additionally, let $\{\gamma_\ell\}_{\ell=0}^{L}$ be $L+1$ norm parameters. Then, the member of the CAP family for the chosen groups and norm parameters is

    \sum_\ell \Big( \sum_{m \in \mathcal{G}_\ell} |\beta_m|^{\gamma_\ell} \Big)^{\gamma_0/\gamma_\ell} .   (5)

Mixed-norms correspond to groups defined as a partition of the set of variables. A CAP may also rely on nested groups, $\mathcal{G}_1 \subset \mathcal{G}_2 \subset \dots \subset \mathcal{G}_L$, and $\gamma_0 = 1$, in which case it favors what Zhao et al. call hierarchical selection, that is, the selection of groups of variables in the predefined order $\{\mathcal{I} \setminus \mathcal{G}_L\}, \{\mathcal{G}_L \setminus \mathcal{G}_{L-1}\}, \dots, \{\mathcal{G}_2 \setminus \mathcal{G}_1\}, \mathcal{G}_1$. This example is provided here to stress that Zhao et al.'s notion of hierarchy differs from the one that follows.

3.2. Hierarchical Penalization

Hierarchical penalization uses shrinking coefficients to transform a ridge-like penalty into a sparse penalizer (Szafranski et al., 2008). The model parameterized by $\beta$ is fitted by minimizing a differentiable loss function $J(\cdot)$, subject to a ridge penalty with adaptive coefficients that encourages sparseness among and within groups:

    \min_{\beta, \sigma_1, \sigma_2} \ J(\beta) + \lambda \sum_\ell \sum_{m \in \mathcal{G}_\ell} \frac{\beta_m^2}{\sqrt{\sigma_{1,\ell}\,\sigma_{2,m}}}
    \text{s.t.} \ \sum_\ell d_\ell\, \sigma_{1,\ell} = 1 , \quad \sigma_{1,\ell} \ge 0 , \quad 1 \le \ell \le L
    \phantom{\text{s.t.}} \ \sum_m \sigma_{2,m} = 1 , \quad \sigma_{2,m} \ge 0 , \quad 1 \le m \le M .   (6)

The Lagrange parameter $\lambda$ controls the amount of shrinkage, and $d_\ell$ is the size of group $\mathcal{G}_\ell$. The constraints expressed on the two last lines encourage sparseness in $\sigma_{1,\ell}$ and $\sigma_{2,m}$, which induces sparseness in $\beta_m$. Here, the groups $\mathcal{G}_\ell$ form a partition of $\mathcal{I}$, and the hierarchy refers to the tree structure of the shrinking coefficients: $\sigma_{2,m}$ shrinks parameter $\beta_m$, while $\sigma_{1,\ell}$ shrinks the parameters of group $\mathcal{G}_\ell$. In the words of Zhao et al., the objective here is grouped variable selection. The minimizer of Problem (6) is the minimizer of

    \min_\beta \ J(\beta) + \lambda \Big[ \sum_\ell d_\ell^{1/4} \Big( \sum_{m \in \mathcal{G}_\ell} |\beta_m|^{4/3} \Big)^{3/4} \Big]^2 ,

which is essentially a CAP estimate, where parameter $d_\ell$ only accounts for the group sizes (Szafranski et al., 2008). The inner 4/3 norm and the outer 1 norm form a mixed-norm penalty that will be denoted $(4/3, 1)$. The overall penalizer favors sparse solutions at the group level, with few leading coefficients within the selected groups.

4. From Multiple to Composite Kernels

MKL has been formalized as a quadratically constrained program by Lanckriet et al. (2004), then as a second-order cone program by Bach et al. (2004). More recently, other formulations led to wrapper algorithms, where the optimization with respect to kernel hyper-parameters is still based on the SVM objective value, but is performed in an outer loop that wraps a standard SVM solver. The outer loop uses cutting planes for Sonnenburg et al. (2006), and gradient descent for Rakotomamonjy et al. (2007).

Wrapper algorithms have appealing features: (i) they benefit from the developments of solvers specifically tailored to the SVM problem in the inner loop; (ii) they allow to address large-scale problems; (iii) they are multipurpose, since the SVM inner loop may be replaced by another algorithm with little or no adjustments.

We chose to build on gradient-based MKL. First, it has been shown to be more efficient than the SILP approach of Sonnenburg et al. (2006), thanks to the stability of the updates performed in the outer loop, which induces good initializations for the inner loop solver (Rakotomamonjy et al., 2007). Second, and even more important for our purpose, gradient-based MKL is amenable to the extension to groups of kernels, thanks to the formulation of hierarchical penalization of Section 3.2.

4.1. Variational Multiple Kernel Learning

Problem (4) is not differentiable at $\|f_m\|_{\mathcal{H}_m} = 0$, a difficulty that causes a considerable algorithmic burden. The MKL formulation of Rakotomamonjy et al. (2007) can be viewed as a variational form of Problem (4), where $M$ new variables $\sigma_1, \dots, \sigma_M$ are introduced in order to avoid these differentiability issues. The resulting problem, which is equivalent to Problem (4), is stated as:

    \min_{f_1,\dots,f_M, b, \xi, \sigma} \ \frac{1}{2}\sum_m \frac{1}{\sigma_m}\|f_m\|_{\mathcal{H}_m}^2 + C \sum_i \xi_i
    \text{s.t.} \ y_i \Big( \sum_m f_m(x_i) + b \Big) \ge 1 - \xi_i , \quad \xi_i \ge 0 , \quad 1 \le i \le n
    \phantom{\text{s.t.}} \ \sum_m \sigma_m = 1 , \quad \sigma_m \ge 0 , \quad 1 \le m \le M .   (7)

Here and in what follows, $u/v$ is defined by continuation at zero as $u/0 = \infty$ if $u \ne 0$ and $0/0 = 0$. The constraints expressed on the last line encourage sparseness in $\sigma_m$, which induces sparseness in $f_m$. As already mentioned in Section 2.2.2, the sparseness applies at the kernel level, ignoring the group structure. The latter is taken into account in the formulation proposed in the following section.

4.2. Variational Composite Kernel Learning

Here, we build on the variational form of the composite absolute penalties presented in Section 3.2 to take into account the group structure. Hierarchical penalization can deal with kernel methods if the ridge penalties are replaced by RKHS norms. We first generalize Problem (6) to obtain smooth variational formulations for arbitrary mixed-norm penalties, so as to address a wide variety of problems including MKL:

    \min_{f_1,\dots,f_M, b, \xi, \sigma_1, \sigma_2} \ \frac{1}{2}\sum_\ell \sum_{m \in \mathcal{G}_\ell} \sigma_{1,\ell}^{-p}\, \sigma_{2,m}^{-q}\, \|f_m\|_{\mathcal{H}_m}^2 + C \sum_i \xi_i
    \text{s.t.} \ y_i \Big( \sum_m f_m(x_i) + b \Big) \ge 1 - \xi_i , \quad \xi_i \ge 0 , \quad 1 \le i \le n
    \phantom{\text{s.t.}} \ \sum_\ell d_\ell\, \sigma_{1,\ell} = 1 , \quad \sigma_{1,\ell} \ge 0 , \quad 1 \le \ell \le L
    \phantom{\text{s.t.}} \ \sum_m \sigma_{2,m} = 1 , \quad \sigma_{2,m} \ge 0 , \quad 1 \le m \le M ,   (8)

where $p$ and $q$ are exponents to be set according to the problem at hand.

This formulation, which is difficult to optimize, is simplified by replacing the two shrinking coefficients $\sigma_{1,\ell}$ and $\sigma_{2,m}$ by $\sigma_m$, defined by $\sigma_m = \sigma_{1,\ell}^{p}\, \sigma_{2,m}^{q}$. In a first step, we consider the change of variable that maps $\sigma_2$ to $\sigma$. When $q \ne 0$, this mapping is one-to-one provided $\sigma_{1,\ell} \ne 0$. Furthermore, if $\sigma_{1,\ell}^\star$ and $\sigma_{2,m}^\star$ denote the optimal $\sigma_{1,\ell}$ and $\sigma_{2,m}$ values for Problem (8), we have that $\sigma_{1,\ell}^\star = 0 \Rightarrow \sigma_{2,m}^\star = 0$, hence Problem (8) is equivalent to

    \min_{f_1,\dots,f_M, b, \xi, \sigma_1, \sigma} \ \frac{1}{2}\sum_m \frac{1}{\sigma_m}\|f_m\|_{\mathcal{H}_m}^2 + C \sum_i \xi_i
    \text{s.t.} \ y_i \Big( \sum_m f_m(x_i) + b \Big) \ge 1 - \xi_i , \quad \xi_i \ge 0 , \quad 1 \le i \le n
    \phantom{\text{s.t.}} \ \sum_\ell d_\ell\, \sigma_{1,\ell} = 1 , \quad \sigma_{1,\ell} \ge 0 , \quad 1 \le \ell \le L
    \phantom{\text{s.t.}} \ \sum_\ell \sigma_{1,\ell}^{-p/q} \sum_{m \in \mathcal{G}_\ell} \sigma_m^{1/q} \le 1 , \quad \sigma_m \ge 0 , \quad 1 \le m \le M .   (9)

The new problem is simplified further by showing that $\sigma_1$ can be dropped out from the optimization process, leading to the following formulation of Composite Kernel Learning (CKL):

    \min_{f_1,\dots,f_M, b, \xi, \sigma} \ \frac{1}{2}\sum_m \frac{1}{\sigma_m}\|f_m\|_{\mathcal{H}_m}^2 + C \sum_i \xi_i
    \text{s.t.} \ y_i \Big( \sum_m f_m(x_i) + b \Big) \ge 1 - \xi_i , \quad \xi_i \ge 0 , \quad 1 \le i \le n
    \phantom{\text{s.t.}} \ \sum_\ell \Big( d_\ell^{\,p} \Big( \sum_{m \in \mathcal{G}_\ell} \sigma_m^{1/q} \Big)^{q} \Big)^{1/(p+q)} \le 1 , \quad \sigma_m \ge 0 , \quad 1 \le m \le M .   (10)
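As an illustration of the admissible set of Problem (10), the short sketch below evaluates the left-hand side of the mixed constraint for a candidate weight vector; it is our own helper (the name ckl_constraint is ours) and assumes $q > 0$. For $p = 0$, $q = 1$ it reduces to the MKL simplex constraint (3).

```python
import numpy as np

def ckl_constraint(sigma, groups, p, q):
    """Left-hand side of the CKL constraint in Problem (10), assuming q > 0.

    sigma  : (M,) non-negative kernel weights
    groups : list of index arrays, groups[l] = indices of the kernels in G_l
    """
    sigma = np.asarray(sigma, dtype=float)
    total = 0.0
    for idx in groups:
        d = len(idx)                               # group size d_l
        inner = np.sum(sigma[idx] ** (1.0 / q))    # sum_{m in G_l} sigma_m^{1/q}
        total += (d ** p * inner ** q) ** (1.0 / (p + q))
    return total  # feasible sigma satisfy ckl_constraint(sigma, ...) <= 1

# With p = 0, q = 1 this is sum_m sigma_m, i.e. the MKL simplex of Eq. (3).
groups = [np.array([0, 1]), np.array([2, 3, 4])]
sigma = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
print(ckl_constraint(sigma, groups, p=0.0, q=1.0))   # ~ 1.0: the constraint is tight
```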
Before considering particular settings of interest, we state below two helpful propositions. The first one gives a more interpretable formulation of Problem (10); the second one presents the conditions for convexity of formulation (10), which will guarantee the convergence towards the global minimum for the algorithm described in Section 5.

Proposition 1 (CAP formulation). Problem (10) is equivalent to the following MKL problem with a CAP-like penalty on the RKHS norms:

    \min_{f_1,\dots,f_M, b, \xi} \ \frac{1}{2}\Big[ \sum_\ell d_\ell^{\,s} \Big( \sum_{m \in \mathcal{G}_\ell} \|f_m\|_{\mathcal{H}_m}^{\gamma} \Big)^{\gamma_0/\gamma} \Big]^{2/\gamma_0} + C \sum_i \xi_i
    \text{s.t.} \ y_i \Big( \sum_m f_m(x_i) + b \Big) \ge 1 - \xi_i , \quad \xi_i \ge 0 , \quad 1 \le i \le n ,   (11)

with $\gamma = \frac{2}{q+1}$, $\gamma_0 = \frac{2}{p+q+1}$ and $s = 1 - \gamma_0/\gamma$.

Sketch of proof. Let $\mathcal{L}$ be the Lagrangian of Problem (10). The optimality conditions for $\sigma_m$ are obtained from the first-order optimality conditions ($\partial\mathcal{L}/\partial\sigma_m = 0$):

    \sigma_m = d_\ell^{-s}\, s_\ell^{(\gamma - \gamma_0)/\gamma}\, \|f_m\|_{\mathcal{H}_m}^{2-\gamma}\, \Big( \sum_{\ell'} d_{\ell'}^{\,s}\, s_{\ell'}^{\gamma_0/\gamma} \Big)^{(\gamma_0 - 2)/\gamma_0} ,   (12)

where $s_\ell = \sum_{m \in \mathcal{G}_\ell} \|f_m\|_{\mathcal{H}_m}^{\gamma}$ and $\ell$ denotes the group containing kernel $m$. Plugging this expression in Problem (10) yields the claimed result.
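The closed form (12) can be turned into a direct update of $\sigma$ from the per-kernel norms. The sketch below follows our reconstruction of Equation (12) above (the helper name sigma_closed_form is ours); as a sanity check, for $p = 0$, $q = 1$ it reduces to the familiar MKL update $\sigma_m = \|f_m\| / \sum_{m'} \|f_{m'}\|$.

```python
import numpy as np

def sigma_closed_form(norms, groups, p, q):
    """Closed-form weights of Eq. (12), assuming the reconstruction given above.

    norms  : (M,) array of RKHS norms ||f_m||_{H_m}
    groups : list of index arrays defining the groups G_l
    """
    gam = 2.0 / (q + 1.0)              # inner exponent gamma
    gam0 = 2.0 / (p + q + 1.0)         # outer exponent gamma_0
    s_exp = 1.0 - gam0 / gam           # exponent s on the group sizes d_l
    norms = np.asarray(norms, dtype=float)

    s_l = np.array([np.sum(norms[idx] ** gam) for idx in groups])
    d_l = np.array([len(idx) for idx in groups], dtype=float)
    S = np.sum(d_l ** s_exp * s_l ** (gam0 / gam))

    sigma = np.zeros_like(norms)
    for l, idx in enumerate(groups):
        sigma[idx] = (d_l[l] ** (-s_exp) * s_l[l] ** ((gam - gam0) / gam)
                      * norms[idx] ** (2.0 - gam) * S ** ((gam0 - 2.0) / gam0))
    return sigma

# With p = 0, q = 1: sigma_m = ||f_m|| / sum_m' ||f_m'||, the usual MKL update.
```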
Note that the outer exponent $\gamma_0$ only influences the strength of the penalty, not its type. Hence, the penalty in the objective function (11) differs from (5) in the RKHS norms $\|\cdot\|_{\mathcal{H}_m}$ and in the parameters $d_\ell$ that accommodate for group sizes.

Proposition 2 (Conditions for convexity). Problem (10) is convex if and only if $0 \le q \le 1$ and $0 \le p + q \le 1$.

Proof. A problem minimizing a convex criterion on a convex set is convex. The objective function of Problem (10) is convex (Boyd & Vandenberghe, 2004, p. 89). The first, second and fourth constraints define convex sets, and the third one does also, provided (i) $\big(\sum_{m \in \mathcal{G}_\ell} \sigma_m^{1/q}\big)^{q}$ is a norm, that is $0 \le q \le 1$, and (ii) $t^{1/(p+q)}$ is convex in $t$, that is $0 \le p + q \le 1$.

Within the values of $p$ and $q$ ensuring convexity, we pick the following particular cases of interest:

· $p = 0$, $q = 1$ yields a LASSO type penalty on the RKHS norms. It results in the generalization of the group-LASSO known as MKL, as formulated in (4);

· $p = 1$, $q = 0$ yields a group-LASSO type penalty on the RKHS norms. It results in another MKL, with $L$ effective kernels $K_\ell$, defined as $K_\ell = \sum_{m \in \mathcal{G}_\ell} K_m$;

· $p = q = \frac{1}{2}$ yields a hierarchical-penalization type penalty on the RKHS norms. It is a true CKL, where there are $M$ effective kernels, and where the penalty favors sparse solutions at the group level, with few leading kernels within the selected groups.

Hence, when $p$ goes from zero to one, with $q = 1 - p$, the penalty gives more and more emphasis to the group structure. For most applications where convexity is a key issue, we recommend the balanced setup $p = q = \frac{1}{2}$. Note however that convex penalties restrict the sparseness of the solution to either the group level or the kernel level. In Section 6, we will illustrate that giving up convexity may turn out to be an interesting option when considering interpretability issues.

5. Algorithm

Our approach to solve Problem (10) draws on the MKL algorithm of Rakotomamonjy et al. (2007). We use the wrapper scheme described below, where the outer loop is carried out by a projected gradient descent update.

5.1. A Gradient-Based Wrapper

The wrapper scheme considers the following constrained optimization problem:

    \min_\sigma \ J(\sigma) \quad \text{s.t.} \quad \sum_\ell \Big( d_\ell^{\,p} \Big( \sum_{m \in \mathcal{G}_\ell} \sigma_m^{1/q} \Big)^{q} \Big)^{1/(p+q)} \le 1 , \quad \sigma_m \ge 0 , \quad 1 \le m \le M ,

where $J(\sigma)$ is defined as the objective value of

    \min_{f_1,\dots,f_M, b, \xi} \ \frac{1}{2}\sum_m \frac{1}{\sigma_m}\|f_m\|_{\mathcal{H}_m}^2 + C \sum_i \xi_i
    \text{s.t.} \ y_i \Big( \sum_m f_m(x_i) + b \Big) \ge 1 - \xi_i , \quad \xi_i \ge 0 , \quad 1 \le i \le n .   (13)

The global optimization problem thus consists of two nested problems. In the inner loop, the criterion is optimized with respect to $f_1, \dots, f_M$, $b$ and $\xi$, considering that the coefficients $\sigma$ are fixed. In the outer loop, $\sigma$ is updated to decrease the criterion, with $f_m$, $b$ and $\xi$ being fixed. Equation (12) may be used to update $\sigma$ in closed form. However, this approach lacks convergence guarantees and may lead to numerical problems, in particular when some elements of $\sigma$ approach zero. Hence, following Rakotomamonjy et al. (2007), we use the fact that the objective function $J(\sigma)$ is actually an optimal SVM objective value to update $\sigma$ by an efficient projected gradient descent scheme.

5.2. Computing the Gradient

The dual formulation offers a convenient means to compute the gradient $\nabla J(\sigma)$. The derivation of the Lagrangian of Problem (13), which is omitted here for brevity, shows that its dual formulation is identical to the one of a standard SVM using the aggregated kernel $K$ defined in Equation (2). Hence, the dual problem takes the usual form

    \max_\alpha \ -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_i \alpha_i
    \text{s.t.} \ \sum_i \alpha_i y_i = 0 , \quad C \ge \alpha_i \ge 0 , \quad 1 \le i \le n ,   (14)

which can be solved by any SVM solver. As $J(\sigma)$ is defined as the optimal objective value of the convex Problem (13), for which strong duality applies, $J(\sigma)$ is also the dual objective value:

    J(\sigma) = -\frac{1}{2}\sum_{i,j} \alpha_i^\star \alpha_j^\star y_i y_j K(x_i, x_j) + \sum_i \alpha_i^\star ,   (15)

where $\alpha^\star$ solves Problem (14).

The existence and computation of the derivatives of $J(\cdot)$ follow from general results on optimal value functions, such as Theorem 4.1 of Bonnans and Shapiro (1998), which, in a nutshell, states that the differentiability of $J(\sigma)$ is ensured by the unicity of $\alpha^\star$ and by the differentiability of (15). (The unicity of $\alpha^\star$ is ensured provided that the Gram matrix built from kernel $K$ is positive definite; to enforce this property, a small ridge may be added to the diagonal.) Furthermore, the derivatives of $J(\sigma)$ can be computed as if $\alpha^\star$ did not depend on $\sigma$. Thus, the gradient $\nabla J(\sigma)$ is simply

    \frac{\partial J}{\partial \sigma_m} = -\frac{1}{2}\sum_{i,j} \alpha_i^\star \alpha_j^\star y_i y_j K_m(x_i, x_j) , \quad 1 \le m \le M .
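In practice, one SVM call with the aggregated kernel yields both $J(\sigma)$ and its gradient. The sketch below is our illustration of Section 5.2, not the authors' code: it assumes a binary problem with precomputed Gram matrices, uses scikit-learn's SVC as the inner solver, and the function name gradient_J is ours.

```python
import numpy as np
from sklearn.svm import SVC  # any SVM solver accepting a precomputed kernel would do

def gradient_J(grams, sigma, y, C=1.0):
    """Value and gradient of J(sigma): solve the SVM for the combined kernel,
    then dJ/dsigma_m = -1/2 * sum_ij alpha_i alpha_j y_i y_j K_m(x_i, x_j)."""
    K = sum(s * G for s, G in zip(sigma, grams))
    svm = SVC(C=C, kernel="precomputed").fit(K, y)
    sv = svm.support_                        # indices of the support vectors
    ay = svm.dual_coef_.ravel()              # alpha_i * y_i on the support vectors
    grad = np.array([-0.5 * ay @ G[np.ix_(sv, sv)] @ ay for G in grams])
    J = -0.5 * ay @ K[np.ix_(sv, sv)] @ ay + np.abs(ay).sum()   # dual value, Eq. (15)
    return J, grad
```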
5.3. CKL Algorithm

Now, we have all the ingredients to adapt the machinery developed for MKL by Rakotomamonjy et al. (2007). According to the process described in Section 5.1, we propose Algorithm 1.

Algorithm 1  Composite Kernel Learning
  initialize $\sigma$
  solve the SVM problem $\to J(\sigma)$
  repeat
    compute the direction $d = -\nabla J(\sigma)$
    repeat
      compute $\tilde d$, the projection of $d$ onto the tangent of the surface of the admissible set
      compute the smallest step that nullifies a component of $\sigma$:
        $S = \{ j : \tilde d_j < 0 \text{ and } \sigma_j \ne 0 \}$ , $\nu = \min_{j \in S} \big( -\sigma_j / \tilde d_j \big)$ , $k = \arg\min_{j \in S} \big( -\sigma_j / \tilde d_j \big)$
      $\tilde\sigma = \sigma + \nu \tilde d$ , $\tilde d_k = 0$
      project $\tilde\sigma$ onto the surface of the admissible set
      solve the SVM problem $\to J(\tilde\sigma)$
      if $J(\tilde\sigma) < J(\sigma)$ then $\sigma = \tilde\sigma$
    until $J(\tilde\sigma) \ge J(\sigma)$
    compute $\nu^\star = \arg\min_\nu J(\sigma + \nu \tilde d)$
    $\sigma = \sigma + \nu^\star \tilde d$
  until convergence

The stopping criterion for assessing the convergence of the outer loop can be based on standard criteria for gradient-based algorithms or on the duality gap. In the following experiments, it is based on the stability of $\sigma$ and $J(\sigma)$.
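For intuition, here is a deliberately simplified outer loop in the spirit of Algorithm 1, restricted to the MKL case $p = 0$, $q = 1$, where the admissible set is the simplex of Equation (3). It replaces the projected descent direction and line search of Algorithm 1 by a plain gradient step followed by a Euclidean projection; the gradient routine is passed in (e.g. the gradient_J sketch above), and the step size, iteration budget and tolerance are illustrative choices, not the authors' settings.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {x : x >= 0, sum x = 1} (the admissible set
    when p = 0, q = 1; the general CKL set would require another projection)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def ckl_wrapper(grams, y, grad_fn, C=1.0, step=0.1, n_iter=50, tol=1e-5):
    """Simplified outer loop: gradient step on sigma, then projection,
    re-solving the SVM at each iteration through grad_fn."""
    M = len(grams)
    sigma = np.full(M, 1.0 / M)          # uniform, feasible initialization
    J_old = np.inf
    for _ in range(n_iter):
        J, grad = grad_fn(grams, sigma, y, C)
        if abs(J_old - J) < tol:         # stop on the stability of J(sigma)
            break
        sigma = project_simplex(sigma - step * grad)
        J_old = J
    return sigma
```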
6. Channel Selection for BCI

This experiment deals with single-trial classification of EEG signals coming from a Brain-Computer Interface (BCI). Depending on each BCI paradigm, these EEG signals are recorded from specific electrode positions. However, as stated by Schröder et al. (2005), automated channel selection should be performed for each single subject, since it leads to better performances or to a substantial reduction of the number of useful channels. Reducing the number of channels involved in the decision function is of primary importance for BCI real-life applications, since it makes the acquisition system easier to use and to set up.

We use here the dataset from the BCI 2003 competition for the task of interfacing the P300 Speller (Blankertz et al., 2004). The dataset consists in 7560 EEG signals paired with positive or negative stimuli responses. The signal, processed as in (Rakotomamonjy et al., 2005), leads to 7560 examples of dimension 896 (14 time frames for each of the 64 channels). The experimental protocol is then the following: we have randomly picked 567 training examples from the dataset and used the remaining as testing examples. The parameter $C$ has been selected by retaining a small part of the training set as a validation set and keeping the value with the highest AUC. This overall procedure has been repeated 10 times. Using a small part of the examples for training can be justified by the use of an ensemble of SVMs (that we do not consider here) at a later stage of the EEG classification procedure (Rakotomamonjy et al., 2005), and the AUC performance measure is justified by the way the EEG recognition is transformed into a selected character in the P300 speller.

The 896 features extracted from the EEG signals are not transformed before classification: we do not use any kernelization. However, to unify the presentation, we will refer to these features as linear kernels. Hence, in this application where the kernels related to a given channel form a group of kernels, we have to learn $M = 896$ coefficients $\sigma_m$, divided into $L = 64$ groups. CKL is well suited to the classification objectives, since we aim at classifying the EEG trials with as few channels as possible. Furthermore, it is also likely that some time frames are irrelevant, so that variable selection may be carried out within each channel. To reach a sparse solution at the channel and the time frame levels, we test a non-convex parametrization of CKL that encourages sparseness within and between groups. In the following, CKL1/2 stands for a convex version of our algorithm, with $p = q = 1/2$ (a $(4/3, 1)$ mixed-norm), and CKL1 is a non-convex version, with $p = q = 1$ (a $(1, 2/3)$ (pseudo) mixed-norm). Note that MKL is also implemented by our algorithm, with $p = 0$ and $q = 1$.

Table 1 summarizes the average performance of SVM, MKL, and CKL, that is, for 4 different penalization terms: quadratic penalization for the classical SVM (which is trained with the mean of the 896 kernels), $\ell_1$ norm for MKL, and mixed-norms for the two versions of CKL: CKL1/2 and CKL1. The number of channels and kernels selected by these algorithms is also reported.

Table 1. Average results for SVMs with 4 different penalization terms on the BCI datasets.

  Algorithm | AUC          | # Channels | # Kernels
  SVM       | 83.87 ± 0.8  | 64         | 896
  MKL       | 85.43 ± 0.9  | 62.2 ± 1   | 255.8 ± 15
  CKL1/2    | 85.49 ± 1.1  | 62.9 ± 1   | 835.7 ± 25
  CKL1      | 84.15 ± 0.8  | 24.0 ± 4   | 60.9 ± 10

The prediction performances of the 4 algorithms are similar, with a slight advantage for sparse methods. CKL1/2 is much less sparse than MKL, which itself keeps about four times as many kernels as CKL1. In the number of groups, MKL and CKL1/2 behave similarly, with only one or two channels removed. CKL1 is much sparser and removes about two thirds of the channels.

Figure 1 represents the median relevance of the electrodes over the 10 experiments. It displays which electrodes have been selected by the different kernel learning methods. For one experiment, the relevance of channel $\ell$ is computed as the relative contribution of group $\ell$ to the norm of the solution, that is

    \frac{1}{Z} \sum_{m \in \mathcal{G}_\ell} \sigma_m \|f_m\|_{\mathcal{H}_m}^2 ,

where $Z$ is a normalization factor that sets the sum of relevances to one.
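This relevance measure is straightforward to reproduce from the learned weights and per-kernel norms; the sketch below (the helper name channel_relevance is ours) follows the definition above.

```python
import numpy as np

def channel_relevance(sigma, norms, groups):
    """Relative contribution of each group (channel) to the norm of the solution:
    rel_l = (1/Z) * sum_{m in G_l} sigma_m * ||f_m||^2, with Z setting the sum
    of relevances to one."""
    contrib = np.array([np.sum(sigma[idx] * norms[idx] ** 2) for idx in groups])
    return contrib / contrib.sum()
```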
The results for CKL1 are particularly neat, with high relevances for the electrodes in the areas of the visual cortex (especially the lateral electrodes PO7 and PO8), and of the primary motor and somatosensory cortex (C· and CPZ). The scalp maps for MKL and CKL1/2 are very similar and show the importance of the same regions. In addition, they also highlight numerous frontal electrodes that are not likely to be relevant for the BCI P300 Speller paradigm.

[Figure 1. Electrode median relevance for MKL (left), CKL1/2 (center) and CKL1 (right). The darker the color, the higher the relevance. Electrodes in white with a black circle are discarded (the relevance is exactly zero). The arrow represents the frontal direction.]

7. Conclusion and Further Works

This paper is at the crossroad of kernel learning and variable selection. From the former viewpoint, we extended the multiple kernel learning problem to take into account the group structure among kernels. From the latter viewpoint, we generalized the hierarchical penalization framework to kernel classifiers by considering penalties in RKHS instead of parametric function spaces. As a side contribution, we also provide a smooth variational formulation for arbitrary mixed-norm penalties, enabling to tackle a wide variety of problems. This formulation is not restricted to convex mixed-norms, a property that turns out to be of interest for reaching sparser, hence more interpretable solutions.

Our approach is embedded, in the sense that the kernel hyper-parameters are optimized jointly with the parameters of the classifier to minimize the soft-margin criterion. It is however implemented by a simple wrapper algorithm, for which the inner and the outer subproblems have the same objective function, and where the inner loop is a standard SVM problem. In particular, this implementation allows to use available solvers for kernel machines in the inner loop. Hence, although this paper considered binary classification problems, our approach can be readily extended to other learning problems, such as multiclass classification, clustering, regression or ranking.

Acknowledgments

This work was supported in part by the IST Program of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. The authors would like to thank N. Bourdaud and G. Garipelli for their help in the interpretation of the BCI experiments.

References

Bach, F. R., Lanckriet, G. R. G., & Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. Proceedings of the Twenty-First International Conference on Machine Learning (pp. 41-48).

Blankertz, B., Müller, K.-R., Curio, G., Vaughan, T. M., Schalk, G., Wolpaw, J. R., Schlögl, A., Neuper, C., Pfurtscheller, G., Hinterberger, T., Schröder, M., & Birbaumer, N. (2004). The BCI competition 2003: progress and perspectives in detection and discrimination of EEG single trials. IEEE Trans. Biomed. Eng., 51, 1044-1051.

Bonnans, J., & Shapiro, A. (1998). Optimization problems with perturbations: A guided tour. SIAM Review, 40, 228-264.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46, 131-159.

Cristianini, N., Campbell, C., & Shawe-Taylor, J. (1999). Dynamically adapting kernels in support vector machines. Advances in Neural Information Processing Systems 11 (pp. 204-210). MIT Press.

Cristianini, N., Shawe-Taylor, J., Elisseeff, A., & Kandola, J. (2002). On kernel-target alignment. Advances in Neural Information Processing Systems 14 (pp. 367-373). MIT Press.

Grandvalet, Y., & Canu, S. (2003). Adaptive scaling for feature selection in SVMs. Advances in Neural Information Processing Systems 15 (pp. 569-576). MIT Press.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.

Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. I. (2004). Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5, 27-72.

Rakotomamonjy, A., Bach, F., Canu, S., & Grandvalet, Y. (2007). More efficiency in multiple kernel learning. Proceedings of the 24th International Conference on Machine Learning (ICML 2007) (pp. 775-782). Omnipress.

Rakotomamonjy, A., Guigue, V., Mallet, G., & Alvarado, V. (2005). Ensemble of SVMs for improving brain-computer interface P300 speller performances. 15th International Conference on Artificial Neural Networks (pp. 45-50). Springer.

Schölkopf, B., & Smola, A. J. (2001).
Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press.

Schröder, M., Lal, T. N., Hinterberger, T., Bogdan, M., Hill, J., Birbaumer, N., Rosenstiel, W., & Schölkopf, B. (2005). Robust EEG channel selection across subjects for brain computer interfaces. EURASIP Journal on Applied Signal Processing, 19, 3103-3112.

Sonnenburg, S., Rätsch, G., Schäfer, C., & Schölkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research, 7, 1531-1565.

Szafranski, M., Grandvalet, Y., & Morizet-Mahoudeaux, P. (2008). Hierarchical penalization. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in Neural Information Processing Systems 20. MIT Press.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.

Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2001). Feature selection for SVMs. Advances in Neural Information Processing Systems 13 (pp. 668-674). MIT Press.

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68, 49-67.

Zhao, P., Rocha, G., & Yu, B. (to appear). The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics.