Multiple Indefinite Kernel Learning with Mixed Norm Regularization

Matthieu Kowalski¹   matthieu.kowalski@cmi.univ-mrs.fr
Marie Szafranski²   marie.szafranski@lif.univ-mrs.fr
Liva Ralaivola²   liva.ralaivola@lif.univ-mrs.fr

¹ LATP - UMR CNRS 6166, ² LIF - UMR CNRS 6632, Université de Provence, 13453 Marseille Cedex 13, France

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We address the problem of learning classifiers using several kernel functions. Contrary to many contributions in the field of learning from different sources of information using kernels, we do not assume here that the kernels used are positive definite. The learning problem that we are interested in involves a misclassification loss term and a regularization term that is expressed by means of a mixed norm. The use of a mixed norm allows us to enforce some sparsity structure, a particular case of which is, for instance, the Group Lasso. We solve the convex problem by employing proximal minimization algorithms, which can be viewed as refined versions of gradient descent procedures capable of naturally dealing with nondifferentiability. A numerical simulation on a UCI dataset shows the modularity of our approach.

1. Introduction

Lately, there has been much attention paid to the problem of learning from multiple sources. This amount of work has been mainly spurred by new problems stemming from, e.g., bioinformatics or multimedia processing. The main line of approaches for this learning situation is that of Multiple Kernel Learning (MKL), first initiated by Lanckriet et al. (2004), where the information provided by each data source at hand is encoded by means of a Mercer kernel. We address the problem of learning multiple indefinite kernel classifiers, where the kernels used to learn are not necessarily Mercer kernels. Our main motivation is that, while Mercer kernels exhibit many interesting mathematical properties that make them particularly suitable to work with, encoding knowledge in terms of a positive definite kernel is not always possible.

The idea of making use of several kernels is to take advantage of many sources of information, hoping that a reliable algorithm can single out the useful ones. Being able to identify the relevant information in terms of data or kernels is also very important. To achieve this task, we propose a formulation of the learning problem that makes use of mixed norms as a regularizing tool. Mixed norms allow us to impose some kind of structure on the data and the kernels that we use, and we enforce our objective of automatically selecting the relevant information by using nondifferentiable, but still convex, mixed norms. Another way of viewing our approach is that of formalizing the problem of learning kernel classifiers as learning a representation of the data based on (data-dependent) dictionaries. This is a common approach in the signal processing community, where efficient algorithms exist to handle nondifferentiable minimization problems such as those we consider. We note that learning with several kernels is also closely related to the popular idea, from signal processing, of finding representations of data from unions of dictionaries.

The contributions of the present work are: a setting to learn multiple kernel classifiers with mixed norm regularization; a data-dependent bound on the generalization ability of the classifiers learned; and a learning algorithm that instantiates the idea of proximal optimization methods, which provides a framework to build refined versions of gradient descent algorithms capable of dealing with nondifferentiability.

The paper is organized as follows. Section 2 introduces the setting of learning multiple kernel classifiers with mixed norm regularization; insights as to why classifiers learned from the proposed setting should generalize well are given. Section 3 recalls the proximal optimization framework and derives the minimization algorithm to solve our learning problem. In Section 4, numerical simulations carried out on a dataset from the UCI repository show how the mixed norms can indeed induce the desired sparsity. Section 5 discusses how our approach is related to other MKL strategies.
2. MIKL and Mixed Norms

2.1. Notational Conventions

We use the following notation. Bold letters usually denote column vectors. Let $L \in \mathbb{N}$ and $M \in \mathbb{N}$; for a doubly indexed set of real coefficients $(\alpha_{\ell m})$, with $1 \le \ell \le L$ and $1 \le m \le M$, $\alpha_{\cdot m}$ is the column vector $\alpha_{\cdot m} = [\alpha_{1m} \cdots \alpha_{Lm}]^\top$, $\alpha_{\ell \cdot}$ is the column vector $\alpha_{\ell \cdot} = [\alpha_{\ell 1} \cdots \alpha_{\ell M}]^\top$, and $\alpha$ is the column vector $\alpha = [\alpha_{\cdot 1}^\top \cdots \alpha_{\cdot M}^\top]^\top$. $I(p)$ is such that $I(p) = 1$ if $p$ is true and $I(p) = 0$ otherwise. The hinge function is denoted by $|u|_+ = \max(u, 0)$ and, for any real vector $u$, $\|u\|_+^2$ stands for $\|u\|_+^2 = \sum_k |u_k|_+^2$.

2.2. Setting

We focus on the problem of supervised binary classification. We are interested in learning classifiers from a training sample $S = \{(x_i, y_i)\}_{i=1}^n$ of $n$ labeled pairs $(x_i, y_i)$ from the product space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{Y} = \{-1, +1\}$ is the set of labels and $\mathcal{X}$ is the input space. These pairs are the realizations of $n$ independent copies $(X_1, Y_1), \ldots, (X_n, Y_n)$ of a random labeled variable $(X, Y)$ distributed according to an unknown and fixed distribution $D$ on $\mathcal{Z}$. With a slight abuse of notation, $S$ will also denote the random sample $\{(X_i, Y_i)\}_{i=1}^n$.

The classifiers that we consider are multiple kernel classifiers where the kernels used are not necessarily positive definite: we consider the largest possible definition, and a kernel $k$ is merely an element of $\mathbb{R}^{\mathcal{X} \times \mathcal{X}}$. Throughout the paper, we consider that we have at hand a set $\mathcal{K} = \{k_1, \ldots, k_M\}$ of $M$ kernels, and multiple kernel classifiers are the signed versions of functions $f$ from the sample-dependent family $\mathcal{F}_S$ defined, for a training set $S = \{(x_i, y_i)\}_{i=1}^n$, as

$$\mathcal{F}_S = \Big\{ x \mapsto \sum_{i,t=1}^{n,M} \alpha_{it} k_t(x_i, x) \;:\; \alpha \in \mathbb{R}^{nM},\; k_t \in \mathcal{K} \Big\}. \qquad (1)$$

Thus, the output predicted for $x$ by $f \in \mathcal{F}_S$ is $\mathrm{sign}(f(x))$. With this setting, learning a classifier from $S$ comes down to the problem of finding a vector $\alpha$ that entails a low generalization error. To this end, we propose to set $\alpha$ as the solution of the following penalized optimization problem:¹

$$\min_{\alpha \in \mathbb{R}^{nM}} \sum_{i=1}^{n} \big|1 - y_i \alpha^\top k_i\big|_+^2 + \frac{\lambda}{q} \|\alpha\|_{pq;r}^q \qquad (2)$$

for $\lambda > 0$, $p, q \in \{1, 2\}$ and $r \in \{1, 2\}$. Here, $\alpha$ is the column vector associated with the doubly indexed set of coefficients $(\alpha_{it})_{1 \le i \le n, 1 \le t \le M}$, and $k_i$ is the $i$th column of $K^\top$, where

$$K = [K_1 \cdots K_M] \in \mathbb{R}^{n \times nM}, \qquad (3)$$

and $K_t$ is the matrix $K_t = (k_t(x_i, x_j))_{1 \le i, j \le n}$ associated with kernel $k_t$. The mixed norm $\|\cdot\|_{pq;1}$ is such that

$$\|\alpha\|_{pq;1} = \Big( \sum_{i=1}^{n} \Big( \sum_{t=1}^{M} |\alpha_{it}|^p \Big)^{q/p} \Big)^{1/q}, \qquad (4)$$

and $\|\cdot\|_{pq;2}$ is such that

$$\|\alpha\|_{pq;2} = \Big( \sum_{t=1}^{M} \Big( \sum_{i=1}^{n} |\alpha_{it}|^p \Big)^{q/p} \Big)^{1/q} \qquad (5)$$

(note that the order of summation has changed). As we discuss below, the choice of the values of $p$ and $q$, if one of them is set to 1, induces different sparsity structures for the solution of (2).

¹ A bias term is taken into account by adding a constant kernel $k_0$ to $\mathcal{K}$ such that $k_0(x, x') = 1$, $\forall x, x' \in \mathcal{X}$.

The first term of objective function (2) is the squared hinge loss, which is used by the 2-norm Support Vector Machines (Boser et al., 1992; Cortes & Vapnik, 1995). This loss is differentiable everywhere, a feature that will render some parts of the optimization procedure easy to derive. In addition, it is straightforward to see that the loss part is convex with respect to $\alpha$.
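To make definitions (4) and (5) and the objective of Problem (2) concrete, here is a small NumPy sketch. It is not part of the paper; the layout, an $n \times M$ array `A` with `A[i, t]` $= \alpha_{it}$, and the assumption that the kernel matrices are symmetric (so that the predictions on the training points can be written as $\sum_t K_t \alpha_{\cdot t}$) are our own choices for illustration.

```python
import numpy as np

def mixed_norm(A, p, q, r):
    """Mixed norm ||alpha||_{pq;r} of an n x M coefficient matrix A, with A[i, t] = alpha_{it}.

    r = 1: inner p-norm over kernels t (within each row), outer q-norm over examples i.
    r = 2: inner p-norm over examples i (within each column), outer q-norm over kernels t.
    """
    inner_axis = 1 if r == 1 else 0
    inner = np.linalg.norm(A, ord=p, axis=inner_axis)  # one value per group
    return np.linalg.norm(inner, ord=q)

def objective(A, K_blocks, y, lam, p, q, r):
    """Squared hinge loss plus (lam / q) * ||alpha||_{pq;r}^q, as in Problem (2).

    K_blocks is a list of n x n kernel matrices [K_1, ..., K_M] (possibly indefinite,
    assumed symmetric here so that f(x_i) = sum_t (K_t @ A[:, t])[i]).
    """
    f = sum(Kt @ A[:, t] for t, Kt in enumerate(K_blocks))  # predictions on the training points
    loss = np.sum(np.maximum(1.0 - y * f, 0.0) ** 2)        # squared hinge loss
    return loss + (lam / q) * mixed_norm(A, p, q, r) ** q
```

For instance, `mixed_norm(A, 2, 1, 2)` sums, over kernels, the $\ell_2$ norm of each kernel's coefficient block, which is the Group-Lasso-type penalty discussed below.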
The second term is a regularization part, which allows us to control the capacity of the class of classifiers considered. For the set of values that we consider for $p$ and $q$, namely $p, q \in \{1, 2\}$, the regularization part, i.e., the mixed norm raised to the $q$th power, is a convex function of $\alpha$; the resulting objective function is therefore convex in $\alpha$, since $\lambda > 0$. Note, however, that this objective function becomes nondifferentiable as soon as $p$ or $q$ is equal to 1, which is the situation of interest to us since such a choice induces sparsity on the optimal $\alpha$. This nondifferentiability is nicely handled by the proximal minimization algorithm that we derive.

Figure 1. Expected sparsity structure of $\alpha$ for the different norms $\|\cdot\|_{pq;r}$ used ((a) $\|\cdot\|_{12;1}$, (b) $\|\cdot\|_{21;1}$, (c) $\|\cdot\|_{12;2}$, (d) $\|\cdot\|_{21;2}$; each panel shows $\alpha$ against the kernel blocks $K_1$, $K_2$, $K_3$), where white squares in $\alpha$ correspond to 0 coefficients. When $r = 1$ ((a) and (b)), the sparseness is defined with respect to the $\alpha_{i\cdot}$ vectors, which are indicated with one color; the corresponding columns of $K$ that are used by the resulting function are shown using the same colors. When $r = 2$ ((c) and (d)), the sparseness is defined with respect to the kernels. See text for detailed explanations.

2.3. Expected Sparsity Structure

The minimization problems (2) that we focus on use mixed norms such that $p = 1$ or $q = 1$, whichever $r$. The reason why we retain this setting is that it may induce sparsity on the optimal $\alpha$. Such sparsity is useful from two different points of view. On the one hand, it may help identify which of the different kernels used are indeed informative for the problem at hand by, e.g., setting all the coefficients of $\alpha$ related to a specific kernel $k_t$ to 0, in which case $\alpha_{\cdot t} = 0$, or by setting all the coefficients of $\alpha$ related to one $x_i$ to 0, which corresponds to $\alpha_{i\cdot} = 0$. On the other hand, sparseness, or more precisely the use of an $\ell_1$-like penalization (which is the case if $p$ or $q$ is equal to 1), is useful to derive generalization bounds, as we show just below.

The structure of the sparsity of $\alpha$ depends not only on which of $p$ or $q$ is equal to 1, but also on the value of $r$. We may summarize the expected pattern of sparsity as follows.

If $p = q = 1$ then $\|\alpha\|_{pq;1} = \|\alpha\|_{pq;2}$ for all $\alpha$. A large number of coefficients $\alpha_{it}$ are expected to be 0, as with the Lasso (Tibshirani, 1996).

If $r = 1$, the sparseness is related to the $x_i$'s. If $p = 1$ and $q = 2$, each $x_i$ only uses a limited number of kernels; in other words, the $\alpha_{i\cdot}$'s are sparse (see Figure 1(a)). If $p = 2$ and $q = 1$, then we may expect several $\alpha_{i\cdot}$ to be zero, meaning that the kernel expansion of the decision function is based on few training vectors $x_i$ only; these vectors may be thought of as 'support vectors' (see Figure 1(b)).

If $r = 2$, the sparseness is related to the kernels used. If $p = 1$ and $q = 2$, then the vectors $\alpha_{\cdot t}$ are expected to be sparse and only few $x_i$'s are activated per kernel (see Figure 1(c)). If $p = 2$ and $q = 1$, then some kernels are expected to be discarded and not used in the decision function learned: some vectors $\alpha_{\cdot t}$ are expected to be 0 (see Figure 1(d)).

For $r \in \{1, 2\}$, $\|\cdot\|_{12;r}$ is related to the Elitist-Lasso of Kowalski and Torrésani (2008), while $\|\cdot\|_{21;r}$ is related to the Group-Lasso of Yuan and Lin (2006).
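As a companion to Figure 1, a small helper (ours, not from the paper) reads off which kernels and which training points remain active in a fitted coefficient matrix, using the group norms $\|\alpha_{\cdot t}\|$ and $\|\alpha_{i\cdot}\|$ discussed above.

```python
import numpy as np

def sparsity_pattern(A, tol=1e-8):
    """Summarize the sparsity structure of an n x M coefficient matrix A, A[i, t] = alpha_{it}.

    Returns the kernels t with ||alpha_{.t}|| ~ 0 (discarded kernels) and the
    examples i with ||alpha_{i.}|| ~ 0 (points that are not 'support vectors').
    """
    kernel_norms = np.linalg.norm(A, axis=0)    # ||alpha_{.t}||_2, one value per kernel
    example_norms = np.linalg.norm(A, axis=1)   # ||alpha_{i.}||_2, one value per training point
    return np.flatnonzero(kernel_norms <= tol), np.flatnonzero(example_norms <= tol)
```

With $\|\cdot\|_{21;2}$ one expects the first list to be nonempty, with $\|\cdot\|_{21;1}$ the second, and with the plain $\ell_1$ norm scattered zeros inside `A` itself.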
2.4. A Data-Dependent Generalization Bound

Here, we give insights as to why a classifier learned by solving (2) may generalize, and we provide a data-dependent bound on the generalization error for such a classifier. This bound relies on a recent and elegant result about the generalization ability of classifiers drawn from sample-dependent classes presented by Ben-David et al. (2008), a particular case of which focuses on a generalized notion of Rademacher complexity that we recall here.

Definition 1 (Ben-David et al. (2008)). The Rademacher complexity $R_n(\mathcal{F}_S)$ of a sample-dependent hypothesis class $\mathcal{F}_S$ is defined as

$$R_n(\mathcal{F}_S) = \sup_{S = \{(x_i, y_i)\}_{i=1}^n} \mathbb{E}_\sigma \Big[ \sup_{f \in \mathcal{F}_S} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \Big],$$

where $\sigma$ is a vector of $n$ independent Rademacher variables, i.e., $\mathbb{P}(\sigma_i = 1) = \mathbb{P}(\sigma_i = -1) = \frac{1}{2}$, $i = 1, \ldots, n$.

Ben-David et al. (2008) provide the following result.

Theorem 1. Assume that, for all samples $S, S' \in \bigcup_{i=1}^{\infty} \mathcal{Z}^i$, $S \subseteq S' \Rightarrow \mathcal{F}_S \subseteq \mathcal{F}_{S'}$. For all distributions $D$ on $\mathcal{Z}$, $\forall n$, $\forall \delta > 0$, with probability at least $1 - \delta$ over the random draw of $S = \{(X_i, Y_i)\}_{i=1}^n$, the following holds: $\forall f \in \mathcal{F}_S$,

$$\mathbb{P}_D(Y f(X) \le 0) \le \hat{\mathbb{E}}_n \ell(Y f(X)) + 16\, R_{2n}(\mathcal{F}_S) + \sqrt{\frac{\log 1/\delta}{2n}},$$

where $\ell(\gamma) = \min(|1 - \gamma|_+^2, 1)$ is the clipped squared hinge loss and $\hat{\mathbb{E}}_n \ell(Y f(X)) = \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i f(X_i))$.

Note that we have slightly modified the result of Ben-David et al. (2008) so that it takes into account the squared hinge loss. To do so, we have used a structural result on the Rademacher complexity of classes of composite functions given by Bartlett and Mendelson (2002).

Let us now consider that $p$, $q$, $r$ and $\beta$ are fixed. We define the set $A_n(\beta) \subset \mathbb{R}^{nM}$ as $A_n(\beta) = \{\alpha : \alpha \in \mathbb{R}^{nM}, \|\alpha\|_{pq;r} \le \beta\}$ and the sample-dependent hypothesis class $\mathcal{F}_S(\beta)$ as, for $S = \{(x_i, y_i)\}_{i=1}^n$,

$$\mathcal{F}_S(\beta) = \Big\{ x \mapsto \sum_{i,t=1}^{n,M} \alpha_{it} k_t(x_i, x) \;:\; \alpha \in A_n(\beta) \Big\}.$$

Theorem 1 applies as soon as an upper bound on the sample-dependent Rademacher complexity of the hypothesis class under consideration can be computed. A bound on $R_{2n}(\mathcal{F}_S(\beta))$ therefore suffices to bound the generalization error of the classifier learned through (2). The following proposition provides such a bound.

Proposition 1. $\forall \beta > 0$, $\forall n \in \mathbb{N}$,

$$R_{2n}(\mathcal{F}_S(\beta)) \le \sup_{S = \{(x_i, y_i)\}_{i=1}^{2n}} \mathbb{E}_\sigma \Big[ \frac{\beta}{2n} \big\| K_S^\top \sigma \big\|_{p^* q^*; r} \Big],$$

where $K_S$ is defined as in (3) with respect to the $2n$-sample $S$, and $\frac{1}{p} + \frac{1}{p^*} = 1$, $\frac{1}{q} + \frac{1}{q^*} = 1$.

Proof. Assume that $S$ is a fixed sample of size $2n$ and $\sigma$ a fixed vector of $\{-1, +1\}^{2n}$. Then

$$\sup_{f \in \mathcal{F}_S(\beta)} \sum_{i=1}^{2n} \sigma_i f(x_i) = \sup_{\alpha \in A_{2n}(\beta)} \sum_{i=1}^{2n} \sigma_i \sum_{j,t=1}^{2n,M} \alpha_{jt} k_t(x_j, x_i) = \sup_{\alpha : \|\alpha\|_{pq;r} \le \beta} \alpha^\top \big( K_S^\top \sigma \big) \le \beta \big\| K_S^\top \sigma \big\|_{p^* q^*; r},$$

where Hölder's inequality has been used twice (once on the inner index and once on the outer index of the mixed norm) to get the last inequality. Dividing by $2n$, taking the expectation with respect to $\sigma$ and then the supremum over $S$ ends the proof.
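The quantity bounded in Proposition 1 is easy to estimate for a given sample by drawing Rademacher vectors. The sketch below is our own illustration (not from the paper); it reuses the $n \times M$ layout of the earlier snippets and assumes symmetric kernel matrices, so that $K_t^\top \sigma = K_t \sigma$.

```python
import numpy as np

def complexity_term(K_blocks, beta, p_star, q_star, r, n_draws=200, seed=0):
    """Monte Carlo estimate of E_sigma[(beta / (2n)) * ||K_S^T sigma||_{p*q*;r}] (Proposition 1).

    K_blocks: list of 2n x 2n kernel matrices built on a sample of size 2n,
    assumed symmetric so that K_t^T sigma = K_t sigma.
    """
    rng = np.random.default_rng(seed)
    two_n = K_blocks[0].shape[0]
    inner_axis = 1 if r == 1 else 0
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=two_n)
        V = np.column_stack([Kt @ sigma for Kt in K_blocks])  # V[i, t] = sum_j k_t(x_i, x_j) sigma_j
        inner = np.linalg.norm(V, ord=p_star, axis=inner_axis)
        total += np.linalg.norm(inner, ord=q_star)
    return beta * total / (n_draws * two_n)
```

For example, with $(p, q, r) = (1, 1, 1)$ the dual exponents are `p_star = q_star = np.inf`, and for bounded kernels the estimate should decay roughly like $\sqrt{\ln n / n}$ as the sample grows, in line with Proposition 2 below.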
This allows us to state the following theorem.

Theorem 2. For all distributions $D$ on $\mathcal{Z}$, $\forall n$, $\forall \delta > 0$, $\forall \beta > 0$, with probability at least $1 - \delta$ over the random draw of $S = \{(X_i, Y_i)\}_{i=1}^n$, $\forall f \in \mathcal{F}_S(\beta)$,

$$\mathbb{P}_D(Y f(X) \le 0) \le \hat{\mathbb{E}}_n \ell(Y f(X)) + 16 \sup_{S_{2n}} \mathbb{E}_\sigma \Big[ \frac{\beta}{2n} \big\| K_{S_{2n}}^\top \sigma \big\|_{p^* q^*; r} \Big] + \sqrt{\frac{\log 1/\delta}{2n}},$$

where $S_{2n}$ denotes a sample of size $2n$.

Proof. Straightforward using Theorem 1, Proposition 1 and noting that $S \subseteq S' \Rightarrow \mathcal{F}_S(\beta) \subseteq \mathcal{F}_{S'}(\beta)$.

This theorem is useful as soon as the kernels used imply $\sup_{S_{2n}} \mathbb{E}_\sigma \frac{1}{2n} \|K_{S_{2n}}^\top \sigma\|_{p^* q^*; r} = O(n^{-\gamma})$ for some $\gamma > 0$. The following proposition gives an example of a simple condition on the kernels used to be in that situation for some values of $p^*$, $q^*$ and $r$.

Proposition 2. Let $D$ be a distribution on $\mathcal{Z}$. If $\exists K \in \mathbb{R}$ such that $\mathbb{P}(k_t(X, X') \le K) = 1$, $t = 1, \ldots, M$, then

$$\sup_{S_{2n}} \mathbb{E}_\sigma \frac{1}{2n} \big\| K_{S_{2n}}^\top \sigma \big\|_{p^* q^*; r} \le K M \sqrt{\frac{\ln 4n}{n}},$$

for $(p^*, q^*, r) \in \{(\infty, \infty, 1), (\infty, \infty, 2), (\infty, 2, 1), (2, \infty, 2)\}$, i.e., for $(p, q, r) \in \{(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 1, 2)\}$.

Proof. We just give the proof for $p^* = 2$, $q^* = \infty$ and $r = 1$, i.e., $p = 2$, $q = 1$ and $r = 1$; the other cases can be obtained along the same lines. Here, $S_{2n}$ denotes an i.i.d. sample of size $2n$; the dimensions of $K_{S_{2n}}$ and $\sigma$ follow accordingly. Noting that $\|\alpha\|_{pq;1} = \big(\sum_i \|\alpha_{i\cdot}\|_p^q\big)^{1/q}$ for any vector $\alpha$, we have, dropping the $2n$ subscript for the sake of clarity:

$$\big\| K^\top \sigma \big\|_{2\infty;1} = \sup_{1 \le i \le 2n} \big\| (K^\top \sigma)_{i\cdot} \big\|_2 \le \sup_{1 \le i \le 2n} \sum_{t=1}^{M} \Big| \sum_{j=1}^{2n} k_t(x_i, x_j) \sigma_j \Big| \le \sum_{t=1}^{M} \sup_{1 \le i \le 2n} \Big| \sum_{j=1}^{2n} k_t(x_i, x_j) \sigma_j \Big|,$$

where the first inequality uses $\|\cdot\|_2 \le \|\cdot\|_1$. Now, for fixed $t$, we can apply Massart's finite class lemma (see appendix) to the $2n$ $2n$-dimensional vectors $v_i = [k_t(x_i, x_1) \cdots k_t(x_i, x_{2n})]^\top$, each of norm $\|v_i\| \le K\sqrt{2n}$:

$$\mathbb{E}_\sigma \Big[ \frac{1}{2n} \sup_{1 \le i \le 2n} \Big| \sum_{j=1}^{2n} k_t(x_i, x_j) \sigma_j \Big| \Big] \le K \sqrt{\frac{\ln 4n}{n}},$$

which concludes the proof.

This proposition establishes a (simple) condition so that the bound of Theorem 2 displays a $1/\sqrt{n}$ factor only for specific values of $p$, $q$ and $r$. Finding a more general condition for such a factor to be present in the bound for any combination of $p$, $q$ and $r$ is the subject of ongoing research on our part; Besov spaces are a possible direction.

3. Algorithms

3.1. Proximal algorithms

This section synthetically describes the proximal framework used to solve Problem (2). Proximal algorithms deal with general problems taking the form of

$$\min_{\alpha} \; f_1(\alpha) + f_2(\alpha), \qquad (6)$$

where $f_1$ and $f_2$ are convex and lower semicontinuous functions. Solving this kind of problem relies on proximity operators, introduced by Moreau (1965). More details on the proximal framework can be found in the work of Combettes and Pesquet (2007).

Definition 2 (Proximity operator). Let $\Phi : \mathbb{R}^P \to \mathbb{R}$ be a lower semicontinuous, convex function. The proximity operator $\mathrm{prox}_\Phi : \mathbb{R}^P \to \mathbb{R}^P$ associated with $\Phi$ is given by

$$\mathrm{prox}_\Phi(u) = \operatorname*{argmin}_{\alpha \in \mathbb{R}^P} \frac{1}{2} \|u - \alpha\|_2^2 + \Phi(\alpha).$$

When $f_2$ is convex lower semicontinuous and $f_1$ is differentiable with $\nabla f_1$ $\kappa$-Lipschitz, Problem (6) can be solved with Algorithm 1. Combettes and Wajs (2005), and more recently Combettes and Pesquet (2007), show that this algorithm converges to a minimum of Problem (6).

Algorithm 1 Forward-backward proximal algorithm
  input: $K_y$, $\lambda$, $\eta$, with $\eta < 2/\kappa$
  initialize $\alpha^{(0)} \in \mathbb{R}^{nM}$ (for instance 0)
  repeat
    $\alpha^{(s+1)} = \mathrm{prox}_{\eta f_2}\big( \alpha^{(s)} - \eta \nabla f_1(\alpha^{(s)}) \big)$
  until convergence
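A generic forward-backward iteration such as Algorithm 1 is only a few lines of code. The sketch below is ours (with a fixed iteration budget instead of a convergence test); it takes the gradient of the smooth term $f_1$ and the proximity operator of the nonsmooth term $f_2$ as arguments, and the small usage example instantiates it on a Lasso-type instance of Problem (6) rather than on the paper's objective.

```python
import numpy as np

def forward_backward(grad_f1, prox_f2, alpha0, eta, n_iter=500):
    """Forward-backward iterations (Algorithm 1): alpha <- prox_{eta f2}(alpha - eta * grad_f1(alpha)).

    grad_f1(alpha): gradient of the smooth term; prox_f2(v, eta): proximity operator of eta * f2;
    eta: step size, to be chosen smaller than 2 / kappa with kappa the Lipschitz constant of grad_f1.
    """
    alpha = alpha0.copy()
    for _ in range(n_iter):
        alpha = prox_f2(alpha - eta * grad_f1(alpha), eta)
    return alpha

# Usage example on a Lasso-type problem: min_a 0.5 * ||b - X a||^2 + lam * ||a||_1.
def lasso_pieces(X, b, lam):
    grad_f1 = lambda a: X.T @ (X @ a - b)
    prox_f2 = lambda v, eta: np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)  # soft thresholding
    eta = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / kappa, hence below the 2 / kappa limit
    return grad_f1, prox_f2, eta
```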
3.2. Proximity operators

Here, we are interested in proximity operators related to mixed norms (Kowalski, 2008). In Problem (6), the mixed norm penalty $f_2(\alpha) = \lambda q^{-1} \|\alpha\|_{pq}^q$, with $p, q \ge 1$, is a convex lower semicontinuous function, nondifferentiable at 0. Furthermore, its proximity operator can be computed in closed form when $p, q \in \{1, 2\}$. We thus limit the study of proximity operators to the norms $\|\cdot\|_1$, $\|\cdot\|_2$, $\|\cdot\|_{12}$, $\|\cdot\|_{21}$.

Proposition 3 (Proximity operators for mixed norms). Let $u \in \mathbb{R}^P$ be indexed by $(\ell, m)$ and $\lambda > 0$. The proximity operators for $\|\cdot\|_{pq}$, with $p, q \in \{1, 2\}$, defined by

$$\hat{\alpha} = \mathrm{prox}_{\frac{\lambda}{q}\|\cdot\|_{p,q}^q}(u) = \operatorname*{argmin}_{\alpha \in \mathbb{R}^P} \frac{1}{2}\|u - \alpha\|_2^2 + \frac{\lambda}{q}\|\alpha\|_{p,q}^q,$$

are given coordinate-wise for each $(\ell, m)$ by:

· when $p = q = 1$, $\hat{\alpha}_{\ell m} = \mathrm{sign}(u_{\ell m}) \big(|u_{\ell m}| - \lambda\big)_+$, which is the well-known soft-thresholding operator;

· when $p = 2$ and $q = 2$, $\hat{\alpha}_{\ell m} = \dfrac{u_{\ell m}}{1 + \lambda}$;

· when $p = 2$ and $q = 1$, $\hat{\alpha}_{\ell m} = u_{\ell m}\Big(1 - \dfrac{\lambda}{\|u_{\ell\cdot}\|_2}\Big)_+$;

· when $p = 1$ and $q = 2$, $\hat{\alpha}_{\ell m} = \mathrm{sign}(u_{\ell m})\Big(|u_{\ell m}| - \dfrac{\lambda}{1 + \lambda M_\ell}\sum_{m'=1}^{M_\ell} \tilde{u}_{\ell, m'}\Big)_+$,

where $\tilde{u}_{\ell, m'}$ denotes the $|u_{\ell, m}|$ sorted in descending order for fixed $\ell$, and the quantity $M_\ell$ is the number such that

$$\tilde{u}_{\ell, M_\ell + 1} \le \lambda \sum_{m'=1}^{M_\ell + 1} \big(\tilde{u}_{\ell, m'} - \tilde{u}_{\ell, M_\ell + 1}\big) \quad \text{and} \quad \tilde{u}_{\ell, M_\ell} > \lambda \sum_{m'=1}^{M_\ell} \big(\tilde{u}_{\ell, m'} - \tilde{u}_{\ell, M_\ell}\big).$$

3.3. Solving Problem (2) with Proximal Optimization

The squared hinge loss can be restated in matrix form as $\|1 - K_y \alpha\|_+^2$, where $K_y = \mathrm{diag}([y_1, \ldots, y_n])\, K$. In the previous section, we have shown how to compute the proximity operators for the $\|\cdot\|_{pq;r}$ norms. Recall that $f_1(\alpha) = \|1 - K_y \alpha\|_+^2$ is differentiable with a $\kappa$-Lipschitz gradient, while $f_2(\alpha) = \lambda q^{-1} \|\alpha\|_{pq;r}^q$, with $p, q \in \{1, 2\}$, is a convex lower semicontinuous function, nondifferentiable at 0. Thus, we can use the forward-backward strategy given in Algorithm 1 to solve Problem (2). To do so, it suffices to compute $\nabla \|1 - K_y \alpha\|_+^2 = -2 K_y^\top [1 - K_y \alpha]_+$, which is $\kappa$-Lipschitz with $\kappa = 2\|K_y^\top K_y\|$. The resulting procedure for Problem (2) is given in Algorithm 2.

Algorithm 2 Forward-backward for the squared hinge loss
  input: $K_y$, $\lambda$, $\eta$, with $\eta < 2/\kappa$, $\kappa = 2\|K_y^\top K_y\|$
  initialize $\alpha^{(0)} \in \mathbb{R}^{nM}$
  repeat
    $\alpha^{(s+1)} = \mathrm{prox}_{\eta \frac{\lambda}{q}\|\cdot\|_{pq;r}^q}\big( \alpha^{(s)} + 2\eta K_y^\top [1 - K_y \alpha^{(s)}]_+ \big)$
  until convergence
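The following sketch puts Proposition 3 and Algorithm 2 together. It is our own illustration under the same assumptions as the earlier snippets (an $n \times M$ coefficient matrix, symmetric kernel blocks); the closed-form proximity operators follow the four cases of Proposition 3, and the names (`prox_mixed`, `algorithm2`, and so on) are ours.

```python
import numpy as np

def prox_l12_row(u, lam):
    """Prox of (lam / 2) * (sum_m |u_m|)^2 for one group (case p = 1, q = 2 of Proposition 3)."""
    abs_u = np.abs(u)
    srt = np.sort(abs_u)[::-1]                       # |u_m| sorted in descending order
    csum = np.cumsum(srt)
    m = np.arange(1, len(u) + 1)
    active = srt > lam * csum / (1.0 + lam * m)      # candidate surviving coefficients
    if not np.any(active):
        return np.zeros_like(u)
    M = np.max(np.flatnonzero(active)) + 1           # number of surviving coefficients
    tau = lam * csum[M - 1] / (1.0 + lam * M)        # data-dependent threshold
    return np.sign(u) * np.maximum(abs_u - tau, 0.0)

def prox_mixed(A, lam, p, q, r):
    """Proximity operator of (lam / q) * ||.||_{pq;r}^q applied to the n x M matrix A (Proposition 3)."""
    if p == 1 and q == 1:                            # soft thresholding, grouping irrelevant
        return np.sign(A) * np.maximum(np.abs(A) - lam, 0.0)
    if p == 2 and q == 2:                            # plain scaling, grouping irrelevant
        return A / (1.0 + lam)
    G = A if r == 1 else A.T                         # rows of G are the groups alpha_{i.} or alpha_{.t}
    if p == 2 and q == 1:                            # Group-Lasso-type shrinkage of whole groups
        norms = np.maximum(np.linalg.norm(G, axis=1, keepdims=True), 1e-12)
        out = G * np.maximum(1.0 - lam / norms, 0.0)
    else:                                            # p = 1, q = 2: elitist-Lasso, group by group
        out = np.vstack([prox_l12_row(row, lam) for row in G])
    return out if r == 1 else out.T

def algorithm2(K_blocks, y, lam, p, q, r, n_iter=1000):
    """Forward-backward for the squared hinge loss (Algorithm 2), with alpha stored as an n x M matrix."""
    n, M = K_blocks[0].shape[0], len(K_blocks)
    Ky = np.hstack([y[:, None] * Kt for Kt in K_blocks])     # K_y = diag(y) [K_1 ... K_M]
    eta = 1.9 / (2.0 * np.linalg.norm(Ky.T @ Ky, 2))         # eta < 2 / kappa, kappa = 2 ||Ky^T Ky||
    alpha = np.zeros(n * M)
    for _ in range(n_iter):
        grad = -2.0 * Ky.T @ np.maximum(1.0 - Ky @ alpha, 0.0)   # gradient of the squared hinge loss
        A = (alpha - eta * grad).reshape(M, n).T                 # gradient step, back to n x M layout
        alpha = prox_mixed(A, eta * lam, p, q, r).T.reshape(-1)  # proximal step on eta * f2
    return alpha.reshape(M, n).T
```

A call such as `algorithm2(K_blocks, y, lam=10.0, p=2, q=1, r=2)` then corresponds to the $\|\cdot\|_{21;2}$ setting used in the next section.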
4. The Good, the Bad, and the Ugly: a Numerical Illustration

In this section, our aim is to exhibit the effects of regularization when using a structure on kernels or data. The structure is introduced in Problem (2) by the mixed norms $\|\cdot\|_{pq;r}$, with $p, q \in \{1, 2\}$, as explained in Section 2. An in-depth study concerning the predictive performance obtained with actual indefinite kernels will be addressed in a longer version of this paper.

Here, we compare Algorithm 2 for different regularization terms, with regard to the sparsity behavior. The comparison is done on the Titanic dataset, provided by Gunnar Rätsch.² This binary classification problem consists of 150 training and 2051 test examples.

² http://ida.first.fraunhofer.de/projects/bench/

We have designed a global kernel matrix, composed of three kernels, chosen so that a classifier obtains Good, Bad, and Ugly performances, according to the state of the art. More precisely, we have $K = [K_G, K_B, K_U]$, where:

· $K_G$, a linear kernel, is the Good guy;
· $K_B$, a Gaussian kernel of width 0.1, is the Bad guy;
· $K_U$, a Gaussian kernel of width 100, is the Ugly guy.

As baseline performances, the lowest test errors achieved with Algorithm 2, using the $\|\cdot\|_2$ norm, with kernels $K_G$, $K_B$ and $K_U$ taken separately, are respectively 21.84%, 25.89% and 32.57%.

In Figure 2, we compare the influence of the different regularizations, with $K = [K_G, K_B, K_U]$. Here, the parameter $\lambda$ has been chosen by a 5-fold cross-validation procedure, among logarithmically spaced values ranging from 1 to $10^5$. The classification error rate, which is 22.92% for the $\|\cdot\|_{21;1}$ norm and 21.84% for the other norms, was computed on the test set.

· The use of the norm $\|\cdot\|_2$ does not single out either of $K_G$ or $K_B$. The Ugly kernel's coefficients are all smoothed towards zero. In accordance with the nature of the $\|\cdot\|_2$ penalization, there is no sparsity induced.

· Contrary to the $\|\cdot\|_2$ regularization, the $\|\cdot\|_1$ norm introduces sparsity everywhere. The most influential coefficients belong to $K_G$, and only few of them are nonzero. Even if many coefficients related to the Ugly kernel are nonzero, they still remain small in magnitude.

· The $\|\cdot\|_{21;2}$ norm, which focuses on the kernel structure, identifies the Good kernel quite well, giving the corresponding coefficients values close to one. Even though it is not discarded, an insignificant relevance is assigned to the Ugly kernel.

· The $\|\cdot\|_{12;2}$ penalization behaves similarly to the $\|\cdot\|_1$ norm. However, one can see in the Bad kernel that some coefficients have a higher importance, which is consistent with the nature of the norm. Indeed, it is expected that, within each relevant kernel, the penalization puts more weight on the most relevant elements.

· The $\|\cdot\|_{21;1}$ regularization is supposed to put emphasis on the most relevant observations, whatever the kernel, and to eliminate the others. In that sense, the remaining coefficients can be envisioned as support vectors. This is quite well illustrated in Figure 2.

· Finally, for all data, the $\|\cdot\|_{12;1}$ norm identifies the most significant kernels for the classification task. It is worth noting that there are few contiguous lines: for numerous observations, only one kernel is selected.

One can note that the Ugly kernel is involved in the solutions related to the $\|\cdot\|_1$ and $\|\cdot\|_{21;2}$ norms, which could appear inconsistent. An insight concerning the presence of the Ugly kernel is that $\lambda$ was chosen through cross-validation based on the generalization error. As Leng et al. (2004) showed for the Lasso, this might not be optimal in terms of kernel selection. A slight increase of $\lambda$ allowed us to discard the Ugly kernel (when using the $\|\cdot\|_{21;2}$ norm), or to significantly reduce its influence (when using the $\|\cdot\|_1$ norm).
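For reference, the three kernel blocks of this experiment are easy to rebuild. The sketch below is our reading of the description above; in particular, interpreting the Gaussian 'width' as the bandwidth $\sigma$ in $\exp(-\|x - x'\|^2 / (2\sigma^2))$ is an assumption on our part, and any feature preprocessing is left out.

```python
import numpy as np

def linear_kernel(X):
    return X @ X.T

def gaussian_kernel(X, width):
    # 'width' is read here as the bandwidth sigma of exp(-||x - x'||^2 / (2 sigma^2)) (our assumption).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * width ** 2))

def good_bad_ugly(X):
    """K = [K_G, K_B, K_U]: a linear kernel, a Gaussian kernel of width 0.1, one of width 100."""
    return [linear_kernel(X), gaussian_kernel(X, 0.1), gaussian_kernel(X, 100.0)]
```

These blocks can be fed directly to the `algorithm2` sketch of the previous section; the constant bias kernel $k_0$ of footnote 1 would simply be `np.ones((n, n))` appended to the list.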
5. Discussion

The formulation of the MKL problem by Bach et al. (2004) may be seen as the kernelization of the Group-Lasso, considering penalties on elements $f_t$ from several RKHS $\mathcal{H}_t$ in a standard SVM problem. Rakotomamonjy et al. (2007) tackled the dual formulation. It consists of explicitly optimizing a convex combination of kernels, which defines the actual SVM kernel $K(x, x') = \sum_{t=1}^{M} \theta_t K_t(x, x')$, where $K_t$ is the reproducing kernel of $\mathcal{H}_t$ and $\theta_t$ are the coefficients to be learned under an $\ell_1$ constraint.

MKL involves a kernel which is a convex combination of candidate kernels, where the coefficients of the less relevant ones are shrunk towards zero. In that sense, using $\|\cdot\|_{21;2}$ in Problem (2) is closely related to MKL, as it induces sparsity on the kernels. We may note that MKL not only enforces sparsity in kernels but also with respect to data, since it essentially is an SVM problem and thus produces (few) support vectors. To achieve such a joint sparsity in our framework, we would have to sacrifice the convexity of $\|\cdot\|_{pq;r}$ by choosing $p, q < 1$. Another difference with MKL is that we do not have any notion of 'norm' of $f$; instead, we control the (mixed) norm of the synthesis coefficients $\alpha$ in the frame generated by the kernels. This perspective is closely related to the idea of expansion with respect to a dictionary of $\ast$-lets (such as wavelets) in signal processing.

6. Conclusion and Further Work

We have proposed a mixed norm setting to learn multiple kernel classifiers. Contrary to a common assumption on kernel positive definiteness, our framework remains valid when using indefinite kernels. The learning problem that we tackle can be formulated as a convex optimization problem, and the desired sparsity structure can be enforced by the choice of the mixed norm used, at the price of rendering the optimization problem nondifferentiable. To cope with this nondifferentiability, we derive an optimization algorithm stemming from the proximal framework. Simulations showing the modularity of our approach are provided.

This work raises several open problems. First, we would like to provide more extensive numerical simulations, especially with indefinite or nonsymmetric kernels. The primary application we have in mind is image categorization, where the kernels are based on Kullback-Leibler divergences between probability density functions derived from wavelet decompositions (Piro et al., 2008). We also plan to investigate the possibility of analytically computing the regularization path for $\lambda$. Finally, we have been working on two extensions of our optimization procedure, namely (a) a chunking strategy (Osuna & Girosi, 1997) to make a better use of the resulting sparseness and (b) the adaptation of the latest advances in convex optimization introduced by Nesterov (2007) to our learning problem.

Acknowledgments

MS and LR are partially supported by the IST Program of the European Community, under the FP7 Pascal 2 Network of Excellence ICT-216886-NOE.

Appendix

Theorem 3 (Hölder's inequality). Let $p, q \ge 1$ and $n \in \mathbb{N}$. If $\frac{1}{p} + \frac{1}{q} = 1$ then

$$\forall u, v \in \mathbb{R}^n, \quad \sum_{i=1}^{n} |u_i v_i| \le \Big( \sum_{i=1}^{n} |u_i|^p \Big)^{1/p} \Big( \sum_{i=1}^{n} |v_i|^q \Big)^{1/q}.$$

Lemma 1 (Massart's finite class lemma). Let $A$ be a finite subset of $\mathbb{R}^n$, with each vector $x = [x_1 \cdots x_n]^\top$ in $A$ having norm bounded by $r = \max_{x \in A} \|x\|_2$. If $\sigma$ is an $n$-dimensional vector of independent Rademacher variables, then

$$\mathbb{E}_\sigma \Big[ \sup_{x \in A} \frac{1}{n} \Big| \sum_{i=1}^{n} x_i \sigma_i \Big| \Big] \le \frac{r \sqrt{2 \ln 2|A|}}{n}.$$

(We have slightly changed the statement from its original form to take the absolute value into account.)

References

Bach, F. R., Lanckriet, G. R. G., & Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. Proc. 21st Int. Conf. Mac. Learn. (ICML 2004) (pp. 41-48).

Bartlett, P. L., & Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mac. Learn. Res., 3, 463-482.

Ben-David, S., Rahimi, A., & Srebro, N. (2008). Generalization bounds for indefinite kernel machines. NIPS*08 Workshop: New Challenges in Theoretical Machine Learning.

Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. Proc. 5th Workshop Comp. Learn. Theory.

Combettes, P. L., & Pesquet, J.-C. (2007). A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery. IEEE J. Select. Top. Sig. Proc., 1, 564-574.

Combettes, P. L., & Wajs, V. R. (2005). Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4, 1168-1200.

Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 1-25.

Kowalski, M. (2008). Sparse regression using mixed norms (Technical Report). LATP, Marseille, France. http://hal.archives-ouvertes.fr/hal-00202904/.

Kowalski, M., & Torrésani, B. (2008). Sparsity and persistence: mixed norms provide simple signal models with dependent coefficients. Signal, Image and Video Processing. doi:10.1007/s11760-008-0076-1.

Lanckriet, G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. (2004). Learning the kernel matrix with semidefinite programming. J. Mac. Learn. Res., 5, 27-72.

Leng, C., Lin, Y., & Wahba, G. (2004). A note on lasso and related procedures in model selection. Statistica Sinica, 16, 1273-1284.

Moreau, J.-J. (1965). Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France, 93, 273-299.

Nesterov, Y. (2007). Gradient methods for minimizing composite objective function (Technical Report). Université Catholique de Louvain. CORE discussion paper.

Osuna, E., & Girosi, F. (1997). Improved training algorithm for support vector machines. Neural Networks for Signal Processing (pp. 276-285). IEEE Press.

Piro, P., Anthoine, S., Debreuve, E., & Barlaud, M. (2008). Sparse multiscale patches for image processing. Emerging Trends in Visual Computing (pp. 284-304).

Rakotomamonjy, A., Bach, F., Canu, S., & Grandvalet, Y. (2007). More efficiency in multiple kernel learning. Proc. 24th Int. Conf. Mac. Learn. (ICML 2007) (pp. 775-782). Omnipress.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B, 58, 267-288.

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. Roy. Stat. Soc. B, 68, 49-67.

Figure 2. Relevance maps of $\alpha_{\cdot t}$, with $t \in \{G, B, U\}$, for different norms. The coefficients have been normalized so that $\alpha_{it} \in [0, 1]$. Top: no structure defined ($\|\cdot\|_2$, $\|\cdot\|_1$). Middle: structure defined with respect to the kernels ($\|\cdot\|_{21;2}$, $\|\cdot\|_{12;2}$). Bottom: structure defined with respect to the data ($\|\cdot\|_{21;1}$, $\|\cdot\|_{12;1}$).