Two-Stage Learning Kernel Algorithms

Corinna Cortes    CORINNA@GOOGLE.COM
Google Research, 76 Ninth Avenue, New York, NY 10011.

Mehryar Mohri    MOHRI@CIMS.NYU.EDU
Courant Institute of Mathematical Sciences and Google Research, 251 Mercer Street, New York, NY 10012.

Afshin Rostamizadeh    ROSTAMI@CS.NYU.EDU
Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

This paper examines two-stage techniques for learning kernels based on a notion of alignment. It presents a number of novel theoretical, algorithmic, and empirical results for alignment-based techniques. Our results build on previous work by Cristianini et al. (2001), but we adopt a different definition of kernel alignment and significantly extend that work in several directions: we give a novel and simple concentration bound for alignment between kernel matrices; show the existence of good predictors for kernels with high alignment, both for classification and for regression; give algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP; and report the results of extensive experiments with this alignment-based method in classification and regression tasks, which show an improvement both over the uniform combination of kernels and over other state-of-the-art learning kernel methods.
1. Introduction

Kernel-based algorithms have been used with great success in a variety of machine learning applications (Shawe-Taylor & Cristianini, 2004). But the choice of the kernel, which is crucial to the success of these algorithms, has traditionally been left entirely to the user. Rather than requesting the user to select a specific kernel, learning kernel algorithms require the user only to specify a family of kernels. This family of kernels can be used by a learning algorithm to form a combined kernel and derive an accurate predictor. This is a problem that has attracted a lot of attention recently, both from the theoretical point of view and from the algorithmic, optimization, and application perspective.

Different kernel families have been studied in the past, including hyperkernels (Ong et al., 2005), Gaussian kernel families (Micchelli & Pontil, 2005), or non-linear families (Bach, 2008; Cortes et al., 2009b). Here, we consider more specifically a convex combination of a finite number of kernels, as in much of the previous work in this area. On the theoretical side, a number of favorable guarantees have been derived for learning kernels with convex combinations (Srebro & Ben-David, 2006; Cortes et al., 2009a), including a recent result of Cortes et al. (2010) which gives a margin bound for L1 regularization with only a logarithmic dependency on the number of kernels p:

  R(h) \le \widehat{R}_\rho(h) + O\Big(\sqrt{(R^2/\rho^2)(\log p)/m}\Big).

Here, R denotes the radius of the sphere containing the data, ρ the margin, and m the sample size. In contrast, the results obtained for learning kernels in applications have been in general rather disappointing. In particular, achieving a performance superior to that of the uniform combination of kernels, the simplest approach requiring no additional learning, has proven to be surprisingly difficult (Cortes, 2009). Most of the techniques used in these applications for learning kernels are based on the same natural one-stage method, which consists of minimizing an objective function both with respect to the kernel combination parameters and the hypothesis chosen, as formulated by Lanckriet et al. (2004).

This paper explores a two-stage technique and algorithm for learning kernels. The first stage of this technique consists of learning a kernel K that is a convex combination of p kernels. The second stage consists of using K with a standard kernel-based learning algorithm such as support vector machines (SVMs) (Cortes & Vapnik, 1995) for classification, or KRR for regression, to select a prediction hypothesis. With this two-stage method we obtain better performance than with the one-stage methods on several datasets. Note that an alternative two-stage technique consists of first learning a prediction hypothesis h_k using each kernel K_k, and then learning the best linear combination of these hypotheses. But such ensemble-based techniques make use of a richer hypothesis space than the one used by learning kernel algorithms such as (Lanckriet et al., 2004).

Different methods can be used to determine the convex combination parameters defining K from the training sample. A measure of similarity between the base kernels K_k, k ∈ [1, p], and the target kernel K_Y derived from the labels can be used to determine these parameters. This can be done by using either the individual similarity of each kernel K_k with K_Y, or globally, from the similarity between convex combinations of the base kernels and K_Y. The similarities we consider are based on the natural notion of kernel alignment introduced by Cristianini et al. (2001), though our definition differs from the original one. We note that other measures of similarity could be used in this context. In particular, the notion of similarity suggested by Balcan & Blum (2006) could be used if it could be computed from finite samples.

We present a number of novel theoretical, algorithmic, and empirical results for the alignment-based two-stage techniques. Our results build on previous work by Cristianini et al. (2001; 2002); Kandola et al. (2002a), but we significantly extend that work in several directions. We discuss the original definitions of kernel alignment by these authors and adopt a related but different definition (Section 2). We give a novel concentration bound showing that the difference between the alignment of two kernel matrices and the alignment of the corresponding kernel functions can be bounded by a term in O(1/√m) (Section 3). Our result is simpler and directly bounds the difference between the relevant quantities, unlike previous work. We also show the existence of good predictors for kernels with high alignment, both for classification and for regression. These results correct a technical problem in classification and extend to regression the bounds of Cristianini et al. (2001). In Section 4, we also give an algorithm for learning a maximum alignment kernel. We prove that the mixture coefficients can be obtained efficiently by solving a simple quadratic program (QP) in the case of a convex combination, and even give a closed-form solution for them in the case of an arbitrary linear combination. Finally, in Section 5, we report the results of extensive experiments with this alignment-based method both in classification and regression, and compare our results with L1 and L2 regularized learning kernel algorithms (Lanckriet et al., 2004; Cortes et al., 2009a), as well as with the uniform kernel combination method. The results show an improvement both over the uniform combination and over the one-stage kernel learning algorithms in all datasets. We also observe a strong correlation between the alignment achieved and performance.
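To make the two-stage pipeline concrete, here is a minimal sketch (not the authors' code; it assumes scikit-learn and precomputed Gram matrices, and the helper names are hypothetical). The mixture weights mu produced by the first stage are used to form K_mu, which is then handed to a standard SVM in the second stage; the weight-selection methods themselves are described in Section 4.

```python
# Minimal sketch of the two-stage approach with precomputed Gram matrices.
# base_train[k]: k-th base kernel on training points (m x m);
# base_test[k]:  same kernel between test and training points (n_test x m);
# mu:            mixture weights produced by the first stage (see Section 4).
from sklearn.svm import SVC

def combine_kernels(base_kernels, mu):
    """Form K_mu = sum_k mu_k K_k from a list of Gram matrices."""
    return sum(m_k * K_k for m_k, K_k in zip(mu, base_kernels))

def two_stage_svm(base_train, base_test, y_train, mu, C=1.0):
    K_train = combine_kernels(base_train, mu)
    K_test = combine_kernels(base_test, mu)
    clf = SVC(kernel="precomputed", C=C)   # second stage: standard kernel learner
    clf.fit(K_train, y_train)
    return clf.predict(K_test)
```

For regression, the same skeleton applies with a kernel ridge regression learner in place of the SVM.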
2. Alignment definitions

The notion of kernel alignment was first introduced by Cristianini et al. (2001). Our definition of kernel alignment is different and is based on the notion of centering in the feature space. Thus, we start with the definition of centering and the analysis of its relevant properties.

2.1. Centering kernels

Let D be the distribution according to which training and test points are drawn. Centering a feature mapping Φ: X → H consists of replacing it by Φ − E_x[Φ], where E_x denotes the expected value of Φ when x is drawn according to the distribution D. Centering a positive definite symmetric (PDS) kernel function K: X × X → R consists of centering any feature mapping Φ associated to K. Thus, the centered kernel K_c associated to K is defined for all x, x' ∈ X by

  K_c(x,x') = (\Phi(x) - \mathrm{E}_x[\Phi])^\top (\Phi(x') - \mathrm{E}_{x'}[\Phi])
            = K(x,x') - \mathrm{E}_x[K(x,x')] - \mathrm{E}_{x'}[K(x,x')] + \mathrm{E}_{x,x'}[K(x,x')].

This also shows that the definition does not depend on the choice of the feature mapping associated to K. Since K_c(x,x') is defined as an inner product, K_c is also a PDS kernel. Note also that for a centered kernel K_c, E_{x,x'}[K_c(x,x')] = 0. That is, centering the feature mapping implies centering the kernel function.

Similar definitions can be given for a finite sample S = (x_1, ..., x_m) drawn according to D: a feature vector Φ(x_i) with i ∈ [1, m] is then centered by replacing it with Φ(x_i) − Φ̄, with Φ̄ = (1/m) Σ_{i=1}^m Φ(x_i), and the kernel matrix K associated to K and the sample S is centered by replacing it with K_c defined for all i, j ∈ [1, m] by

  [K_c]_{ij} = K_{ij} - \frac{1}{m}\sum_{i=1}^m K_{ij} - \frac{1}{m}\sum_{j=1}^m K_{ij} + \frac{1}{m^2}\sum_{i,j=1}^m K_{ij}.

Let Φ = [Φ(x_1), ..., Φ(x_m)]^⊤ and Φ̄ = [Φ̄, ..., Φ̄]^⊤. Then, it is not hard to verify that K_c = (Φ − Φ̄)(Φ − Φ̄)^⊤, which shows that K_c is a positive semi-definite (PSD) matrix. Also, as with the kernel function, (1/m²) Σ_{i,j=1}^m [K_c]_{ij} = 0.
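As a small illustration (a sketch, not part of the paper), the matrix centering operation above can be implemented in a few lines of NumPy using K_c = U_m K U_m with U_m = I − 11^⊤/m, which is equivalent to the entrywise formula for [K_c]_{ij}.

```python
import numpy as np

def center_kernel_matrix(K):
    """Center an m x m kernel matrix in feature space:
    [K_c]_ij = K_ij - mean_i K_ij - mean_j K_ij + mean_ij K_ij."""
    m = K.shape[0]
    U = np.eye(m) - np.ones((m, m)) / m   # U_m = I - 11^T / m
    return U @ K @ U

# Sanity check: the entries of a centered kernel matrix sum to zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = X @ X.T
Kc = center_kernel_matrix(K)
print(abs(Kc.sum()) < 1e-10)   # True
```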
2.2. Kernel alignment

We define the alignment of two kernel functions as follows.

Definition 1. Let K and K' be two kernel functions defined over X × X such that 0 < E[K_c²] < +∞ and 0 < E[K'_c²] < +∞. Then, the alignment between K and K' is defined by

  \rho(K,K') = \frac{\mathrm{E}[K_c K'_c]}{\sqrt{\mathrm{E}[K_c^2]\,\mathrm{E}[K_c'^2]}}.

In the absence of ambiguity, to abbreviate the notation, we often omit the variables over which an expectation is taken. Since |E[K_c K'_c]| ≤ √(E[K_c²] E[K'_c²]) by the Cauchy-Schwarz inequality, we have ρ(K,K') ∈ [−1, 1]. The following lemma shows more precisely that ρ(K,K') ∈ [0, 1] when K_c and K'_c are PDS kernels. We denote by ⟨·,·⟩_F the Frobenius product and by ‖·‖_F the Frobenius norm.

Lemma 1. For any two PDS kernels Q and Q', E[QQ'] ≥ 0.

Proof. Let Φ be a feature mapping associated to Q and Φ' a feature mapping associated to Q'. By definition of Φ and Φ', and using the properties of the trace, we can write:

  \mathrm{E}_{x,x'}[Q(x,x')\,Q'(x,x')] = \mathrm{E}_{x,x'}[\Phi(x)^\top\Phi(x')\,\Phi'(x')^\top\Phi'(x)]
  = \mathrm{E}_{x,x'}\big[\mathrm{Tr}[\Phi(x)^\top\Phi(x')\,\Phi'(x')^\top\Phi'(x)]\big]
  = \big\langle \mathrm{E}_x[\Phi(x)\Phi'(x)^\top],\, \mathrm{E}_{x'}[\Phi(x')\Phi'(x')^\top]\big\rangle_F = \|U\|_F^2 \ge 0,

where U = E_x[Φ(x)Φ'(x)^⊤].

The following similarly defines the alignment between two kernel matrices K and K' based on a finite sample S = (x_1, ..., x_m) drawn according to D.

Definition 2. Let K ∈ R^{m×m} and K' ∈ R^{m×m} be two kernel matrices such that ‖K_c‖_F ≠ 0 and ‖K'_c‖_F ≠ 0. Then, the alignment between K and K' is defined by

  \hat\rho(K, K') = \frac{\langle K_c, K'_c\rangle_F}{\|K_c\|_F\,\|K'_c\|_F}.

Here too, by the Cauchy-Schwarz inequality, ρ̂(K,K') ∈ [−1, 1], and in fact ρ̂(K,K') ≥ 0 since the Frobenius product of any two positive semi-definite matrices K and K' is non-negative. Indeed, for such matrices, there exist matrices U and V such that K = UU^⊤ and K' = VV^⊤. The statement follows from ⟨K, K'⟩_F = Tr(UU^⊤VV^⊤) = Tr((U^⊤V)^⊤(U^⊤V)) ≥ 0.

Our definitions of alignment between kernel functions or between kernel matrices differ from those originally given by Cristianini et al. (2001; 2002):

  A = \frac{\mathrm{E}[KK']}{\sqrt{\mathrm{E}[K^2]\,\mathrm{E}[K'^2]}},\qquad
  \hat A = \frac{\langle K, K'\rangle_F}{\|K\|_F\,\|K'\|_F},

which are thus defined in terms of K and K' instead of K_c and K'_c, and similarly for matrices. This may appear to be a technicality, but it is in fact a critical difference. Without that centering, the definition of alignment does not correlate well with performance. To see this, consider the standard case where K' is the target label kernel, that is K'(x,x') = yy', with y the label of x and y' the label of x', and examine the following simple example in dimension two (X = R²), where K(x,x') = x·x' + 1 and where the distribution, D, is defined by a fraction α ∈ [0,1] of all points being at (−1, 0) and labeled with −1, and the remaining points at (1, 0) with label +1. Clearly, for any value of α ∈ [0,1], the problem is separable, for example with the simple vertical line going through the origin, and one would expect the alignment to be 1. However, the alignment A is never equal to one except for α = 0 or α = 1 and, even for the balanced case where α = 1/2, its value is A = 1/√2 ≈ .707 < 1. In contrast, with our definition, ρ(K,K') = 1 for all α ∈ [0,1], see Figure 1. This mismatch between Â (or A) and the performance values can also be frequently seen in experiments. Our empirical results in several tasks (not included due to lack of space) show that Â measured on the test set does not correlate well with the performance achieved. Instances of this problem have also been noticed by Meila (2003) and Pothin & Richard (2008), who have suggested various (input) data translation methods, and by Cristianini et al. (2002), who observed an issue for unbalanced data sets.

[Figure 1: left panel, the two-point distribution D; right panel, alignment as a function of α for the two definitions ("This paper" vs. "Cristianini et al."). Caption: Alignment values computed for two different definitions of alignment: A = [(1 + (1 − 2α)²)/2]^{1/2} in black, ρ = 1 in blue. In this simple two-dimensional example, a fraction α of the points are at (−1, 0) and have the label −1. The remaining points are at (1, 0) and have the label +1.]

The definitions we are adopting are general and require centering for both kernels K and K'. The notion of alignment seeks to capture the correlation between the random variables K(x,x') and K'(x,x') and one could think it natural, as for the standard correlation coefficients, to consider the following definition:

  \frac{\mathrm{E}[(K-\mathrm{E}[K])(K'-\mathrm{E}[K'])]}{\sqrt{\mathrm{E}[(K-\mathrm{E}[K])^2]\,\mathrm{E}[(K'-\mathrm{E}[K'])^2]}}.

However, centering the kernel values is not directly relevant to linear predictions in feature space, while our definition of alignment, ρ, is precisely related to that. Also, as already shown in Section 2.1, centering in the feature space implies the centering of the kernel values, since E[K_c] = 0 and (1/m²) Σ_{i,j=1}^m [K_c]_{ij} = 0 for any kernel K and kernel matrix K. Conversely, however, centering of the kernel values does not imply centering in feature space.
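The effect of centering on the example of Figure 1 can be reproduced numerically. The sketch below (illustrative only, with a finite sample standing in for the distribution D) computes both the centered alignment ρ̂ of Definition 2 and the uncentered Â: the former is 1 for any fraction α, while the latter stays below 1.

```python
import numpy as np

def center(K):
    m = K.shape[0]
    U = np.eye(m) - np.ones((m, m)) / m
    return U @ K @ U

def alignment(K1, K2, centered=True):
    """Alignment of two kernel matrices, with or without centering."""
    if centered:
        K1, K2 = center(K1), center(K2)
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

# Example of Figure 1: a fraction alpha of points at (-1, 0) with label -1,
# the rest at (1, 0) with label +1, and K(x, x') = x.x' + 1.
m, alpha = 100, 0.3
n_neg = int(alpha * m)
X = np.vstack([np.tile([-1.0, 0.0], (n_neg, 1)),
               np.tile([1.0, 0.0], (m - n_neg, 1))])
y = np.concatenate([-np.ones(n_neg), np.ones(m - n_neg)])
K, KY = X @ X.T + 1.0, np.outer(y, y)

print(alignment(K, KY, centered=True))    # ~1.0 for any alpha
print(alignment(K, KY, centered=False))   # < 1 unless alpha is 0 or 1
```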
3. Theoretical results

This section establishes several important properties of the alignment ρ and its empirical estimate ρ̂: we give a concentration bound of the form |ρ̂ − ρ| ≤ O(1/√m), and show the existence of good prediction hypotheses both for classification and regression, in the presence of high alignment.

3.1. Concentration bound

Our concentration bound differs from that of Cristianini et al. (2001) both because our definition of alignment is different and because we give a bound directly on the quantity of interest |ρ̂ − ρ|. Instead, Cristianini et al. give a bound on the difference between Â and the quantity obtained from Â by replacing each Frobenius product with its expectation over samples of size m. The following proposition gives a bound on the essential quantities appearing in the definition of the alignments. The proof is given in a longer version of this paper.

Proposition 1. Let K and K' denote kernel matrices associated to the kernel functions K and K' for a sample of size m drawn according to D. Assume that for any x ∈ X, K(x,x) ≤ R² and K'(x,x) ≤ R². Then, for any δ > 0, with probability at least 1 − δ, the following inequality holds:

  \Big|\frac{\langle K_c, K'_c\rangle_F}{m^2} - \mathrm{E}[K_c K'_c]\Big| \le \frac{18R^4}{m} + 24R^4\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.

Theorem 1. Under the assumptions of Proposition 1, and further assuming that the conditions of Definitions 1-2 are satisfied for ρ(K,K') and ρ̂(K,K'), for any δ > 0, with probability at least 1 − δ, the following inequality holds:

  |\hat\rho(K,K') - \rho(K,K')| \le 18\beta\Big[\frac{3}{m} + 4\sqrt{\frac{\log\frac{6}{\delta}}{2m}}\Big],

with β = max(R⁴/E[K_c²], R⁴/E[K'_c²]).

Proof. To shorten the presentation, we first simplify the notation for the alignments as follows:

  \rho(K,K') = \frac{b}{\sqrt{aa'}} \quad\text{and}\quad \hat\rho(K,K') = \frac{\hat b}{\sqrt{\hat a\hat a'}},

with b = E[K_c K'_c], a = E[K_c²], a' = E[K'_c²] and, similarly, b̂ = (1/m²)⟨K_c, K'_c⟩_F, â = (1/m²)‖K_c‖²_F, and â' = (1/m²)‖K'_c‖²_F. By Proposition 1 and the union bound, for any δ > 0, with probability at least 1 − δ, all three differences â − a, â' − a', and b̂ − b are bounded in absolute value by ε = 18R⁴/m + 24R⁴√(log(6/δ)/(2m)). Using the definitions of ρ and ρ̂, we can write:

  |\hat\rho(K,K') - \rho(K,K')| = \Big|\frac{\hat b}{\sqrt{\hat a\hat a'}} - \frac{b}{\sqrt{aa'}}\Big|
  = \frac{\big|(\hat b - b)\sqrt{aa'} - b(\sqrt{\hat a\hat a'} - \sqrt{aa'})\big|}{\sqrt{\hat a\hat a'}\,\sqrt{aa'}}
  = \Big|\frac{\hat b - b}{\sqrt{\hat a\hat a'}} - \rho(K,K')\,\frac{\hat a\hat a' - aa'}{\sqrt{\hat a\hat a'}(\sqrt{\hat a\hat a'} + \sqrt{aa'})}\Big|.

Since ρ(K,K') ∈ [0,1], it follows that

  |\hat\rho(K,K') - \rho(K,K')| \le \frac{|\hat b - b|}{\sqrt{\hat a\hat a'}} + \frac{|\hat a\hat a' - aa'|}{\sqrt{\hat a\hat a'}(\sqrt{\hat a\hat a'} + \sqrt{aa'})}.

Rewriting the right-hand side to make the differences â − a and â' − a' appear, using ââ' − aa' = (â − a)a' + â(â' − a'), bounding each difference by ε, and treating separately the cases ââ' ≥ aa' and ââ' ≤ aa', one obtains in both cases a right-hand side of at most 3ε max(1/a, 1/a'). In view of the expression of ε, this gives the statement of the theorem.
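The O(1/√m) convergence suggested by Theorem 1 can be observed empirically. The following sketch is illustrative only: Gaussian input data and an arbitrary pair of kernels, not an experiment from the paper. It estimates ρ̂ on samples of increasing size and reports its spread across independent draws, which shrinks roughly as 1/√m.

```python
import numpy as np

def center(K):
    m = K.shape[0]
    U = np.eye(m) - np.ones((m, m)) / m
    return U @ K @ U

def alignment(K1, K2):
    K1c, K2c = center(K1), center(K2)
    return np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

def gaussian_kernel(X, gamma=0.5):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
for m in [50, 200, 800]:
    vals = []
    for _ in range(5):   # a few independent samples per size
        X = rng.normal(size=(m, 4))
        vals.append(alignment(X @ X.T, gaussian_kernel(X)))
    print(m, np.mean(vals), np.std(vals))   # spread decreases with m
```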
3.2. Existence of good predictors

For classification and regression tasks, the target kernel is based on the labels and defined by K_Y(x,x') = yy', where we denote by y the label of point x and y' that of x'. This section shows the existence of predictors with high accuracy both for classification and regression when the alignment ρ(K, K_Y) between the kernel K and K_Y is high.

In the regression setting, we shall assume that the labels have been first normalized by dividing by the standard deviation (assumed finite), thus E[y²] = 1. In classification, y = ±1. Let h∗ denote the hypothesis defined for all x ∈ X by

  h^*(x) = \frac{\mathrm{E}_{x'}[y' K_c(x,x')]}{\sqrt{\mathrm{E}[K_c^2]}}.

Observe that by definition of h∗, E_x[y h∗(x)] = ρ(K, K_Y). For any x ∈ X, define Γ(x) = E_{x'}[K_c²(x,x')] / E_{x,x'}[K_c²(x,x')] and Γ = max_x Γ(x). The following result shows that the hypothesis h∗ has high accuracy when the kernel alignment is high and Γ not too large.¹

(¹ A version of this result was presented by Cristianini et al. (2001; 2002) for the so-called Parzen window solution and non-centered kernels, but their proof implicitly relies on the equality [E_{x,x'}[K²(x,x')] / max_x E_{x'}[K²(x,x')]]^{1/2} = 1, which holds only if K is constant.)

Theorem 2 (classification). Let R(h∗) = Pr[y h∗(x) < 0] denote the error of h∗ in binary classification. For any kernel K such that 0 < E[K_c²] < +∞, the following holds:

  R(h^*) \le 1 - \rho(K, K_Y)/\sqrt{\Gamma}.

Proof. Note that for all x ∈ X,

  |y\,h^*(x)| = \frac{|y\,\mathrm{E}_{x'}[y' K_c(x,x')]|}{\sqrt{\mathrm{E}[K_c^2]}}
  \le \frac{\sqrt{\mathrm{E}_{x'}[y'^2]\,\mathrm{E}_{x'}[K_c^2(x,x')]}}{\sqrt{\mathrm{E}[K_c^2]}}
  = \sqrt{\frac{\mathrm{E}_{x'}[K_c^2(x,x')]}{\mathrm{E}_{x,x'}[K_c^2(x,x')]}} \le \sqrt{\Gamma}.

In view of this inequality, and the fact that E_x[y h∗(x)] = ρ(K, K_Y), we can write:

  1 - R(h^*) = \Pr[y\,h^*(x) \ge 0] = \mathrm{E}\big[1_{\{y h^*(x)\ge 0\}}\big]
  \ge \mathrm{E}\Big[\frac{y\,h^*(x)}{\sqrt{\Gamma}}\,1_{\{y h^*(x)\ge 0\}}\Big]
  \ge \mathrm{E}\Big[\frac{y\,h^*(x)}{\sqrt{\Gamma}}\Big] = \rho(K, K_Y)/\sqrt{\Gamma},

where 1_ω is the indicator variable of an event ω.

A probabilistic version of the theorem can be straightforwardly derived by noting that by Markov's inequality, for any δ > 0, with probability at least 1 − δ, Γ(x) ≤ 1/δ.

Theorem 3 (regression). Let R(h∗) = E_x[(y − h∗(x))²] denote the error of h∗ in regression. For any kernel K such that 0 < E[K_c²] < +∞, the following holds:

  R(h^*) \le 2\big(1 - \rho(K, K_Y)\big).

Proof.
By the Cauchy-Schwarz inequality, it follows that:

  \mathrm{E}_x[h^{*2}(x)] = \frac{\mathrm{E}_x\big[\mathrm{E}_{x'}[y' K_c(x,x')]^2\big]}{\mathrm{E}[K_c^2]}
  \le \frac{\mathrm{E}_x\big[\mathrm{E}_{x'}[y'^2]\,\mathrm{E}_{x'}[K_c^2(x,x')]\big]}{\mathrm{E}[K_c^2]}
  = \frac{\mathrm{E}_{x'}[y'^2]\,\mathrm{E}_{x,x'}[K_c^2(x,x')]}{\mathrm{E}[K_c^2]} = \mathrm{E}[y'^2] = 1.

Using again the fact that E_x[y h∗(x)] = ρ(K, K_Y), the error of h∗ can be bounded as follows:

  \mathrm{E}_x[(y - h^*(x))^2] = \mathrm{E}_x[h^*(x)^2] + \mathrm{E}_x[y^2] - 2\,\mathrm{E}_x[y\,h^*(x)]
  \le 1 + 1 - 2\rho(K, K_Y).

4. Algorithms

This section discusses two-stage algorithms for learning kernels in the form of linear combinations of p base kernels K_k, k ∈ [1, p]. In all cases, the final hypothesis learned belongs to the reproducing kernel Hilbert space associated to a kernel K_μ = Σ_{k=1}^p μ_k K_k, where the mixture weights are selected subject to the condition μ ≥ 0, which guarantees that K_μ is a PDS kernel, and a condition on the norm of μ, ‖μ‖ = Λ > 0, where Λ is a regularization parameter. In the first stage, these algorithms determine the mixture weights μ. In the second stage, they train a kernel-based algorithm, e.g., SVMs for classification, or KRR for regression, in combination with the kernel K_μ, to learn a hypothesis h. Thus, the algorithms differ only by the first stage, where K_μ is determined, which we briefly describe.

Uniform combination (unif): this is the most straightforward method, which consists of choosing equal mixture weights; the kernel matrix used is then proportional to Σ_{k=1}^p K_k. Nevertheless, improving upon the performance of this method has been surprisingly difficult for standard (one-stage) learning kernel algorithms (Cortes, 2009).

Independent alignment-based method (align): this is a simple but efficient method which consists of using the training sample to independently compute the alignment between each kernel matrix K_k and the target kernel matrix K_Y = yy^⊤, based on the labels y, and to choose each mixture weight μ_k proportional to that alignment. Thus, the resulting kernel matrix is K_μ ∝ Σ_{k=1}^p ρ̂(K_k, K_Y) K_k.

Alignment maximization algorithms (alignf): the independent alignment-based method ignores the correlation between the base kernel matrices. The alignment maximization method takes these correlations into account. It determines the mixture weights μ_k jointly by seeking to maximize the alignment between the convex combination kernel K_μ = Σ_{k=1}^p μ_k K_k and the target kernel K_Y = yy^⊤, as suggested by Cristianini et al. (2001); Kandola et al. (2002a) and later studied by Lanckriet et al. (2004), who showed that the problem can be solved as a QCQP. In what follows, we present even more efficient algorithms for computing the weights μ_k by showing that the problem can be reduced to a simple QP. We also examine the case of a non-convex linear combination, where components of μ can be negative, and show that the problem then admits a closed-form solution. We start with this linear combination case and partially use that solution to obtain the solution of the convex combination.

4.1. Alignment maximization algorithm - linear combination

We can assume without loss of generality that the centered base kernel matrices K_{kc} are independent since otherwise we can select an independent subset. This condition ensures that ‖K_{μc}‖_F > 0 for an arbitrary μ and that ρ̂(K_μ, yy^⊤) is well defined (Definition 2). By the properties of centering, ⟨K_{μc}, K_{Yc}⟩_F = ⟨K_{μc}, K_Y⟩_F. Thus, since ‖K_{Yc}‖_F does not depend on μ, alignment maximization can be written as the following optimization problem:

  \max_{\mu\in\mathcal{M}} \hat\rho(K_\mu, yy^\top) = \max_{\mu\in\mathcal{M}} \frac{\langle K_{\mu c}, yy^\top\rangle_F}{\|K_{\mu c}\|_F},   (1)

where M = {μ: ‖μ‖₂ = 1}. A similar set can be defined via norm-1 instead of norm-2. As we shall see, however, the problem can be solved in the same way in both cases. Note that, by definition of centering, K_{μc} = U_m K_μ U_m with U_m = I − 11^⊤/m, thus K_{μc} = Σ_{k=1}^p μ_k U_m K_k U_m = Σ_{k=1}^p μ_k K_{kc}. Let a denote the vector (⟨K_{1c}, yy^⊤⟩_F, ..., ⟨K_{pc}, yy^⊤⟩_F)^⊤ and M the matrix defined by M_{kl} = ⟨K_{kc}, K_{lc}⟩_F, for k, l ∈ [1, p]. Note that since the base kernels are assumed independent, matrix M is invertible. Also, in view of the non-negativity of the Frobenius product of PSD matrices shown in Section 2.2, the entries of a and M are all non-negative. Observe also that M is a symmetric PSD matrix since for any vector v = (v_1, ..., v_p)^⊤,

  v^\top M v = \sum_{k,l=1}^p v_k v_l\,\mathrm{Tr}[K_{kc}K_{lc}]
  = \mathrm{Tr}\Big[\Big(\sum_{k=1}^p v_k K_{kc}\Big)\Big(\sum_{l=1}^p v_l K_{lc}\Big)\Big]
  = \Big\|\sum_{k=1}^p v_k K_{kc}\Big\|_F^2 \ge 0.

Proposition 2. The solution μ★ of the optimization problem (1) is given by μ★ = M^{-1}a / ‖M^{-1}a‖.

Proof. With the notation introduced, problem (1) can be rewritten as μ★ = argmax_{‖μ‖₂=1} μ^⊤a / √(μ^⊤Mμ). Thus, clearly, the solution must verify μ★^⊤a ≥ 0. We will square the objective and yet not enforce this condition since, as we shall see, it will be verified by the solution we find. Therefore, we consider the problem

  \mu^\star = \mathop{\mathrm{argmax}}_{\|\mu\|_2=1} \frac{(\mu^\top a)^2}{\mu^\top M\mu}
            = \mathop{\mathrm{argmax}}_{\|\mu\|_2=1} \frac{\mu^\top aa^\top\mu}{\mu^\top M\mu}.

In the final equality, we recognize the general Rayleigh quotient. Let ν = M^{1/2}μ and ν★ = M^{1/2}μ★, then

  \nu^\star = \mathop{\mathrm{argmax}}_{\|M^{-1/2}\nu\|_2=1} \frac{\nu^\top M^{-1/2}aa^\top M^{-1/2}\nu}{\|\nu\|_2^2}
            = \mathop{\mathrm{argmax}}_{\|M^{-1/2}\nu\|_2=1} \frac{(\nu^\top M^{-1/2}a)^2}{\|\nu\|_2^2}.

Thus, ν★ is in Vec(M^{-1/2}a), the span of M^{-1/2}a, with ‖M^{-1/2}ν★‖₂ = 1. This yields immediately μ★ = M^{-1}a / ‖M^{-1}a‖, which verifies μ★^⊤a = a^⊤M^{-1}a / ‖M^{-1}a‖ ≥ 0 since M and M^{-1} are PSD.
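For concreteness, here is a sketch (assuming NumPy; not the authors' implementation) of the first stage for the align method and for the closed-form solution of Proposition 2, starting from a list of uncentered m x m base kernel matrices and the label vector y.

```python
import numpy as np

def center(K):
    m = K.shape[0]
    U = np.eye(m) - np.ones((m, m)) / m
    return U @ K @ U

def alignment(K1c, K2c):
    """Centered alignment (Definition 2) for already-centered matrices."""
    return np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

def align_weights(base_kernels, y):
    """align: mu_k proportional to the alignment of K_k with the target yy^T."""
    KYc = center(np.outer(y, y))
    return np.array([alignment(center(K), KYc) for K in base_kernels])

def alignf_linear_weights(base_kernels, y):
    """alignf, linear-combination case (Proposition 2): mu* = M^{-1}a / ||M^{-1}a||."""
    KY = np.outer(y, y)
    Kc = [center(K) for K in base_kernels]
    a = np.array([np.sum(Kkc * KY) for Kkc in Kc])               # a_k = <K_kc, yy^T>_F
    M = np.array([[np.sum(Ki * Kj) for Kj in Kc] for Ki in Kc])  # M_kl = <K_kc, K_lc>_F
    mu = np.linalg.solve(M, a)
    return mu / np.linalg.norm(mu)
```

The combined kernel Σ_k μ_k K_k is then passed to the second-stage learner, as in the sketch of Section 1.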
4.2. Alignment maximization algorithm - convex combination

In view of the proof of Proposition 2, the alignment maximization problem with the set M = {μ: ‖μ‖₂ = 1, μ ≥ 0} can be written as

  \mu^\star = \mathop{\mathrm{argmax}}_{\mu\in\mathcal{M}} \frac{\mu^\top aa^\top\mu}{\mu^\top M\mu}.   (2)

The following proposition shows that the problem can be reduced to solving a simple QP.

Proposition 3. Let v★ be the solution of the following QP:

  \min_{v\ge 0}\; v^\top M v - 2\,v^\top a.   (3)

Then, the solution μ★ of the alignment maximization problem (2) is given by μ★ = v★ / ‖v★‖.

Proof. Note that the objective function of problem (2) is invariant to scaling. The constraint ‖μ‖ = 1 only serves to enforce 0 < ‖μ‖ < +∞. Thus, using the same change of variable ν = M^{1/2}μ as in the proof of Proposition 2, we can instead solve the following problem, from which we can retrieve the solution via normalization:

  \mathop{\mathrm{argmax}}_{0 < \|M^{-1/2}\nu\|_2 < +\infty,\; M^{-1/2}\nu\ge 0} \frac{(\nu^\top M^{-1/2}a)^2}{\|\nu\|_2^2}.

Equivalently, we can solve the following problem for any finite λ > 0:

  \max_{\|u\|_2 = \lambda,\; M^{-1/2}u\ge 0} u^\top M^{-1/2}a.

Observe that for M^{-1/2}u ≥ 0 the inner product is non-negative: u^⊤M^{-1/2}a = (M^{-1/2}u)^⊤a ≥ 0, since the entries of a are non-negative. Furthermore, it can be written as follows:

  u^\top M^{-1/2}a = -\tfrac{1}{2}\big[\|u - M^{-1/2}a\|^2 - \|u\|^2 - \|M^{-1/2}a\|^2\big]
                   = -\tfrac{1}{2}\|u - M^{-1/2}a\|^2 + \tfrac{\lambda^2}{2} + \tfrac{1}{2}\|M^{-1/2}a\|^2.

Thus, the problem becomes equivalent to the minimization:

  \min_{\|u\|_2 = \lambda,\; M^{-1/2}u\ge 0} \|u - M^{-1/2}a\|^2.   (4)

Now, we can omit the condition on the norm of u since (4) holds for arbitrary finite λ > 0 and since neither u = 0 nor any u with infinite norm can be the solution even without this condition. Thus, we can now consider instead:

  \min_{M^{-1/2}u\ge 0} \|u - M^{-1/2}a\|^2.

The change of variable u = M^{1/2}v leads to min_{v≥0} ‖M^{1/2}v − M^{-1/2}a‖². This is a standard least-square regression problem with non-negativity constraints, a simple and widely studied QP for which several families of algorithms have been designed. Expanding the terms, we obtain the equivalent problem:

  \min_{v\ge 0}\; v^\top M v - 2\,v^\top a.

Note that this QP problem does not require a matrix inversion of M. Also, it is not hard to see that this problem is equivalent to solving a hard margin SVM problem, thus, any SVM solver can also be used to solve it. A similar problem with the non-centered definition of alignment is treated by Kandola et al. (2002b), but their optimization solution differs from ours and requires cross-validation.
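The QP (3) can be solved with any NNLS, QP, or SVM solver. As a minimal self-contained sketch (an assumption of this presentation, not the solver used in the paper), projected gradient descent on F(v) = v^⊤Mv − 2v^⊤a with projection onto v ≥ 0 suffices, with M and a built as in the previous sketch.

```python
import numpy as np

def alignf_weights(M, a, n_iter=5000):
    """Solve the QP of Proposition 3, min_{v >= 0} v^T M v - 2 v^T a, by projected
    gradient descent, then return the normalized weights mu* = v* / ||v*||."""
    p = len(a)
    v = np.full(p, 1.0 / p)
    # step size 1/L, with L an upper bound on the Lipschitz constant of the gradient
    eta = 1.0 / (2.0 * np.linalg.norm(M, 2) + 1e-12)
    for _ in range(n_iter):
        grad = 2.0 * (M @ v - a)              # gradient of v^T M v - 2 v^T a
        v = np.maximum(v - eta * grad, 0.0)   # projection onto the nonnegative orthant
    return v / (np.linalg.norm(v) + 1e-12)
```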
5. Experiments

This section compares the performance of several learning kernel algorithms for classification and regression. We compare the algorithms unif, align, and alignf from Section 4, as well as the following one-stage algorithms:

Norm-1 regularized combination (l1-svm): this algorithm optimizes the SVM objective

  \min_\mu \max_\alpha\; 2\,\alpha^\top\mathbf{1} - \alpha^\top Y K_\mu Y\alpha
  \quad\text{subject to: } \mu\ge 0,\; \mathrm{Tr}[K_\mu]\le\Lambda,\; \alpha^\top y = 0,\; 0\le\alpha\le C,

as described by Lanckriet et al. (2004). Here, Y is the diagonal matrix constructed from the labels y and C is the regularization parameter of the SVM.

Norm-2 regularized combination (l2-krr): this algorithm optimizes the kernel ridge regression objective

  \min_\mu \max_\alpha\; -\lambda\,\alpha^\top\alpha - \alpha^\top K_\mu\alpha + 2\,\alpha^\top y
  \quad\text{subject to: } \mu\ge 0,\; \|\mu - \mu_0\|_2\le\Lambda,

as described in Cortes et al. (2009a). Here, λ is the regularization parameter of KRR, and μ0 is an additional regularization parameter for the kernel selection.

In all experiments, the error measures reported are for 5-fold cross validation, where, in each trial, three folds are used for training, one is used for validation, and one for testing. For the two-stage methods, the same training and validation data is used for both stages of the learning. The regularization parameter Λ is chosen via a grid search based on the performance on the validation set, while the regularization parameters C and λ are fixed since only the ratios C/Λ and λ/Λ matter. The μ0 parameter is set to zero in Section 5.1, and is chosen to be uniform in Section 5.2.

Table 1. Error measures (top of each cell) and alignment values (bottom of each cell) for (A) unif, (B) one-stage l2-krr or l1-svm, (C) align and (D) alignf with kernels built from linear combinations of Gaussian base kernels. The choice of γ0, γ1 is listed in the row labeled γ, and m is the size of the dataset used. Shown with ±1 standard deviation (in parentheses) measured by 5-fold cross-validation.

             REGRESSION     CLASSIFICATION
             KINEMAT.       IONOSPH.       GERMAN         SPAMBASE       SPLICE
m            1000           351            1000           1000           1000
γ            -3, 3          -3, 3          -4, 3          -12, -7        -9, -3
A  error     .467(.085)     .138(.005)     25.9(1.8)      18.7(2.8)      15.2(2.2)
   align.    .242(.021)     .158(.013)     .089(.008)     .138(.031)     .122(.011)
B  error     .457(.085)     .137(.005)     26.0(2.6)      20.9(2.80)     15.3(2.5)
   align.    .248(.022)     .155(.012)     .082(.003)     .099(.024)     .105(.006)
C  error     .445(.086)     .125(.004)     25.5(1.5)      18.6(2.6)      15.1(2.4)
   align.    .257(.024)     .173(.016)     .089(.008)     .140(.031)     .123(.011)
D  error     .442(.087)     .115(.004)     24.2(1.5)      18.0(2.4)      13.9(1.3)
   align.    .273(.030)     .176(.017)     .093(.009)     .146(.028)     .124(.011)

5.1. General kernel combinations

In the first set of experiments, we consider combinations of Gaussian kernels of the form K_γ(x_i, x_j) = exp(−γ‖x_i − x_j‖²), with varying bandwidth parameter γ ∈ {2^{γ0}, 2^{γ0+1}, ..., 2^{γ1−1}, 2^{γ1}}. The values γ0 and γ1 are chosen such that the base kernels are sufficiently different in alignment and performance. Each base kernel is centered and normalized to have trace equal to one. We test the algorithms on several datasets taken from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) and Delve datasets (http://www.cs.toronto.edu/delve/data/datasets.html). Table 1 summarizes our results. For classification, we compare against the l1-svm method and report the misclassification percentage. For regression, we compare against the l2-krr method and report RMSE.

In general, we see that performance and alignment are well correlated. In all datasets, we see improvement over the uniform combination as well as the one-stage kernel learning algorithms. Note that although the align method often increases the alignment of the final kernel, as compared to the uniform combination, the alignf method gives the best alignment since it directly maximizes this quantity. Nonetheless, align provides an inexpensive heuristic that increases the alignment and performance of the final combination kernel. To the best of our knowledge, these are the first kernel combination experiments for alignment with general base kernels. In the next section, we also examine rank-1 kernels, although not generated from a spectral decomposition as in (Cristianini et al., 2001).
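In outline, the base kernels of this experiment can be constructed as follows (a sketch under the stated setup: Gaussian kernels with bandwidths 2^{γ0}, ..., 2^{γ1}, each centered and trace-normalized; the γ range below is a placeholder). The mixture weights would then be computed with the align or alignf sketches of Section 4, and the combined kernel passed to an SVM or KRR with a precomputed kernel for the second stage.

```python
import numpy as np

def center(K):
    m = K.shape[0]
    U = np.eye(m) - np.ones((m, m)) / m
    return U @ K @ U

def gaussian_base_kernels(X, gamma0=-3, gamma1=3):
    """Base kernels K_gamma(x_i, x_j) = exp(-gamma ||x_i - x_j||^2) for
    gamma in {2^gamma0, ..., 2^gamma1}, each centered and normalized to trace 1."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    kernels = []
    for g in 2.0 ** np.arange(gamma0, gamma1 + 1):
        Kc = center(np.exp(-g * d2))
        kernels.append(Kc / np.trace(Kc))
    return kernels
```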
Table 2. Error measures (top of each cell) and alignment values (bottom of each cell) for kernels built with rank-1 feature-based kernels on four sentiment analysis domains. Shown with ±1 standard deviation as measured by 5-fold cross-validation.

REGRESSION
              BOOKS          DVD            ELEC           KITCHEN
unif   error  1.442 ± .015   1.438 ± .033   1.342 ± .030   1.356 ± .016
       align. .029 ± .005    .029 ± .005    .038 ± .002    .039 ± .006
l2-krr error  1.414 ± .020   1.420 ± .034   1.318 ± .031   1.332 ± .016
       align. .031 ± .004    .031 ± .005    .042 ± .003    .044 ± .007
align  error  1.401 ± .035   1.414 ± .017   1.308 ± .033   1.312 ± .012
       align. .046 ± .006    .047 ± .005    .065 ± .004    .076 ± .008

CLASSIFICATION
              BOOKS          DVD            ELEC           KITCHEN
unif   error  25.8 ± 1.7     24.3 ± 1.5     18.8 ± 1.4     20.1 ± 2.0
       align. .030 ± .004    .030 ± .005    .040 ± .002    .039 ± .007
l1-svm error  28.6 ± 1.6     29.0 ± 2.2     23.8 ± 1.9     23.8 ± 2.2
       align. .029 ± .012    .038 ± .011    .051 ± .004    .060 ± .006
align  error  24.3 ± 2.0     21.4 ± 2.0     16.6 ± 1.6     17.2 ± 2.2
       align. .043 ± .003    .045 ± .005    .063 ± .004    .070 ± .010

5.2. Rank-1 kernel combinations

In this set of experiments we use the sentiment analysis dataset from (http://www.seas.upenn.edu/~mdredze/datasets/sentiment/): books, dvd, electronics and kitchen. Each domain has 2,000 examples. In the regression setting, the goal is to predict a rating between 1 and 5, while for classification the goal is to discriminate positive (ratings ≥ 4) from negative (ratings ≤ 2) reviews. We use rank-1 kernels based on the 4,000 most frequent bigrams. The k-th base kernel, K_k, corresponds to the k-th bigram count v_k, K_k = v_k v_k^⊤. Each base kernel is normalized to have trace 1 and the labels are centered.

The alignf method returns a sparse weight vector due to the constraint μ ≥ 0. As is demonstrated by the performance of the l1-svm method (Table 2) and also previously observed by Cortes et al. (2009a), a sparse weight vector μ does not generally offer an improvement over the uniform combination in the rank-1 setting. Thus, we focus on the performance of align and compare it to unif and the one-stage learning methods. Table 2 shows that align significantly improves both the alignment and the error percentage over unif and also improves somewhat over the one-stage l2-krr algorithm. Although the sparse weighting provided by l1-svm improves the alignment in certain cases, it does not improve performance.

6. Conclusion

We presented a series of novel theoretical, algorithmic, and empirical results for a two-stage learning kernel algorithm based on a notion of alignment. Our experiments show a consistent improvement of the performance over previous learning kernel techniques, as well as over the straightforward uniform kernel combination, which has been difficult to surpass in the past. These improvements could suggest a better one-stage algorithm with a regularization term taking into account the alignment quality of each base kernel.

References

Bach, F. Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS, 2008.

Balcan, M.-F. and Blum, A. On a theory of learning with similarity functions. In ICML, pp. 73-80, 2006.

Cortes, C. Invited talk: Can learning kernels help performance? In ICML, pp. 161, 2009.

Cortes, C. and Vapnik, V. Support-Vector Networks. Machine Learning, 20(3), 1995.

Cortes, C., Mohri, M., and Rostamizadeh, A. L2 regularization for learning kernels. In UAI, 2009a.

Cortes, C., Mohri, M., and Rostamizadeh, A. Learning non-linear combinations of kernels. In NIPS, 2009b.

Cortes, C., Mohri, M., and Rostamizadeh, A. Generalization bounds for learning kernels. In ICML, 2010.

Cristianini, N., Shawe-Taylor, J., Elisseeff, A., and Kandola, J. S. On kernel-target alignment. In NIPS, 2001.

Cristianini, N., Kandola, J. S., Elisseeff, A., and Shawe-Taylor, J. On kernel target alignment. www.supportvector.net/papers/alignment JMLR.ps, unpublished, 2002.

Kandola, J. S., Shawe-Taylor, J., and Cristianini, N. On the extensions of kernel alignment. Technical report 120, Dept. of Computer Science, Univ. of London, UK, 2002a.

Kandola, J. S., Shawe-Taylor, J., and Cristianini, N. Optimizing kernel alignment over combinations of kernels. Technical report 121, Univ. of London, UK, 2002b.

Lanckriet, G., Cristianini, N., Bartlett, P., El Ghaoui, L., and Jordan, M. Learning the kernel matrix with semidefinite programming. JMLR, 5, 2004.

Meila, M. Data centering in feature space. In AISTATS, 2003.

Micchelli, C. and Pontil, M. Learning the kernel function via regularization. JMLR, 6, 2005.

Ong, C. S., Smola, A., and Williamson, R. Learning the kernel with hyperkernels. JMLR, 6, 2005.

Pothin, J.-B. and Richard, C. Optimizing kernel alignment by data translation in feature space. In ICASSP, 2008.

Shawe-Taylor, J. and Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.

Srebro, N. and Ben-David, S. Learning bounds for support vector machines with learned kernels. In COLT, 2006.