Metric Embedding for Kernel Classification Rules

Bharath K. Sriperumbudur  BHARATHSV@UCSD.EDU
Omer A. Lang  OLANG@UCSD.EDU
Gert R. G. Lanckriet  GERT@ECE.UCSD.EDU
Department of Electrical and Computer Engineering, University of California, San Diego, CA 92093 USA.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

In this paper, we consider a smoothing kernel based classification rule and propose an algorithm for optimizing the performance of the rule by learning the bandwidth of the smoothing kernel along with a data-dependent distance metric. The data-dependent distance metric is obtained by learning a function that embeds an arbitrary metric space into a Euclidean space while minimizing an upper bound on the resubstitution estimate of the error probability of the kernel classification rule. By restricting this embedding function to a reproducing kernel Hilbert space, we reduce the problem to solving a semidefinite program and show the resulting kernel classification rule to be a variation of the $k$-nearest neighbor rule. We compare the performance of the kernel rule (using the learned data-dependent distance metric) to state-of-the-art distance metric learning algorithms (designed for $k$-nearest neighbor classification) on some benchmark datasets. The results show that the proposed rule has either better or as good classification accuracy as the other metric learning algorithms.

1. Introduction

Parzen window methods, also called smoothing kernel rules, are widely used in nonparametric density and function estimation, and are popularly known as kernel density and kernel regression estimates, respectively. In this paper, we consider these rules for classification. To this end, let us consider the binary classification problem of classifying $x \in \mathbb{R}^D$, given an i.i.d. training sample $\{(X_i, Y_i)\}_{i=1}^n$ drawn from some unknown distribution $\mathcal{D}$, where $X_i \in \mathbb{R}^D$ and $Y_i \in \{0,1\}$, $\forall\, i \in [n] := \{1, \ldots, n\}$. The kernel classification rule (Devroye et al., 1996, Chapter 10) is given by

$$g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^n \mathbf{1}_{\{Y_i=0\}}\, K\!\left(\frac{x - X_i}{h}\right) \ge \sum_{i=1}^n \mathbf{1}_{\{Y_i=1\}}\, K\!\left(\frac{x - X_i}{h}\right) \\ 1 & \text{otherwise}, \end{cases} \qquad (1)$$

where $K : \mathbb{R}^D \to \mathbb{R}$ is a kernel function, which is usually nonnegative and monotone decreasing along rays starting from the origin. The number $h > 0$ is called the smoothing factor, or bandwidth, of the kernel function, which provides some form of distance weighting. We warn the reader not to confuse the kernel function, $K$, with the reproducing kernel (Schölkopf & Smola, 2002) of a reproducing kernel Hilbert space (RKHS), which we will denote by $\mathcal{K}$.[1] When $K(x) = \mathbf{1}_{\{\|x\|_2 \le 1\}}(x)$ (sometimes called the naïve kernel), the rule is similar to the $k$-nearest neighbor ($k$-NN) rule, except that $k$ is different for each $X_i$ in the training set. The $k$-NN rule classifies each unlabeled example by the majority label among its $k$ nearest neighbors in the training set, whereas the kernel rule with the naïve kernel classifies each unlabeled example by the majority label among its neighbors that lie within a radius of $h$. Devroye and Krzyżak (1989) proved that for regular kernels (see Devroye et al. (1996, Definition 10.1)), if the smoothing parameter $h \to 0$ such that $nh^D \to \infty$ as $n \to \infty$, then the kernel classification rule is universally consistent. But, for a particular $n$, asymptotic results provide little guidance in the selection of $h$. On the other hand, selecting the wrong value of $h$ may lead to very poor error rates.

[1] Unlike $\mathcal{K}$, $K$ is not required to be a positive definite function. If $K$ is a positive definite function, then it corresponds to a translation-invariant kernel of some RKHS defined on $\mathbb{R}^D$. In such a case, the classification rule in Eq. (1) is similar to the ones that appear in the kernel machines literature.
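To make Eq. (1) concrete, the following is a minimal sketch of the binary rule for data in $\mathbb{R}^D$; the Gaussian and naïve kernels shown are illustrative choices discussed in this paper, and all function names are ours rather than the paper's.

```python
import numpy as np

def gaussian_kernel(u):
    # K(u) = exp(-||u||_2^2), one of the smoothing kernels discussed in the paper
    return np.exp(-np.sum(u ** 2, axis=-1))

def naive_kernel(u):
    # K(u) = 1{||u||_2 <= 1}; combined with the bandwidth h, this gives unit
    # weight to every training point within distance h of x and zero otherwise
    return (np.linalg.norm(u, axis=-1) <= 1.0).astype(float)

def kernel_rule_binary(x, X, Y, h, K=gaussian_kernel):
    """Kernel classification rule of Eq. (1): a weighted vote between the two
    classes with weights K((x - X_i)/h). X: (n, D) array, Y: (n,) array in {0, 1}."""
    w = K((x - X) / h)              # smoothing-kernel weight of each training point
    vote0 = np.sum(w[Y == 0])
    vote1 = np.sum(w[Y == 1])
    return 0 if vote0 >= vote1 else 1
```

With `naive_kernel`, the rule reduces to a majority vote among the training points that fall within radius $h$ of $x$, which is exactly the $k$-NN-like behavior described above.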
In fact, the crux of every nonparametric estimation problem is the choice of an appropriate smoothing factor. This is one of the questions that we address in this paper, by proposing an algorithm to learn an optimal $h$. The second question that we address is learning an optimal distance metric. For $x \in \mathbb{R}^D$, $K$ is usually a function of $\|x\|_2$. Some popular kernels include the Gaussian kernel, $K(x) = e^{-\|x\|_2^2}$; the Cauchy kernel, $K(x) = 1/(1 + \|x\|_2^{D+1})$; and the Epanechnikov kernel, $K(x) = (1 - \|x\|_2^2)\,\mathbf{1}_{\{\|x\|_2 \le 1\}}$.[2] Snapp and Venkatesh (1998) have shown that the finite-sample risk of the $k$-NN rule may be reduced, for large values of $n$, by using a weighted Euclidean metric, even though the infinite-sample risk is independent of the metric used. This has been experimentally confirmed by Xing et al. (2003); Shalev-Shwartz et al. (2004); Goldberger et al. (2005); Globerson and Roweis (2006); Weinberger et al. (2006). They all assume the metric to be $\rho(x, y) = \sqrt{(x-y)^T \Sigma (x-y)} = \|L(x-y)\|_2$ for $x, y \in \mathbb{R}^D$, where $\Sigma = L^T L$ is the weighting matrix, and optimize over $\Sigma$ to improve the performance of the $k$-NN rule. Since the kernel rule is similar to the $k$-NN rule, one would expect that its performance can be improved by making $K$ a function of $\|Lx\|_2$. Another way to interpret this is to find a linear transformation $L \in \mathbb{R}^{d \times D}$ so that the transformed data lie in a Euclidean metric space.

[2] The Gaussian kernel is a positive definite function on $\mathbb{R}^D$, while the Epanechnikov and naïve kernels are not.

Some applications call for natural distance measures that reflect the underlying structure of the data at hand. For example, when computing the distance between two images, the tangent distance would be more appropriate than the Euclidean distance. Similarly, while computing the distance between points that lie on a low-dimensional manifold in $\mathbb{R}^D$, the geodesic distance is a more natural distance measure than the Euclidean distance. Most of the time, since the true distance metric is either unknown or difficult to compute, the Euclidean or weighted Euclidean distance is used as a surrogate. In the absence of prior knowledge, the data may be used to select a suitable metric, which can lead to better classification performance. In addition, instead of $x \in \mathbb{R}^D$, suppose $x \in (\mathcal{X}, \rho)$, where $\mathcal{X}$ is a metric space with $\rho$ as its metric. One would like to extend the kernel classification rule to such $\mathcal{X}$. In this paper, we address these issues by learning a transformation that embeds the data from $\mathcal{X}$ into a Euclidean metric space while improving the performance of the kernel classification rule.

The rest of the paper is organized as follows. In §2, we formulate the multi-class kernel classification rule and propose learning a transformation, $\varphi$ (that embeds the training data into a Euclidean space), and the bandwidth parameter, $h$, by minimizing an upper bound on the resubstitution estimate of the error probability. To achieve this, in §3, we restrict $\varphi$ to an RKHS and derive a representation for it by invoking the generalized representer theorem. Since the resulting optimization problem is non-convex, in §4, we approximate it with a semidefinite program when $K$ is a naïve kernel.
We present experimental results in §5, wherein we show on benchmark datasets that the proposed algorithm performs better than $k$-NN and state-of-the-art metric learning algorithms developed for the $k$-NN rule.

2. Problem Formulation

Let $\{(X_i, Y_i)\}_{i=1}^n$ denote an i.i.d. training set drawn from some unknown distribution $\mathcal{D}$, where $X_i \in (\mathcal{X}, \rho)$ and $Y_i \in [L]$, with $L$ being the number of classes. The multi-class kernel classification rule is given by

$$g_n(x) = \arg\max_{l \in [L]} \sum_{i=1}^n \mathbf{1}_{\{Y_i = l\}}\, K_{X_i,h}(x), \qquad (2)$$

where $K_{x_0,h} : \mathcal{X} \to \mathbb{R}_+$ and $K_{x_0,h}(x) = \kappa\!\left(\frac{\rho(x, x_0)}{h}\right)$ for some nonnegative function, $\kappa$, with $\kappa(0) = 1$. The probability of error associated with the above rule is $L(g_n) := \mathrm{Pr}_{(X,Y)\sim\mathcal{D}}(g_n(X) \neq Y)$, where $Y$ is the true label associated with $X$. Since $\mathcal{D}$ is unknown, $L(g_n)$ cannot be computed directly but can only be estimated from the training set. The resubstitution estimate,[3] $\widehat{L}(g_n)$, which counts the number of errors committed on the training set by the classification rule, is given by $\widehat{L}(g_n) := \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\{g_n(X_i) \neq Y_i\}}$.

[3] Apart from the resubstitution estimate, holdout and deleted estimates can also be used to estimate the error probability. These estimates are usually more reliable but more involved than the resubstitution estimate.

As aforementioned, when $\mathcal{X} = \mathbb{R}^D$, $\rho$ is usually chosen to be $\|\cdot\|_2$. Previous works in distance metric learning learn a linear transformation $L : \mathbb{R}^D \to \mathbb{R}^d$, leading to the distance metric $\rho_L(X_i, X_j) := \|L X_i - L X_j\|_2 = \sqrt{(X_i - X_j)^T \Sigma (X_i - X_j)}$, where $\Sigma$ captures the variance-covariance structure of the data. In this work, our goal is to jointly learn $h$ and a measurable function, $\varphi \in \mathcal{C} := \{\varphi : \mathcal{X} \to \mathbb{R}^d\}$, so that the resubstitution estimate of the error probability is minimized, with $\rho_\varphi(X_i, X_j) := \|\varphi(X_i) - \varphi(X_j)\|_2$. Once $h$ and $\varphi$ are known, the kernel classification rule is completely specified by Eq. (2).

Devroye et al. (1996, Section 25.6) show that kernel rules of the form in Eq. (1) picked by minimizing $\widehat{L}(g_n)$ with smoothing factor $h > 0$ are generally inconsistent if $X$ is nonatomic. The same argument can be extended to the multi-class rule given by Eq. (2). To learn $\varphi$, simply minimizing $\widehat{L}(g_n)$ without any smoothness conditions on $\varphi$ can lead to kernel rules that overfit the training set. Such a $\varphi$ can be constructed as follows. Let $n_l$ be the number of points that belong to the $l$-th class. Suppose $n_1 = n_2 = \cdots = n_L$. Then, for any $h \ge 1$, choosing $\varphi(X) = Y_i$ when $X = X_i$ and $\varphi(X) = 0$ when $X \notin \{X_i\}_{i=1}^n$ clearly yields zero resubstitution error. However, such a choice of $\varphi$ leads to a kernel rule that always assigns unseen data to class 1, leading to very poor performance. Therefore, to avoid overfitting to the training set, the function class $\mathcal{C}$ should satisfy some smoothness properties so that highly non-smooth functions like the one defined above are not chosen while minimizing $\widehat{L}(g_n)$. To this end, we introduce a penalty functional, $\Omega : \mathcal{C} \to \mathbb{R}_+$, which penalizes non-smooth functions in $\mathcal{C}$ so that they are not selected.[4] Therefore, our goal is to learn $\varphi$ and $h$ by minimizing the regularized error functional given by

$$L_{\mathrm{reg}}(\varphi, h) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\{g_n^\varphi(X_i) \neq Y_i\}} + \lambda\, \Omega[\varphi], \qquad (3)$$

where $\varphi \in \mathcal{C}$, $h > 0$ and the regularization parameter $\lambda > 0$. $g_n^\varphi$ in Eq. (3) is given by Eq. (2), with $\rho$ replaced by $\rho_\varphi$. Minimizing $L_{\mathrm{reg}}(\varphi, h)$ is equivalent to minimizing the number of training instances for which $g_n^\varphi(X) \neq Y$, over the function class $\{\varphi : \Omega[\varphi] \le s\}$, for some appropriately chosen $s$.

[4] This is equivalent to restricting the size of the function class $\mathcal{C}$ from which $\varphi$ has to be selected.
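A minimal sketch of the multi-class rule in Eq. (2) and its resubstitution estimate, with the metric $\rho$ passed in as a function and the naïve choice $\kappa(t) = \mathbf{1}_{\{t \le 1\}}$ used for concreteness; the helper names below are ours.

```python
import numpy as np

def naive_kappa(t):
    # kappa(t) = 1{t <= 1}, so that K_{x0,h}(x) = 1{rho(x, x0) <= h}
    return (np.asarray(t) <= 1.0).astype(float)

def multiclass_kernel_rule(x, X, Y, h, rho, kappa=naive_kappa):
    """Eq. (2): g_n(x) = argmax_l sum_i 1{Y_i = l} kappa(rho(x, X_i) / h).
    X: iterable of training points, Y: (n,) array of labels."""
    d = np.array([rho(x, Xi) for Xi in X])      # distances in the metric space (X, rho)
    w = kappa(d / h)                            # smoothing-kernel weights
    labels = np.unique(Y)
    votes = np.array([w[Y == l].sum() for l in labels])
    return labels[np.argmax(votes)]

def resubstitution_estimate(X, Y, h, rho):
    # First term of Eq. (3): fraction of training points the rule itself misclassifies.
    preds = np.array([multiclass_kernel_rule(Xi, X, Y, h, rho) for Xi in X])
    return np.mean(preds != Y)
```

Any metric can be supplied through `rho` (for example a geodesic or tangent distance), which is exactly the flexibility the learned embedding $\varphi$ is meant to provide when a good metric is unknown.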
Consider $g_n(x)$ defined in Eq. (2). Suppose $Y_i = k$ for some $X_i$. Then $g_n(X_i) = k$ if and only if

$$\sum_{\{j : Y_j = k\}} K^\varphi_{X_j,h}(X_i) \;\ge\; \max_{l \in [L],\, l \neq k}\; \sum_{\{j : Y_j = l\}} K^\varphi_{X_j,h}(X_i), \qquad (4)$$

where the superscript $\varphi$ is used to indicate the dependence of $K$ on $\varphi$.[5] Since the right hand side of Eq. (4) involves the max function, which is not differentiable, we use the inequality $\max\{a_1, \ldots, a_m\} \le \sum_{i=1}^m a_i$ to upper bound[6] it with $\sum_{l \in [L],\, l \neq k} \sum_{j=1}^n \mathbf{1}_{\{Y_j = l\}} K^\varphi_{X_j}(X_i)$. Thus, to maximize $\sum_{i=1}^n \mathbf{1}_{\{g_n(X_i) = Y_i\}}$, we maximize its lower bound given by $\sum_{i=1}^n \mathbf{1}_{\left\{\sum_{j=1, j \neq i}^n \mathbf{1}_{\{Y_j = Y_i\}} K^\varphi_{X_j}(X_i) \,\ge\, \sum_{j=1}^n \mathbf{1}_{\{Y_j \neq Y_i\}} K^\varphi_{X_j}(X_i)\right\}}$,[7] resulting in a conservative rule. In the above bound, we use $j \neq i$ just to make sure that $\varphi(X_i)$ is not the only point within its neighborhood of radius $h$. Define $\varepsilon_{ij} := 2\delta_{Y_i, Y_j} - 1$, where $\delta$ represents the Kronecker delta. Then, the problem of learning $\varphi$ and $h$ by minimizing $L_{\mathrm{reg}}(\varphi, h)$ in Eq. (3) reduces to solving the following optimization problem,

$$\min_{\varphi, h} \left\{ \sum_{i=1}^n \psi_i(\varphi, h) + \tilde{\lambda}\, \Omega[\varphi] \;:\; \varphi \in \mathcal{C},\; h > 0 \right\}, \qquad (5)$$

where $\tilde{\lambda} = n\lambda$ and $\psi_i(\varphi, h)$, given by

$$\psi_i(\varphi, h) = \mathbf{1}_{\left\{\sum_{j=1, j \neq i}^n \mathbf{1}_{\{\varepsilon_{ij} = 1\}} K^\varphi_{X_j}(X_i) \,<\, \sum_{j=1}^n \mathbf{1}_{\{\varepsilon_{ij} = -1\}} K^\varphi_{X_j}(X_i)\right\}},$$

is an upper bound on $\mathbf{1}_{\{g_n(X_i) \neq Y_i\}}$ for $i \in [n]$. Solving the above non-convex optimization problem is NP-hard. Moreover, gradient-based optimization is difficult because the gradients are zero almost everywhere. In addition to the computational hardness, the problem in Eq. (5) is not theoretically solvable unless some assumptions about $\mathcal{C}$ are made. In the following section, we assume $\mathcal{C}$ to be an RKHS with reproducing kernel $\mathcal{K}$ and provide a representation for the optimum $\varphi$ that minimizes Eq. (5). We remind the reader that $K$ is a smoothing kernel, which is not required to be a positive definite function but takes on positive values, while $\mathcal{K}$ is a reproducing kernel, which is positive definite and can take negative values.

[5] To simplify the notation, from now onwards, we write $K^\varphi_{X_j,h}(X_i)$ as $K^\varphi_{X_j}(X_i)$, where the dependence of $K$ on $\varphi$ and $h$ is implicit.

[6] Another upper bound that can be used for the max function is $\max\{a_1, \ldots, a_m\} \le \log \sum_{i=1}^m e^{a_i}$.

[7] Using the upper bound of the max function in Eq. (4) makes the resulting kernel rule conservative, as there can be samples from the training set that do not satisfy this inequality but get correctly classified according to Eq. (2).

3. Regularization in Reproducing Kernel Hilbert Space

Many machine learning algorithms like SVMs, regularization networks, and logistic regression can be derived within the framework of regularization in an RKHS by choosing an appropriate empirical risk functional, with the penalizer being the squared RKHS norm (see Evgeniou et al. (2000)). In Eq. (5), we have extended this framework to kernel classification rules, wherein we compute the $\varphi \in \mathcal{C}$ and $h > 0$ that minimize an upper bound on the resubstitution estimate of the error probability. To this end, we choose $\mathcal{C}$ to be an RKHS with the penalty functional being the squared RKHS norm,[8] i.e., $\Omega[\varphi] = \|\varphi\|^2_{\mathcal{C}}$. By fixing $h$, the objective function $\sum_{i=1}^n \psi_i(\varphi, h)$ in Eq. (5) depends on $\varphi$ only through $\{\|\varphi(X_i) - \varphi(X_j)\|^2\}_{i,j=1}^n$. By letting $\sum_{i=1}^n \psi_i(\varphi, h) = \tilde{\psi}_h\!\left(\{\|\varphi(X_i) - \varphi(X_j)\|^2\}_{i,j=1}^n\right)$, where $\tilde{\psi}_h : \mathbb{R}^{n^2} \to \mathbb{R}_+$, Eq. (5) can be written as

$$\min_{\varphi \in \mathcal{C}}\; \min_{h > 0}\;\; \tilde{\psi}_h\!\left(\{\|\varphi(X_i) - \varphi(X_j)\|^2\}_{i,j=1}^n\right) + \tilde{\lambda}\, \|\varphi\|^2_{\mathcal{C}}. \qquad (6)$$

[8] Another choice for $\mathcal{C}$ could be the space of bounded Lipschitz functions with the penalty functional $\Omega[\varphi] = \|\varphi\|_L$, where $\|\varphi\|_L$ is the Lipschitz semi-norm of $\varphi$. With this choice of $\mathcal{C}$ and $\Omega$, von Luxburg and Bousquet (2004) studied large margin classification in metric spaces. One more interesting choice for $\mathcal{C}$ could be the space of Mercer kernel maps. It can be shown that solving for $\varphi$ in Eq. (5) with such a choice for $\mathcal{C}$ is equivalent to learning the kernel matrix associated with $\varphi$ and $\{X_i\}_{i=1}^n$. However, this approach is not useful as it does not allow for an out-of-sample extension.

The following result provides a representation for the minimizer of Eq. (6), and is proved in Appendix A. We remind the reader that $\varphi$ is a vector-valued mapping from $\mathcal{X}$ to $\mathbb{R}^d$.

Theorem 1 (Multi-output regularization). Suppose $\mathcal{C} = \{\varphi : \mathcal{X} \to \mathbb{R}^d\} = \mathcal{H}_1 \times \cdots \times \mathcal{H}_d$, where $\mathcal{H}_i$ is an RKHS with reproducing kernel $\mathcal{K}_i : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, and $\varphi = (\varphi_1, \ldots, \varphi_d)$ with $\varphi_i \in \mathcal{H}_i$, $\varphi_i : \mathcal{X} \to \mathbb{R}$. Then each minimizer $\varphi \in \mathcal{C}$ of Eq. (6) admits a representation of the form

$$\varphi_j = \sum_{i=1}^n c_{ij}\, \mathcal{K}_j(\cdot, X_i), \quad j \in [d], \qquad (7)$$

where $c_{ij} \in \mathbb{R}$, $i \in [n]$, $j \in [d]$, and $\sum_{i=1}^n c_{ij} = 0$ for every $j \in [d]$.
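To see how the finite representation of Theorem 1 is used, the following sketch evaluates $\varphi$ and the induced metric from a coefficient matrix $C = (c_{ij})$; the shared Gaussian reproducing kernel and all names are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def rbf(z, x, gamma=1.0):
    # an example reproducing kernel; any positive definite K_j fits the theorem
    return np.exp(-gamma * np.sum((np.asarray(z) - np.asarray(x)) ** 2))

def embed(z, X, C, kernel=rbf):
    """Eq. (7): phi_j(z) = sum_i c_ij K_j(z, X_i), here with one shared kernel
    for all d coordinates. X: (n, D), C: (n, d) with columns summing to zero."""
    k_z = np.array([kernel(z, Xi) for Xi in X])   # empirical kernel map of z
    return C.T @ k_z                              # (d, n) @ (n,) -> phi(z) in R^d

def rho_phi(z, w, X, C, kernel=rbf):
    # the learned metric rho_phi(z, w) = ||phi(z) - phi(w)||_2
    return np.linalg.norm(embed(z, X, C, kernel) - embed(w, X, C, kernel))
```

Out-of-sample points are handled for free here, which is the property that the Mercer-kernel-map alternative of footnote [8] lacks.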
Remark 2. (a) By Eq. (7), $\varphi$ is completely determined by $\{c_{ij} : i \in [n], j \in [d]\}$. Therefore, the problem of learning $\varphi$ reduces to learning the $n \cdot d$ scalars $\{c_{ij} : \sum_{i=1}^n c_{ij} = 0\}$.
(b) $\tilde{\psi}_h$ in Eq. (6) depends on $\varphi$ through $\|\varphi(\cdot) - \varphi(\cdot)\|^2$. Therefore, for any $z, w \in \mathcal{X}$, we have $\|\varphi(z) - \varphi(w)\|^2 = \sum_{m=1}^d \left(c_m^T (k_z^m - k_w^m)\right)^2 = \sum_{m=1}^d \mathrm{tr}\!\left(\Sigma_m (k_z^m - k_w^m)(k_z^m - k_w^m)^T\right)$, where $c_m := (c_{1m}, \ldots, c_{nm})^T$, $k_z^m := (\mathcal{K}_m(z, X_1), \ldots, \mathcal{K}_m(z, X_n))^T$, $\Sigma_m := c_m c_m^T$, $m \in [d]$, and $\mathrm{tr}(\cdot)$ represents the trace.
(c) The regularizer $\|\varphi\|^2_{\mathcal{C}}$ in Eq. (6) is given by $\|\varphi\|^2_{\mathcal{C}} = \sum_{m=1}^d \|\varphi_m\|^2_{\mathcal{H}_m} = \sum_{m=1}^d \sum_{i,j=1}^n c_{im} c_{jm}\, \mathcal{K}_m(X_i, X_j) = \sum_{m=1}^d c_m^T \mathbf{K}_m c_m = \sum_{m=1}^d \mathrm{tr}(\mathbf{K}_m \Sigma_m)$, where $\mathbf{K}_m := (k_{X_1}^m, \ldots, k_{X_n}^m)$.
(d) Since $\varphi$ appears in Eq. (6) only in the form of $\rho^2_\varphi$ and $\|\varphi\|^2_{\mathcal{C}}$, learning $\varphi$ is equivalent to learning $\{\Sigma_m \succeq 0 : \mathrm{rank}(\Sigma_m) = 1,\; \mathbf{1}^T \Sigma_m \mathbf{1} = 0\}_{m=1}^d$.

In the above remark, we have shown that $\tilde{\psi}_h$ and $\|\varphi\|^2_{\mathcal{C}}$ in Eq. (6) depend only on the entries of $d$ kernel matrices (associated with $d$ kernel functions) and $n \cdot d$ scalars, $\{c_{ij}\}$. In addition, we also reduced the representation of $\varphi$ from $\{c_m\}_{m=1}^d$ to $\{\Sigma_m\}_{m=1}^d$. It can be seen that $\rho^2_\varphi$ and $\|\varphi\|^2_{\mathcal{C}}$ are convex quadratic functions of $\{c_m\}_{m=1}^d$, while they are linear functions of $\{\Sigma_m\}_{m=1}^d$. Depending on the nature of $K$, one representation can be more useful than the other.

Corollary 3. Suppose $\mathcal{K}_1 = \ldots = \mathcal{K}_d = \mathcal{K}$. Then, for any $z, w \in \mathcal{X}$, $\rho^2_\varphi(z, w)$ is the Mahalanobis distance between $k_z$ and $k_w$, with $\Sigma := \sum_{m=1}^d \Sigma_m$ as its metric.

Proof. By Remark 2, we have $\rho^2_\varphi(z, w) = \|\varphi(z) - \varphi(w)\|^2 = \sum_{m=1}^d (k_z^m - k_w^m)^T \Sigma_m (k_z^m - k_w^m)$. Since $\mathcal{K}_1 = \ldots = \mathcal{K}_d = \mathcal{K}$, we have $k_z^1 = \ldots = k_z^d = k_z$. Therefore, $\rho^2_\varphi(z, w) = (k_z - k_w)^T \Sigma (k_z - k_w)$, where $\Sigma := \sum_{m=1}^d \Sigma_m$.

The above result reduces the problem of learning $\varphi$ to learning a matrix $\Sigma \succeq 0$ such that $\mathrm{rank}(\Sigma) \le d$ and $\mathbf{1}^T \Sigma \mathbf{1} = 0$. We now study the above result for linear kernels.
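The $\Sigma$-representation above translates directly into code: the learned distance and the regularizer require only empirical kernel maps and the $n \times n$ kernel matrix. A minimal sketch, assuming a single shared kernel ($\mathcal{K}_1 = \ldots = \mathcal{K}_d = \mathcal{K}$) and illustrative names:

```python
import numpy as np

def empirical_kernel_map(z, X, kernel):
    # k_z = (K(z, X_1), ..., K(z, X_n))^T as in Remark 2(b)
    return np.array([kernel(z, Xi) for Xi in X])

def rho_phi_squared(z, w, X, Sigma, kernel):
    """Corollary 3: rho_phi^2(z, w) = (k_z - k_w)^T Sigma (k_z - k_w),
    a Mahalanobis distance between empirical kernel maps."""
    dk = empirical_kernel_map(z, X, kernel) - empirical_kernel_map(w, X, kernel)
    return float(dk @ Sigma @ dk)

def rkhs_regularizer(K_matrix, Sigma):
    # Remark 2(c) with a shared kernel: ||phi||_C^2 = tr(K Sigma)
    return float(np.trace(K_matrix @ Sigma))
```

Both quantities are linear in $\Sigma$, which is what makes the relaxation of §4 a semidefinite program.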
The following corollary shows that using a linear kernel is equivalent to assuming the underlying distance metric in $\mathcal{X}$ to be the Mahalanobis distance.

Corollary 4 (Linear kernel). Let $\mathcal{X} = \mathbb{R}^D$ and $z, w \in \mathcal{X}$. If $\mathcal{K}(z, w) = \langle z, w\rangle_2 = z^T w$, then $\varphi(z) = Lz \in \mathbb{R}^d$ and $\|\varphi(z) - \varphi(w)\|^2_2 = (z - w)^T A (z - w)$ with $A := L^T L$.

Proof. By Remark 2 and Corollary 3, we have $\varphi_m(z) = \sum_{i=1}^n c_{im}\, \mathcal{K}(z, X_i) = \left(\sum_{i=1}^n c_{im} X_i\right)^T z =: \eta_m^T z$. Therefore, $\varphi(z) = Lz$, where $L := (\eta_1, \ldots, \eta_d)^T$. In addition, $\|\varphi(z) - \varphi(w)\|^2_2 = (z - w)^T A (z - w)$ with $A := L^T L$.

In the following section, we use these results to derive an algorithm that jointly learns $\Sigma$ and $h$ by solving Eq. (5).

4. Convex Relaxations & Semidefinite Program

Having addressed the theoretical issue of making assumptions about $\mathcal{C}$ to solve Eq. (5), we return to the computational issue pointed out in §2. The program in Eq. (5) is NP-hard because of the nature of $\{\psi_i\}_{i=1}^n$. This issue can be alleviated by minimizing a convex upper bound of $\psi_i$, instead of $\psi_i$. Some convex upper bounds for the step function $x \mapsto \mathbf{1}_{\{x > 0\}}$ are $\ell(x) = \max(0, 1+x) =: [1+x]_+$, $\ell(x) = \log(1 + e^x)$, etc. Replacing $\psi_i$ by such a surrogate in Eq. (5) results in the following program,

$$\min_{\varphi \in \mathcal{C},\, h > 0}\; \sum_{i=1}^n \ell\!\left(\psi^-_i(\varphi, h) - \psi^+_i(\varphi, h)\right) + \tilde{\lambda}\, \|\varphi\|^2_{\mathcal{C}}, \qquad (8)$$

where $\psi^+_i(\varphi, h) := \sum_{\{j \neq i\, :\, \varepsilon_{ij} = 1\}} K^\varphi_{X_j}(X_i)$ and $\psi^-_i(\varphi, h) := \sum_{\{j\, :\, \varepsilon_{ij} = -1\}} K^\varphi_{X_j}(X_i)$. Eq. (8) can still be computationally hard to solve, depending on the choice of the smoothing kernel, $K$. Even if we choose $K$ such that $\psi^+_i$ and $\psi^-_i$ are jointly convex in $\varphi$ and $h$ for some representation of $\varphi$ (see Remark 2), Eq. (8) is still non-convex, as the argument of $\ell$ is a difference of two convex functions.[9] In addition, if $\ell(x) = [1+x]_+$, then Eq. (8) is a d.c. (difference of convex functions) program (Horst & Thoai, 1999), which is NP-hard to solve. So, even for the nicest of cases, one has to resort to local optimization methods or computationally intensive global optimization methods. Nevertheless, if one does not worry about this disadvantage, then solving Eq. (8) yields $\varphi$ (in terms of $\{c_m\}_{m=1}^d$ or $\{\Sigma_m\}_{m=1}^d$, depending on the chosen representation) and $h$ that can be used in Eq. (2) to classify unseen data. However, in the following, we show that Eq. (8) can be turned into a convex program for the naïve kernel. As mentioned in §1, this choice of kernel leads to a classification rule that is similar in principle to the $k$-NN rule.

[9] For example, let $K$ be a Gaussian kernel, $K^\varphi_y(x) = \exp(-\rho^2_\varphi(x, y)/h)$. Using the $\{\Sigma_m\}_{m=1}^d$ representation for $\varphi$, $\rho^2_\varphi(x, y)$ is linear in $\{\Sigma_m\}_{m=1}^d$ and therefore, $K^\varphi_y(x)$ is convex in $\{\Sigma_m\}_{m=1}^d$. Here, we assume $h$ to be fixed. This means $\psi^+_i$ and $\psi^-_i$ are convex in $\varphi$.

4.1. Naïve kernel: Semidefinite relaxation

The naïve kernel, $K^\varphi_{x_0}(x) = \mathbf{1}_{\{\rho_\varphi(x, x_0) \le h\}}$, indicates that the points $\varphi(x)$ that lie within a ball of radius $h$ centered at $\varphi(x_0)$ have a weighting factor of 1, while the remaining points have zero weight. Using this in Eq. (8), we have $\psi^-_i(\varphi, h) - \psi^+_i(\varphi, h) = \sum_{\{j : \varepsilon_{ij} = -1\}} \mathbf{1}_{\{\rho_\varphi(X_i, X_j) \le h\}} - \sum_{\{j : \varepsilon_{ij} = 1\}} \mathbf{1}_{\{\rho_\varphi(X_i, X_j) \le h\}} + 1$, which represents the difference between the number of points with a label different from $Y_i$ that lie within the ball of radius $h$ centered at $\varphi(X_i)$ and the number of points with the same label as $X_i$ (excluding $X_i$) that lie within the same ball. If this difference is positive, then the classification rule in Eq. (2) makes an error in classifying $X_i$. Therefore, $\varphi$ and $h$ should be chosen such that this misclassification rate is minimized. Suppose that $\{\varphi(X_i)\}_{i=1}^n$ is given. Then, $h$ determines the misclassification rate like $k$ in $k$-NN. It can be seen that the kernel classification rule and the $k$-NN rule are similar when $K$ is a naïve kernel. In the case of $k$-NN, the number of nearest neighbors is fixed for every point, whereas with the kernel rule, it varies for every point. On the other hand, the radius of the ball containing the nearest neighbors of a point varies with every point in the $k$-NN setting, while it is the same for every point in the kernel rule.

$\psi^-_i(\varphi, h) - \psi^+_i(\varphi, h)$ can be further reduced to a more amenable form by the following algebra. Using $\sum_{\{j : \varepsilon_{ij} = 1\}} \mathbf{1}_{\{\rho_\varphi(X_i, X_j) \le h\}} = \sum_{j=1}^n \mathbf{1}_{\{\varepsilon_{ij} = 1\}} - \sum_{\{j : \varepsilon_{ij} = 1\}} \mathbf{1}_{\{\rho_\varphi(X_i, X_j) > h\}}$, we get $\psi^-_i(\varphi, h) - \psi^+_i(\varphi, h) = 1 - n^+_i + \sum_{j=1}^n \mathbf{1}_{\{\varepsilon_{ij}\, \rho^2_\varphi(X_i, X_j) > \varepsilon_{ij} \tilde{h}\}}$, where $n^+_i := \sum_{j=1}^n \mathbf{1}_{\{\varepsilon_{ij} = 1\}}$ and $\tilde{h} := h^2$.
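As a sanity check on this algebra, the sketch below computes $1 - n_i^+ + \sum_j \mathbf{1}_{\{\varepsilon_{ij}\rho^2_\varphi(X_i,X_j) > \varepsilon_{ij}\tilde{h}\}}$ for every training point from a matrix of squared learned distances; a positive entry means Eq. (2) misclassifies $X_i$. The function name and the `D2` input are our own conventions.

```python
import numpy as np

def naive_kernel_margins(D2, Y, h):
    """For each i, return 1 - n_i^+ + sum_j 1{eps_ij * rho_phi^2(X_i, X_j) > eps_ij * h^2},
    which (up to the measure-zero set rho_phi = h) equals
    psi_i^-(phi, h) - psi_i^+(phi, h) for the naive kernel.
    D2: (n, n) array of squared learned distances, Y: (n,) array of labels."""
    eps = np.where(Y[:, None] == Y[None, :], 1.0, -1.0)   # eps_ij = 2*delta_{Y_i,Y_j} - 1
    h_tilde = h ** 2
    n_plus = (eps == 1.0).sum(axis=1)                     # n_i^+ (the j = i term is included)
    active = (eps * D2 > eps * h_tilde)
    return 1.0 - n_plus + active.sum(axis=1)
```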
Note that we have neglected the set $\{j : \varepsilon_{ij} = -1;\ \rho_\varphi(X_i, X_j) = h\}$ in the above calculation, for simplicity. Using $\ell(x) = [1+x]_+$, the first half of the objective function in Eq. (8) reduces to $\sum_{i=1}^n \left[1 - n^+_i + \sum_{j=1}^n \mathbf{1}_{\{\varepsilon_{ij}\, \rho^2_\varphi(X_i, X_j) > \varepsilon_{ij} \tilde{h}\}}\right]_+$. Applying the convex relaxation one more time, to the step function, we get

$$\sum_{i=1}^n \left[1 - n^+_i + \sum_{j=1}^n \left[1 + \varepsilon_{ij}\, \rho^2_\varphi(X_i, X_j) - \varepsilon_{ij} \tilde{h}\right]_+\right]_+$$

as an upper bound on the first half of the objective function in Eq. (8). Since $\rho^2_\varphi$ is a quadratic function of $\{c_m\}_{m=1}^d$, it can be shown that representing $\rho^2_\varphi$ in terms of $\{c_m\}_{m=1}^d$ results in a d.c. program, whereas its representation in terms of $\{\Sigma_m\}_{m=1}^d$ results in a semidefinite program (SDP) (except for the rank constraints), since $\rho^2_\varphi$ is linear in $\{\Sigma_m\}_{m=1}^d$. Assuming for simplicity that $\mathcal{K}_1 = \ldots = \mathcal{K}_d = \mathcal{K}$ and neglecting the constraint $\mathrm{rank}(\Sigma) \le d$, we obtain the following SDP,

$$\begin{array}{rl} \min\limits_{\Sigma, \tilde{h}} & \sum\limits_{i=1}^n \left[1 - n^+_i + \sum\limits_{j=1}^n \left[1 + \varepsilon_{ij}\, \mathrm{tr}(M_{ij}\Sigma) - \varepsilon_{ij} \tilde{h}\right]_+\right]_+ + \tilde{\lambda}\, \mathrm{tr}(\mathbf{K}\Sigma) \\[4pt] \text{s.t.} & \Sigma \succeq 0,\quad \mathbf{1}^T \Sigma \mathbf{1} = 0,\quad \tilde{h} > 0, \end{array} \qquad (9)$$

where $M_{ij} := (k_{X_i} - k_{X_j})(k_{X_i} - k_{X_j})^T$. For notational details, refer to Remark 2 and Corollary 3. Since one does not usually know the optimal embedding dimension, $d$, the $\Sigma$ representation is advantageous, as it is independent of $d$ (because we neglected the rank constraint) and depends only on $n$. On the other hand, it is a disadvantage in that the algorithm does not scale well to large datasets.

Although the program in Eq. (9) is convex, solving it by general purpose solvers that use interior point methods scales as $O(n^6)$, which is prohibitive. Instead, following the ideas of Weinberger et al. (2006), we used a first-order gradient method (which scales as $O(n^2)$ per iteration) and an alternating projections method (which scales as $O(n^3)$ per iteration). At each iteration, we take a small step in the direction of the negative gradient of the objective function, followed by a projection onto the set $\mathcal{N} = \{\Sigma : \Sigma \succeq 0,\ \mathbf{1}^T \Sigma \mathbf{1} = 0\}$ and $\{\tilde{h} > 0\}$. The projection onto $\mathcal{N}$ is performed by an alternating projections method, which involves projecting a symmetric matrix alternately between the convex sets $\mathcal{A} = \{\Sigma : \Sigma \succeq 0\}$ and $\mathcal{B} = \{\Sigma : \mathbf{1}^T \Sigma \mathbf{1} = 0\}$. Since $\mathcal{A} \cap \mathcal{B} \neq \emptyset$, this alternating projections method is guaranteed to find a point in $\mathcal{A} \cap \mathcal{B}$. Given any $A_0 \in \mathcal{A}$, the alternating projections algorithm computes $B_m = P_{\mathcal{B}}(A_m) \in \mathcal{B}$, $A_{m+1} = P_{\mathcal{A}}(B_m) \in \mathcal{A}$, $m = 0, 1, 2, \ldots$, where $P_{\mathcal{A}}$ and $P_{\mathcal{B}}$ are the projections onto $\mathcal{A}$ and $\mathcal{B}$, respectively. In summary, the update rule can be given as $B_m = A_m - \frac{\mathbf{1}^T A_m \mathbf{1}}{n^2}\, \mathbf{1}\mathbf{1}^T$ and $A_{m+1} = \sum_{i=1}^n [\lambda_i]_+\, u_i u_i^T$, where $\{u_i\}_{i=1}^n$ and $\{\lambda_i\}_{i=1}^n$ are the eigenvectors and eigenvalues of $B_m$.[10] A pseudocode of the gradient projection algorithm to solve Eq. (9) is shown in Algorithm 1.

[10] Given $A_m \in \mathcal{A}$, $B_m$ is obtained by solving $\min\{\|B_m - A_m\|^2_F : \mathbf{1}^T B_m \mathbf{1} = 0\}$. Similarly, for a given $B_m \in \mathcal{B}$, $A_{m+1}$ is obtained by solving $\min\{\|A_{m+1} - B_m\|^2_F : A_{m+1} \succeq 0\}$.

Algorithm 1 Gradient Projection Algorithm
Require: $\{M_{ij}\}_{i,j=1}^n$, $\mathbf{K}$, $\{\varepsilon_{ij}\}_{i,j=1}^n$, $\{n^+_i\}_{i=1}^n$, $\tilde{\lambda} > 0$, $\tau > 0$ and step sizes $\{\alpha_t, \beta_t\} > 0$ (see Eq. (9))
1: Set $t = 0$. Choose $\Sigma_0 \in \mathcal{A}$ and $\tilde{h}_0 > 0$.
2: repeat
3: $\quad \mathcal{A}_t = \left\{i : 1 + \sum_{j=1}^n \left[1 + \varepsilon_{ij}\, \mathrm{tr}(M_{ij}\Sigma_t) - \varepsilon_{ij}\tilde{h}_t\right]_+ \ge n^+_i\right\} \times \{j : j \in [n]\}$
4: $\quad \mathcal{B}_t = \left\{(i, j) : 1 + \varepsilon_{ij}\, \mathrm{tr}(M_{ij}\Sigma_t) > \varepsilon_{ij}\tilde{h}_t\right\}$
5: $\quad \mathcal{N}_t = \mathcal{A}_t \cap \mathcal{B}_t$
6: $\quad \Sigma_{t+1} = P_{\mathcal{N}}\!\left(\Sigma_t - \alpha_t \sum_{(i,j) \in \mathcal{N}_t} \varepsilon_{ij} M_{ij} - \alpha_t \tilde{\lambda}\, \mathbf{K}\right)$
7: $\quad \tilde{h}_{t+1} = \max\!\left(\tau,\ \tilde{h}_t + \beta_t \sum_{(i,j) \in \mathcal{N}_t} \varepsilon_{ij}\right)$
8: $\quad t = t + 1$
9: until convergence
10: return $\Sigma_t$, $\tilde{h}_t$

Having computed the $\Sigma$ and $\tilde{h}$ that minimize Eq. (9), a test point $x \in \mathcal{X}$ can be classified by using the kernel rule in Eq. (2), where $K^\varphi_{X_i}(x) = \mathbf{1}_{\{\rho_\varphi(x, X_i) \le h\}}$ with $\rho^2_\varphi(x, X_i) = (k_x - k_{X_i})^T \Sigma\, (k_x - k_{X_i})$. Therefore, $\Sigma$ and $\tilde{h}$ completely specify the classification rule.
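The projection $P_{\mathcal{N}}$ in Algorithm 1 can be implemented directly from the closed-form updates quoted above; a minimal sketch, with the tolerance, iteration cap, and names being our own choices:

```python
import numpy as np

def project_onto_N(Sigma, max_iter=100, tol=1e-8):
    """Alternating projections onto A = {S : S positive semidefinite} and
    B = {S : 1^T S 1 = 0}, using the closed-form updates
    B_m = A_m - (1^T A_m 1 / n^2) 11^T  and  A_{m+1} = sum_i [lambda_i]_+ u_i u_i^T."""
    n = Sigma.shape[0]
    one = np.ones((n, 1))
    A = 0.5 * (Sigma + Sigma.T)                    # symmetrize the input
    for _ in range(max_iter):
        s = (one.T @ A @ one).item()
        B = A - s / n**2 * (one @ one.T)           # project onto B
        lam, U = np.linalg.eigh(B)                 # eigendecomposition of B_m
        A = (U * np.maximum(lam, 0.0)) @ U.T       # project onto A (clip negative eigenvalues)
        if abs((one.T @ A @ one).item()) < tol:    # constraint 1^T A 1 = 0 approximately met
            break
    return A
```

In Algorithm 1 this projection is applied to the matrix iterate after every gradient step, while $\tilde{h}$ only needs the simple $\max(\tau, \cdot)$ clipping to stay positive.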
5. Experiments & Results

In this section, we compare the performance of our method (referred to as the kernel classification rule (KCR)) to several metric learning algorithms on a supervised classification task, in terms of the training and test errors. The training phase of KCR involves solving the SDP in Eq. (9) to learn the optimal $\Sigma$ and $\tilde{h}$ from the data, which are then used in Eq. (2) to classify the test data. Note that the SDP in Eq. (9) is obtained by using the naïve kernel for $K$ in Eq. (2). For other smoothing kernels, one has to solve the program in Eq. (8) to learn the optimal $\varphi$ and $h$. Therefore, the results reported in this section under KCR refer to those obtained by using the naïve kernel.

The algorithms used in the comparative evaluation are:
• The $k$-NN rule with the Euclidean distance metric.
• The LMNN (large margin nearest neighbor) method proposed by Weinberger et al. (2006), which learns a Mahalanobis distance metric by minimizing the distance between predefined target neighbors and separating them by a large margin from the examples with non-matching labels.
• The Kernel-NN rule, which uses the empirical kernel maps[11] as training data and performs $k$-NN classification on this data using the Euclidean distance metric.
• The KMLCC (kernel version of metric learning by collapsing classes) method proposed by Globerson and Roweis (2006), which learns a Mahalanobis distance metric in the kernel space by trying to collapse all examples in the same class to a single point, while pushing examples in other classes infinitely far away.
• The KLMCA (kernel version of large margin component analysis) method proposed by Torresani and Lee (2007), which is a non-convex, kernelized version of LMNN.

Four benchmark datasets from the UCI machine learning repository were considered for experimentation. Since the proposed method and KMLCC solve an SDP that scales poorly with $n$, we did not consider large problem sizes for experimentation.[12]

Table 1. $k$-NN classification accuracy on UCI datasets. The algorithms compared are $k$-NN (with the Euclidean distance metric), LMNN (large margin NN by Weinberger et al. (2006)), Kernel-NN (see footnote [11]), KMLCC (kernel version of metric learning by collapsing classes by Globerson and Roweis (2006)), KLMCA (kernel version of LMNN by Torresani and Lee (2007)), and KCR (proposed method). Mean ($\mu$) and standard deviation ($\sigma$) of the train and test (generalization) errors (in %) are reported as $\mu \pm \sigma$.

Dataset (n, D, L)         Error   k-NN           LMNN           Kernel-NN      KMLCC          KLMCA          KCR
Balance (625, 4, 3)       Train   17.81 ± 1.86   11.40 ± 2.89   10.73 ± 1.32   10.27 ± 2.01    9.93 ± 1.86   10.47 ± 2.11
                          Test    18.18 ± 1.88   11.49 ± 2.57   17.46 ± 2.13    9.75 ± 1.92   10.54 ± 1.46    8.94 ± 3.12
Ionosphere (351, 34, 2)   Train   15.89 ± 1.43    3.50 ± 1.18    2.84 ± 0.80    7.05 ± 1.31    3.98 ± 1.94    2.73 ± 1.03
                          Test    15.95 ± 3.03   12.14 ± 2.92    5.81 ± 2.25    6.54 ± 2.18    5.19 ± 2.09    5.71 ± 2.60
Iris (150, 4, 3)          Train    4.30 ± 1.55    3.25 ± 1.15    3.60 ± 1.33    3.61 ± 1.59    3.27 ± 1.63    2.29 ± 1.62
                          Test     4.02 ± 2.22    4.11 ± 2.26    4.83 ± 2.47    3.89 ± 1.55    3.74 ± 2.21    3.27 ± 1.87
Wine (178, 13, 3)         Train    5.89 ± 1.35    0.90 ± 2.80    4.95 ± 1.35    4.48 ± 1.21    2.18 ± 2.58    1.01 ± 0.73
                          Test     6.22 ± 2.70    3.41 ± 2.10    7.37 ± 2.82    4.84 ± 2.47    5.17 ± 1.91    2.13 ± 1.24
The results shown in Table 1 are the average performance over 20 random splits of the data, with 50% for training, 20% for validation and 30% for testing. The Gaussian kernel, $\mathcal{K}(x, y) = e^{-\sigma\|x - y\|^2_2}$, was used for the kernel based methods, i.e., Kernel-NN, KMLCC, KLMCA and KCR. The parameters $\sigma$ and $\tilde{\lambda}$ (only $\sigma$ for Kernel-NN) were set with cross-validation by searching over $\{2^i\}_{i=-4}^{4}$ and $\{10^i\}_{i=-3}^{3}$. While testing, KCR uses the rule in Eq. (2), whereas the $k$-NN rule was used for all the other methods.[13] It is clear from Table 1 that KCR almost always performs as well as or significantly better than all other methods. However, on the timing front (which we do not report here), KLMCA, which solves a non-convex program for $n \cdot d$ variables, is much faster than KMLCC and KCR, which solve SDPs involving $n^2$ variables. The role of empirical kernel maps is not clear, as there is no consistent behavior between the performance accuracy achieved with $k$-NN and Kernel-NN.

[11] Kernel-NN is computed as follows. For each training point, $X_i$, the empirical map w.r.t. $\{X_j\}_{j=1}^n$, defined as $k_{X_i} := (\mathcal{K}(X_1, X_i), \ldots, \mathcal{K}(X_n, X_i))^T$, is computed. Then, $\{k_{X_i}\}_{i=1}^n$ is considered to be the training set for the NN classification of empirical maps of the test data using the Euclidean distance metric.

[12] To extend KCR to large datasets, one can represent $\varphi$ in terms of $\{c_m\}$, which leads to a non-convex program as in KLMCA.

[13] Although KCR, LMNN, and KMLCC solve SDPs to compute the optimal distance metric, KCR has fewer parameters to be tuned than these other methods. LMNN requires cross-validation over $k$ (in $k$-NN) and the regularization parameter, along with knowledge about the target neighbors. KMLCC requires cross-validation over $k$, the kernel parameter, and the regularization parameter. In KCR, we only need to cross-validate over $\sigma$ and $\tilde{\lambda}$. In addition, if $\mathcal{X} = \mathbb{R}^D$ and $\mathcal{K}$ is a linear kernel, then KCR only requires cross-validation over $\tilde{\lambda}$ while computing the optimal Mahalanobis distance metric.

Figure 1. Dataset visualization results of $k$-NN, KMLCC, KLMCA and KCR applied to a two-dimensional synthetic dataset of three classes, with each class being modeled as a Gaussian. (a, a') denote two independent random draws from this distribution, whereas (b-d, b'-d') represent the two-dimensional projections of these data using the metric learned from KMLCC, KLMCA and KCR. The points in bold represent the misclassified points. It is interesting to note that KLMCA and KCR generate completely different embeddings but have similar error rates. See §5 for more details. [Panel errors: (a) $k$-NN 6.67, (b) KMLCC 3.33, (c) KLMCA 2.67, (d) KCR 2.67; (a') $k$-NN 10.7, (b') KMLCC 10.0, (c') KLMCA 8.0, (d') KCR 8.0.]

KMLCC, KLMCA, and KCR learn the Mahalanobis distance metric in $\mathbb{R}^n$, which makes it difficult to visualize the class separability achieved by these methods. To visually appreciate their behavior, we generated a synthetic two-dimensional dataset of 3 classes, with each class being sampled from a Gaussian distribution with a different mean and covariance. Figure 1(a) shows this dataset, where the three classes are shown in different colors. Using this as training data, distance metrics were learned using KMLCC, KLMCA and KCR. If $\Sigma$ is the learned metric, then the two-dimensional projection of $x \in \mathbb{R}^n$ is obtained as $\tilde{x} = Lx$, where $L = (\sqrt{\lambda_1}\, u_1, \sqrt{\lambda_2}\, u_2)^T$, with $\Sigma = \sum_{i=1}^n \lambda_i u_i u_i^T$ and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. Figures 1(b-d) show the two-dimensional projections of the training set using KMLCC, KLMCA and KCR.
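For reference, the two-dimensional projection just described follows directly from the eigendecomposition of the learned $\Sigma$. A minimal sketch, assuming the Gaussian kernel of this section and our own variable names; the rows of `kernel_maps` are the empirical kernel maps $k_x$ of the points being plotted (cf. footnote [11]):

```python
import numpy as np

def gaussian_gram(X, Z, sigma=1.0):
    # rows are the empirical kernel maps of the points in X w.r.t. the training points Z
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sigma * sq)

def project_2d(Sigma, kernel_maps):
    """x_tilde = L x with L = (sqrt(lambda_1) u_1, sqrt(lambda_2) u_2)^T, where
    Sigma = sum_i lambda_i u_i u_i^T and lambda_1 >= lambda_2 >= ...
    kernel_maps: (m, n) array whose rows are the maps k_x of the points to plot."""
    lam, U = np.linalg.eigh(Sigma)                 # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:2]                # top two eigenpairs
    L = np.sqrt(np.maximum(lam[idx], 0.0))[:, None] * U[:, idx].T   # (2, n)
    return kernel_maps @ L.T                       # (m, 2) coordinates for plotting
```

The same `gaussian_gram` output also serves as the training data for the Kernel-NN baseline.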
The projected points were classified using $k$-NN if $\Sigma$ was obtained from KMLCC/KLMCA, and using Eq. (2) if $\Sigma$ was obtained from KCR. The misclassified points are shown in bold. Since the classification is done on the training points, one would expect a better error rate and separability between the classes. To understand the generalization performance, a new data sample, shown in Figure 1(a'), was generated from the same distribution as the training set. The learned $\Sigma$ was used to obtain the two-dimensional projections of the new data sample, which are shown in Figure 1(b'-d'). It is interesting to note that KLMCA and KCR generate completely different projections but have similar error rates.

6. Related Work

We briefly review some relevant work and point out similarities and differences with our method. In our work, we have addressed the problem of extending kernel classification rules to arbitrary metric spaces by learning an embedding function that embeds data into a Euclidean space while minimizing an upper bound on the resubstitution estimate of the error probability. The method that is closest in spirit (kernel rules) to ours is the recent work by Weinberger and Tesauro (2007), who learn a Mahalanobis distance metric for kernel regression estimates by minimizing the leave-one-out quadratic regression error of the training set. With the problem being non-convex, they resort to gradient descent techniques. Except for this work, we are not aware of any method related to kernel rules in the context of distance metric learning or learning the bandwidth of the kernel.

There has been a lot of work in the area of distance metric learning for $k$-NN classification, some of which is briefly discussed in §5. The central idea in all these methods is that similarly labeled examples should cluster together and be far away from differently labeled examples. Shalev-Shwartz et al. (2004) proposed an online algorithm for learning a Mahalanobis distance metric with the constraint that any training example is closer to all the examples that share its label than to any other example of a different label. In addition, examples from different classes are constrained to be separated by a large margin. Though Shalev-Shwartz et al. (2004) do not solve this as a batch optimization problem, it can be shown that it reduces to an SDP (after rank relaxation) and is in fact the same as Eq. (9), except for the outer $[\cdot]_+$ function and the constraint $\mathbf{1}^T \Sigma \mathbf{1} = 0$.

7. Concluding Remarks

In this paper, two questions related to the smoothing kernel based classification rule have been addressed. One is related to learning the bandwidth of the smoothing kernel, while the other is to extending the classification rule to arbitrary domains. We jointly addressed them by learning a function in a reproducing kernel Hilbert space while minimizing an upper bound on the resubstitution estimate of the error probability of the kernel rule. For a particular choice
Because of this relation, the kernel rule was compared to k -NN and its state-of-the-art distance metric learning algorithms on a supervised classification task and was shown to have comparable performance to these methods. In the future, we would like to develop some theoretical guarantees for the proposed method along with extending it to large-scale applications. from the National Science Foundation (grant DMS-MSPA 0625409), the Fair Isaac Corporation and the University of California MICRO program. References Devroye, L., Gyorfi, L., & Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag. Devroye, L., & Krzyzak, A. (1989). An equivalence theorem for L1 convergence of the kernel regression estimate. Journal of Statistical Planning and Inference, 23, 71­82. Evgeniou, T., Pontil, M., & Poggio, T. (2000). Regularization networks and support vector machines. Advances in Computational Mathematics, 13, 1­50. Globerson, A., & Roweis, S. (2006). Metric learning by collapsing classes. Advances in Neural Information Processing Systems 18 (pp. 451­458). Cambridge, MA: MIT Press. Goldberger, J., Roweis, S., Hinton, G., & Salakhutdinov, R. (2005). Neighbourhood components analysis. Advances in Neural Information Processing Systems 17 (pp. 513­520). Cambridge, MA: MIT Press. Horst, R., & Thoai, N. V. (1999). D.c. programming: Overview. Journal of Optimization Theory and Applications, 103, 1­43. š Scholkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. Proc. of the Annual Conference on Computational Learning Theory (pp. 416­426). š Scholkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press. Shalev-Shwartz, S., Singer, Y., & Ng, A. Y. (2004). Online and batch learning of pseudo-metrics. Proc. of 21st International Conference on Machine Learning. Banff, Canada. Snapp, R. R., & Venkatesh, S. S. (1998). Asymptotic expansions of the k nearest neighbor risk. The Annals of Statistics, 26, 850­878. Torresani, L., & Lee, K. (2007). Large margin component analysis. Advances in Neural Information Processing Systems 19 (pp. 1385­1392). Cambridge, MA: MIT Press. von Luxburg, U., & Bousquet, O. (2004). Distance-based classification with Lipschitz functions. Journal for Machine Learning Research, 5, 669­695. Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2006). Distance metric learning for large margin nearest neighbor classification. Advances in Neural Information Processing Systems 18 (pp. 1473­1480). Cambridge, MA: MIT Press. Weinberger, K. Q., & Tesauro, G. (2007). Metric learning for kernel regression. Proc. of the 11th International Conference on Artificial Intelligence and Statistics. Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (2003). Distance metric learning with application to clustering with sideinformation. Advances in Neural Information Processing Systems 15 (pp. 505­512). Cambridge, MA: MIT Press. Appendix A. Proof of Theorem 1 We need the following result to prove Theorem 1. Lemma 5. Let H = {f : X R} be an RKHS with K : 2 X Ś X R as its reproducing kernel. Let : Rn R be an arbitrary function. Then each minimizer f H of { + f (xi ) - f (xj )}nj =1 fH 2 (10) i, n admits a representation of nhe form f = i=1 ci K(., xi ), t where {ci }n 1 R and i=1 ci = 0. i= Proof. The proof follows the generalized representer theš orem (Scholkopf et al., 2001, Theorem 4). Since f H, f (x) = f, K(., x) H. Therefore, the arguments of in Eq. (10) are of the form { f, K(., xi ) - K(., xj ) H}nj =1 . 
We decompose $f = f_{\parallel} + f_{\perp}$ so that $f_{\parallel} \in \mathrm{span}\{\mathcal{K}(\cdot, x_i) - \mathcal{K}(\cdot, x_j)\}_{i,j=1}^n$ and $\langle f_{\perp}, \mathcal{K}(\cdot, x_i) - \mathcal{K}(\cdot, x_j)\rangle_{\mathcal{H}} = 0$, $\forall\, i, j \in [n]$. So, $f = \sum_{i,j=1}^n \alpha_{ij} \left(\mathcal{K}(\cdot, x_i) - \mathcal{K}(\cdot, x_j)\right) + f_{\perp}$, where $\{\alpha_{ij}\}_{i,j=1}^n \subset \mathbb{R}$. Therefore, $f(x_i) - f(x_j) = \langle f, \mathcal{K}(\cdot, x_i) - \mathcal{K}(\cdot, x_j)\rangle_{\mathcal{H}} = \langle f_{\parallel}, \mathcal{K}(\cdot, x_i) - \mathcal{K}(\cdot, x_j)\rangle_{\mathcal{H}} = \sum_{p,m=1}^n \alpha_{pm} \left(\mathcal{K}(x_i, x_p) - \mathcal{K}(x_j, x_p) - \mathcal{K}(x_i, x_m) + \mathcal{K}(x_j, x_m)\right)$. Now, consider the penalty functional, $\langle f, f\rangle_{\mathcal{H}}$. For all $f_{\perp}$, $\langle f, f\rangle_{\mathcal{H}} = \|f_{\parallel}\|^2_{\mathcal{H}} + \|f_{\perp}\|^2_{\mathcal{H}} \ge \left\|\sum_{i,j=1}^n \alpha_{ij} \left(\mathcal{K}(\cdot, x_i) - \mathcal{K}(\cdot, x_j)\right)\right\|^2_{\mathcal{H}}$. Thus, for any fixed $\alpha_{ij} \in \mathbb{R}$, Eq. (10) is minimized for $f_{\perp} = 0$. Therefore, the minimizer of Eq. (10) has the form $f = \sum_{i,j=1}^n \alpha_{ij} \left(\mathcal{K}(\cdot, x_i) - \mathcal{K}(\cdot, x_j)\right)$, which is parameterized by the $n^2$ parameters $\{\alpha_{ij}\}_{i,j=1}^n$. By simple algebra, $f$ reduces to $f = \sum_{i=1}^n c_i\, \mathcal{K}(\cdot, x_i)$, where $c_i = \sum_{j=1}^n (\alpha_{ij} - \alpha_{ji})$ satisfies $\sum_{i=1}^n c_i = 0$.

We are now ready to prove Theorem 1.

Proof of Theorem 1. The arguments of $\tilde{\psi}_h$ in Eq. (6) are of the form $\|\varphi(X_i) - \varphi(X_j)\|^2$. Consider $\|\varphi(X_i) - \varphi(X_j)\|^2 = \sum_{m=1}^d \left(\varphi_m(X_i) - \varphi_m(X_j)\right)^2 = \sum_{m=1}^d \left(\langle \varphi_m, \mathcal{K}_m(\cdot, X_i) - \mathcal{K}_m(\cdot, X_j)\rangle_{\mathcal{H}_m}\right)^2$. The penalizer in Eq. (6) reduces to $\|\varphi\|^2_{\mathcal{C}} = \sum_{m=1}^d \|\varphi_m\|^2_{\mathcal{H}_m}$. Therefore, applying Lemma 5 to each $\varphi_m$, $m \in [d]$, proves the result.