Learning Dissimilarities by Ranking: From SDP to QP

Hua Ouyang, College of Computing, Georgia Institute of Technology, houyang@cc.gatech.edu
Alex Gray, College of Computing, Georgia Institute of Technology, agray@cc.gatech.edu

Abstract

We consider the problem of learning dissimilarities between points via formulations which preserve a specified ordering between points rather than the numerical values of the dissimilarities. Dissimilarity ranking (d-ranking) learns from instances like "A is more similar to B than C is to D" or "The distance between E and F is larger than that between G and H". Three formulations of d-ranking problems are presented, and new algorithms are given for two of them, one based on semidefinite programming (SDP) and one on quadratic programming (QP). Among the novel capabilities of these approaches are out-of-sample prediction and scalability to large problems.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

1. Introduction

Ranking, sometimes referred to as ordinal regression, is a statistical learning problem which has gained much attention recently (Cohen et al., 1998; Herbrich et al., 1999; Joachims, 2002). This problem learns from relative comparisons like "A ranks lower than B" or "C ranks higher than D". The goal is to learn an explicit or implicit function which gives ranks over a sampling space $\mathcal{X}$. In most of these tasks, the sampled instances to be ranked are vector-valued data in $\mathbb{R}^D$, while the ranks are real numbers which can be either discrete or continuous. If the problem is to learn a real-valued ranking function, it can be stated as: given a set $S$ of pairs $(x_i, x_j) \in S$ (each indicating that the rank of $x_i$ is lower than that of $x_j$), learn a real-valued $f : \mathcal{X} \to \mathbb{R}$ that satisfies $f(x_m) < f(x_n)$ whenever the rank of $x_m$ is lower than that of $x_n$.

In this paper we investigate a special ranking problem: dissimilarity ranking (d-ranking). Unlike ranking, this problem learns from instances like "A is more similar to B than C is to D" or "The distance between E and F is larger than that between G and H". Note that the dissimilarities here are not necessarily distances. Unlike the real vectors ranked in conventional ranking problems, the objects to be ranked here are dissimilarities of pairs of data vectors. The problem can be stated as: learn an explicit or implicit function which gives ranks over a space of dissimilarities $d(\mathcal{X}, \mathcal{X}) \to \mathbb{R}$. Depending on the requirements of the application, this learning problem can have various formulations; we present some of them in Section 2.

D-ranking can be regarded as a special instance of dissimilarity learning (or metric learning). Different dissimilarity learning methods have different goals. We highlight some previous work below.

· In metric learning methods (Hastie & Tibshirani, 1996; Xing et al., 2002), the purpose of learning a proper Mahalanobis distance is to achieve better class/cluster separations.

· In kernel learning methods (Lanckriet et al., 2004; Micchelli & Pontil, 2005), learning a proper kernel is equivalent to learning a good inner-product function, which induces a dissimilarity in the input space. The purpose is to maximize the performance of a kernel-based learning machine.

· Multidimensional scaling (MDS) (Borg & Groenen, 2005) and Isomap (Tenenbaum et al., 2000) can also be regarded as learning an implicit function $f : \mathbb{R}^D \to \mathbb{R}^L$. The purpose of learning an embedding is to preserve distances in a low-dimensional Euclidean space $\mathbb{R}^L$.

In our d-ranking problems, the purpose of learning a proper dissimilarity is to preserve the ranks of dissimilarities, not their absolute values (which is the case in MDS and Isomap). For example, if we know that "The distance between A and B is smaller than that between C and D", the problem can be formulated as: find a dissimilarity function $d$ such that $d(A, B) < d(C, D)$.
Unlike conventional learning and ranking problems, d-ranking has not received intensive study in previous research. One of the most important pieces of related work is non-metric multidimensional scaling (NMDS) (Borg & Groenen, 2005). Given a symmetric proximity (similarity or dissimilarity) matrix $\Delta = [\delta_{mn}]$, NMDS tries to find a low-dimensional embedding space $\mathbb{R}^L$ such that $\forall x_i, x_j, x_k, x_l \in \mathbb{R}^L$, $\|x_i - x_j\|_2^2 < \|x_k - x_l\|_2^2 \Leftrightarrow \delta_{ij} < \delta_{kl}$. NMDS was recently extended to generalized NMDS (GNMDS) (Agarwal, 2007). GNMDS does not need to know the absolute values of the proximities $\delta_{mn}$; it only needs a set $S$ of quadruples $(i, j, k, l) \in S$ which indicate that $\delta_{ij} < \delta_{kl}$. Both NMDS and GNMDS learn an embedding space rather than an explicit ranking function, so they are unable to handle out-of-sample problems. Schultz and Joachims gave a solution to these problems by proposing to learn a distance metric from relative comparisons (Schultz & Joachims, 2003). They choose to learn a Mahalanobis distance which preserves the ranks of distances; since the learned distance functions are parameterized, they can be applied to new samples. Their formulation is solved in a manner similar to the SVM. Nonetheless, the regularization term was not well justified.

Many applications in biology, computer vision, web search, social science, etc. can be put into the framework of d-ranking. Take document classification as an instance. Without adequate domain knowledge, it is hard to accurately determine the quantitative dissimilarity between two documents. However, comparing the dissimilarities among every three or four documents can easily be done, either automatically or manually. Generally speaking, d-ranking is especially useful when quantitative dissimilarities are not reliable.

In Section 2, we propose three formulations of d-ranking problems. Section 3 gives the numerical solution for solving d-ranking by SDP. Section 4 shows how to solve d-ranking by QP. The proposed methods are evaluated in Section 5. Section 6 concludes the paper.

2. Three Formulations of D-Ranking

D-ranking problems can have various formulations depending on the specific requirements or settings of the application. Next we give three formulations.

Formulation 2.1. (F1) Inputs: a set $S$ of ordered quadruples $(i, j, k, l) \in S$, indicating that $d(x_i, x_j) \le d(x_k, x_l)$, where $d(\cdot, \cdot)$ is a fixed but unknown dissimilarity function. Outputs: coordinates of embedded samples $\hat{x}_i, \hat{x}_j, \hat{x}_k, \hat{x}_l \in \mathbb{R}^L$. Criteria: $\forall (i, j, k, l) \in S$, $\|\hat{x}_i - \hat{x}_j\|_2^2 \le \|\hat{x}_k - \hat{x}_l\|_2^2$.

As proposed by Agarwal et al. (Agarwal, 2007), in F1 we neither assume any geometry of the input space, nor assume any form of dissimilarity in it. We do not need to know the coordinates of the input samples; only ordering information is provided. Nonetheless we assume a Euclidean metric in the embedding space, which is often of low dimension (e.g., $L = 2$ or $3$). As shown in Section 3, F1 can be cast as a semidefinite programming (SDP) problem.
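To make the F1 criterion concrete, here is a minimal sketch (ours, not from the paper; the function name `rank_agreement` and the toy data are purely illustrative) that counts how many ordered quadruples a candidate Euclidean embedding preserves:

```python
# Illustration only (not from the paper): given an embedding X (n x L) and
# quadruples (i, j, k, l) asserting d(x_i, x_j) <= d(x_k, x_l), report the
# fraction of rank constraints preserved under the Euclidean metric.
import numpy as np

def rank_agreement(X, quadruples):
    preserved = 0
    for i, j, k, l in quadruples:
        dij = np.sum((X[i] - X[j]) ** 2)   # squared Euclidean distances
        dkl = np.sum((X[k] - X[l]) ** 2)
        preserved += dij <= dkl
    return preserved / len(quadruples)

# Example: a random 2-D embedding of 10 points and a single constraint.
X = np.random.randn(10, 2)
print(rank_agreement(X, [(0, 1, 2, 3)]))
```

This is also the quantity reported in the experiments of Sections 3 and 5 (the fraction of distance ranks preserved by a recovered embedding).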
Formulation 2.2. (F2) Inputs: a set $S$ of ordered quadruples $(i, j, k, l) \in S$, indicating that $d(x_i, x_j) \le d(x_k, x_l)$, where $d(\cdot, \cdot)$ is a fixed but unknown dissimilarity function; the corresponding coordinates in the input Euclidean space $x_i, x_j, x_k, x_l \in \mathbb{R}^D$. Outputs: a dissimilarity function $\hat{d}(\cdot, \cdot) : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$. Criteria: $\forall (i, j, k, l) \in S$, $\hat{d}(x_i, x_j) \le \hat{d}(x_k, x_l)$.

Unlike F1, which learns an embedding space, F2 learns an explicit dissimilarity function $\hat{d}(\cdot, \cdot)$ which preserves the ranks of dissimilarities. We show in Section 4 that F2 can be handled in a manner very similar to support vector machines, where the resulting quadratic programming (QP) problem can be solved efficiently by specialized sequential optimization methods. If in some cases we need a low-dimensional Euclidean embedding of the input samples, we can then use classical multidimensional scaling (MDS) to preserve the learned dissimilarities.

Formulation 2.3. (F3) Inputs: a set $S$ of ordered quadruples $(i, j, k, l) \in S$, indicating that $d(x_i, x_j) \le d(x_k, x_l)$, where $d(\cdot, \cdot)$ is a fixed but unknown dissimilarity function; the corresponding coordinates in the input Euclidean space $x_i, x_j, x_k, x_l \in \mathbb{R}^D$. Outputs: a projection function $f : \mathbb{R}^D \to \mathbb{R}^L$ with $\hat{x}_i, \hat{x}_j, \hat{x}_k, \hat{x}_l \in \mathbb{R}^L$. Criteria: $\forall (i, j, k, l) \in S$, $\|\hat{x}_i - \hat{x}_j\|_2^2 \le \|\hat{x}_k - \hat{x}_l\|_2^2$.

Although we formulate F3 as a function learning problem, we have not yet found an efficient method to solve it. This formulation remains future work.

3. Solving F1 by SDP

F1 was studied by Agarwal et al. (Agarwal, 2007). The authors proposed GNMDS, which can be solved as an SDP, as shown in Eq. (1):

\[
\begin{aligned}
\text{GNMDS:}\quad & \min_{K,\,\xi}\ \sum_{(i,j,k,l)\in S}\xi_{ijkl} + \lambda\,\mathrm{tr}(K) \\
\text{s.t. } & (K_{kk} - 2K_{kl} + K_{ll}) - (K_{ii} - 2K_{ij} + K_{jj}) + \xi_{ijkl} \ge 1, \\
& \textstyle\sum_{a,b} K_{ab} = 0, \quad \xi_{ijkl} \ge 0, \quad K \succeq 0,
\end{aligned}
\tag{1}
\]

for all $(i, j, k, l) \in S$. The main idea of GNMDS is to learn a positive semidefinite Gram matrix $K = X^{\top}X$, which can be eigen-decomposed to recover the embedded samples. The relation between Euclidean distances and the Gram matrix is used:

\[
\|x_i - x_j\|_2^2 = K_{ii} - 2K_{ij} + K_{jj}.
\tag{2}
\]

Nonetheless, the constraints that carry the order information of the dissimilarities are not sufficient to determine a unique $K$, since any rotation, translation or scaling also satisfies these constraints. To reduce these ambiguities, the constraint $\sum_{a,b} K_{ab} = 0$ centers all the embedded samples at the origin. It is preferable in many applications to find a low-dimensional embedding space, e.g., a 2D or 3D Euclidean space, and thus a low-rank $K$ is desired. Unfortunately, minimizing $\mathrm{rank}(K)$ subject to linear inequality constraints is NP-hard (Vandenberghe & Boyd, 1996). The objective is therefore relaxed heuristically to minimizing $\mathrm{tr}(K)$, which is a convex envelope of the rank.
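The two facts used here, Eq. (2) and the eigendecomposition of $K$ that recovers the embedded samples, can be summarized in a short sketch (ours, for illustration; the function names are not from the paper):

```python
# Illustration of Eq. (2) and of recovering an L-dimensional embedding from a
# positive semidefinite Gram matrix K by eigendecomposition (names are ours).
import numpy as np

def sq_dist_from_gram(K, i, j):
    # Eq. (2): ||x_i - x_j||^2 = K_ii - 2 K_ij + K_jj
    return K[i, i] - 2 * K[i, j] + K[j, j]

def embed_from_gram(K, L=2):
    w, V = np.linalg.eigh(K)               # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:L]          # keep the L largest
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Sanity check: for row-sample data X, the Gram matrix X X^T reproduces distances.
X = np.random.randn(5, 2)
K = X @ X.T
assert np.isclose(sq_dist_from_gram(K, 0, 1), np.sum((X[0] - X[1]) ** 2))
```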
Figure 1 shows the result of GNMDS on a toy problem. The inputs are 990 pairwise ranks of distances between 10 European cities; the outputs are the recovered 2D coordinates. It can be observed that the recovered locations of the cities do not correspond to the true locations. In fact only 789 out of the 990 rank pairs are preserved by the learned 2D embedding, i.e., a 20.3% error rate. Figure 2 shows the 10 sorted eigenvalues of the Gram matrix $K$. Although the original space is a 2D Euclidean space, the first 2 eigenvalues account for only 49.4% of the total variance.

Figure 1. D-ranking toy problem: locations of ten European cities. Purple: true locations. Green: locations recovered by GNMDS. All coordinates have been scaled before plotting.

Figure 2. The 10 sorted eigenvalues of the Gram matrix $K$ learned by GNMDS.

There are at least two reasons for the poor performance of GNMDS. Firstly, there is no guarantee on the quality of the solution of the relaxed problem compared with the original problem: there may exist higher-dimensional embeddings which satisfy all the constraints while having smaller traces than lower-dimensional ones. Secondly, due to the introduction of the slack variables $\xi_{ijkl}$ in the inequality constraints, the learned embedding tends to push all the samples to the same point. The first problem can be addressed by introducing better heuristics for the convex envelope of the rank. The second problem can be solved by the following slight modification of GNMDS:

\[
\begin{aligned}
\text{Modified GNMDS:}\quad & \min_{K,\,\xi}\ \sum_{(i,j,k,l)\in S}\xi_{ijkl} + \lambda\,\mathrm{tr}(K) \\
\text{s.t. } & (K_{kk} - 2K_{kl} + K_{ll}) - (K_{ii} - 2K_{ij} + K_{jj}) - \xi_{ijkl} \ge 1, \\
& \textstyle\sum_{a,b} K_{ab} = 0, \quad \xi_{ijkl} \ge 0, \quad K \succeq 0,
\end{aligned}
\tag{3}
\]

for all $(i, j, k, l) \in S$. The modified GNMDS simply changes the sign of the slack variables from $+\xi_{ijkl}$ to $-\xi_{ijkl}$. This simple trick ensures that every difference between the distances of the pairs $(k, l)$ and $(i, j)$ is larger than 1, and thus pulls the embedded samples apart. Figure 3 shows the toy problem solved by the modified GNMDS. The recovered samples are closer to the true locations than those in Figure 1: 850 out of 990 rank pairs are correctly preserved, i.e., a 14.14% error rate, which is 6% lower than GNMDS. Figure 4 shows the 10 eigenvalues of $K$ learned by the modified GNMDS. The first 2 eigenvalues account for 69.8% of the total variance, which is 20% higher than GNMDS.

Figure 3. D-ranking toy problem: locations of ten European cities. Purple: true locations. Green: locations recovered by the modified GNMDS. All coordinates have been scaled before plotting.

Figure 4. The 10 sorted eigenvalues of the Gram matrix $K$ learned by the modified GNMDS.
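As a hedged sketch of how Eqs. (1) and (3) could be posed to an off-the-shelf conic solver, the snippet below uses the CVXPY modelling package; the paper's experiments use SeDuMi, so this code, the `lam` weight and the `modified` flag are our own illustration rather than the authors' implementation:

```python
# Sketch of the GNMDS SDP (Eq. 1) and its modified form (Eq. 3); ours, not the
# authors' code. quadruples is a list of (i, j, k, l) with d(x_i,x_j) <= d(x_k,x_l).
import numpy as np
import cvxpy as cp

def gnmds_embedding(n, quadruples, lam=1.0, modified=False, L=2):
    K = cp.Variable((n, n), PSD=True)                # Gram matrix, K is PSD
    xi = cp.Variable(len(quadruples), nonneg=True)   # slack variables xi_ijkl

    def sqd(a, b):                                   # Eq. (2)
        return K[a, a] - 2 * K[a, b] + K[b, b]

    sign = -1 if modified else 1                     # +xi in Eq. (1), -xi in Eq. (3)
    cons = [cp.sum(K) == 0]                          # center the embedding at the origin
    cons += [sqd(k, l) - sqd(i, j) + sign * xi[r] >= 1
             for r, (i, j, k, l) in enumerate(quadruples)]

    cp.Problem(cp.Minimize(cp.sum(xi) + lam * cp.trace(K)), cons).solve()

    w, V = np.linalg.eigh(K.value)                   # recover a rank-L embedding
    idx = np.argsort(w)[::-1][:L]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

Setting `modified=True` switches the slack sign from $+\xi_{ijkl}$ (Eq. (1)) to $-\xi_{ijkl}$ (Eq. (3)); everything else is unchanged.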
4. Solving F2 by QP

As introduced in Section 2, instead of learning an embedding as in F1, in F2 a dissimilarity function $d(x_i, x_j) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is learned, such that all the training ranks among values of $d(\cdot, \cdot)$ are preserved and the learned function can generalize to new samples. This is indeed a dissimilarity learning problem. Many previous metric learning methods (Hastie & Tibshirani, 1996; Goldberger et al., 2004; Kwok & Tsang, 2003) try to learn an alternative dissimilarity function by replacing the Euclidean metric with a properly learned Mahalanobis metric, either globally or locally. In this section we propose the d-ranking Vector Machine (d-ranking-VM) method. Unlike metric learning methods, d-ranking-VM is explicitly regularized, so we have full control over the complexity of $d(x_i, x_j)$. D-ranking-VM utilizes the technique of hyperkernel learning (Ong et al., 2005), which was originally proposed for learning a proper kernel. D-ranking-VM is formulated as the following optimization problem:

\[
\begin{aligned}
\text{d-ranking-VM (primal):}\quad & \min_{d \in \mathcal{H}}\ \frac{1}{N}\sum_{(i,j,k,l)\in S}\xi_{ijkl} + \lambda\,\|d\|_{\mathcal{H}}^2 \\
\text{s.t. } & d(x_k, x_l) - d(x_i, x_j) - \xi_{ijkl} \ge 1, \quad \xi_{ijkl} \ge 0,
\end{aligned}
\tag{4}
\]

for all $(i, j, k, l) \in S$, where $N = |S|$ and $\mathcal{H}$ is a hyper-reproducing kernel Hilbert space (hyper-RKHS) from which the function $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is drawn. Like the representer theorem in an RKHS (Kimeldorf & Wahba, 1971), there is also a representer theorem in a hyper-RKHS (see (Ong et al., 2005) or (Kondor & Jebara, 2006) for the theorem and proofs):

\[
d(x) = \sum_{p=1}^{M} c_p\, K(x_p, x),
\tag{5}
\]

where $K$ is a semidefinite hyperkernel, $x$ denotes a pair of samples $(x_i, x_j)$, and $M$ is the number of distinct dissimilarity pairs provided by the training rank data $S$. We denote the set of dissimilarity pairs by $D$, with $M = |D|$. Normally we have $M > N$ (the discussion of $M$ and $N$ is given in Section 5.1). Substituting Eq. (5) into (4), we can rewrite the primal problem in the following form:

\[
\begin{aligned}
& \min_{C,\,\xi}\ \frac{1}{N}\sum_{p=1}^{N}\xi_p + \lambda\, C^{\top} K C \\
\text{s.t. } & \sum_{s=1}^{M} c_s K(x_s; x_k, x_l) - \sum_{s=1}^{M} c_s K(x_s; x_i, x_j) - \xi_p \ge 1, \quad \xi_p \ge 0,
\end{aligned}
\tag{6}
\]

for all $p \in S$, where $(i, j, k, l)$ denotes the $p$-th quadruple and $x_s$ the $s$-th pair in $D$, $C \in \mathbb{R}^M$ is the vector whose $s$-th element is $c_s$, and $K \in \mathbb{R}^{M \times M}$ is the hyperkernel matrix over $D$. The dual of (6) can be derived using the Lagrangian technique. The solution of this optimization problem is given by the saddle point of the Lagrangian:

\[
L(C, \xi_p, \alpha_p, \eta_p) = \frac{1}{N}\sum_{p=1}^{N}\xi_p + \lambda\, C^{\top} K C - \sum_{p=1}^{N}\eta_p \xi_p
+ \sum_{p=1}^{N}\alpha_p \Big( \sum_{s=1}^{M} c_s \big[ K(x_s; x_i, x_j) - K(x_s; x_k, x_l) \big] + 1 + \xi_p \Big),
\tag{7}
\]

where $\alpha_p$ and $\eta_p$ are non-negative Lagrange multipliers. The primal problem is convex, thus strong duality holds between the primal and the dual. Using the KKT optimality conditions, we have

\[
\frac{\partial L}{\partial C} = 2\lambda K C + K (P - Q) A = 0,
\tag{8}
\]

and

\[
\frac{\partial L}{\partial \xi_p} = \frac{1}{N} + \alpha_p - \eta_p = 0,
\tag{9}
\]

where $A \in \mathbb{R}^N$ is the vector whose $p$-th element is $\alpha_p$, and $P, Q \in \mathbb{R}^{M \times N}$ are two matrices which contain the rank information. Each column of $P$ and each column of $Q$ contains exactly one 1 and $M - 1$ zeros: if the $r$-th training quadruple in $S$ is $(i, j, k, l)$, which means that $d(x_i, x_j) < d(x_k, x_l)$, and if the pair $(i, j)$ is the $m$-th element of $D$ while $(k, l)$ is the $n$-th element of $D$, then $P_{mr} = 1$ and $Q_{nr} = 1$.
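The indicator matrices $P$ and $Q$ described above are easy to build explicitly. The following sketch (ours; the variable names are illustrative) constructs them from the ordered quadruples in $S$ and an index over the pair set $D$:

```python
# Build the indicator matrices P, Q in R^{M x N} described above: column r has a
# single 1 in the row of the pair (i, j) (for P) or (k, l) (for Q) of the r-th
# quadruple. Illustration only; names are ours.
import numpy as np

def build_PQ(quadruples, pair_index):
    """pair_index maps an unordered pair, e.g. (min(i, j), max(i, j)), to 0..M-1."""
    M, N = len(pair_index), len(quadruples)
    P, Q = np.zeros((M, N)), np.zeros((M, N))
    for r, (i, j, k, l) in enumerate(quadruples):
        P[pair_index[tuple(sorted((i, j)))], r] = 1.0   # pair (i, j) of constraint r
        Q[pair_index[tuple(sorted((k, l)))], r] = 1.0   # pair (k, l) of constraint r
    return P, Q

# These are the P and Q entering the stationarity condition of Eq. (8): with the
# dual variables collected in A, Eq. (8) reads 2*lam*K @ C + K @ (P - Q) @ A = 0.
```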
From Eqs. (8) and (9) we have

\[
C = \frac{(Q - P) A}{2\lambda},
\tag{10}
\]

and

\[
\eta_p = \alpha_p + \frac{1}{N}.
\tag{11}
\]

Substituting Eqs. (10) and (11) into (6), we arrive at the following dual problem for d-ranking-VM:

\[
\begin{aligned}
\text{d-ranking-VM (dual):}\quad & \max_{\alpha}\ \sum_{p=1}^{N}\alpha_p - \frac{A^{\top}(Q - P)^{\top} K (Q - P) A}{4\lambda} \\
\text{s.t. } & \alpha_p \ge 0, \quad \text{for all } (i, j, k, l) \in S.
\end{aligned}
\tag{12}
\]

This is a quadratic programming (QP) problem which has a form similar to that of the SVM. Thus the sequential optimization techniques developed for SVMs can readily be employed for d-ranking-VM. To perform testing, we evaluate the learned dissimilarity function of Eq. (5) and make pairwise comparisons.

An important issue for kernel learning methods is the selection of a proper kernel; the same issue exists in hyperkernel learning. Here we propose some examples of hyperkernels, which are hyper-extensions of Gaussian RBF kernels and polynomial kernels. The construction of these hyperkernels is based on the following proposition.

Proposition 4.1. Let $k_a(\cdot, \cdot)$ and $k_b(\cdot, \cdot)$ be positive definite kernels. Then for all $x_1, x_1', x_2, x_2' \in \mathcal{X}$ and $\alpha, \beta > 0$, both $k_a(x_1, x_2)\, k_b(x_1', x_2')$ and $\alpha\, k_a(x_1, x_2) + \beta\, k_b(x_1', x_2')$ give a hyperkernel $k$.

Proof. See the appendix.

Example 4.2. (Gaussian symmetric product hyperkernel) Let $k_a$ and $k_b$ be the same Gaussian RBF kernel $k(x, x') = \exp\!\big(-\|x - x'\|^2 / (2\sigma^2)\big)$, and let $\alpha = \beta = 1$. The Gaussian symmetric product hyperkernel is given by

\[
k\big((x_1, x_1'), (x_2, x_2')\big) = k(x_1, x_2)\, k(x_1', x_2')
= \exp\!\Big( -\frac{\|x_1 - x_2\|^2 + \|x_1' - x_2'\|^2}{2\sigma^2} \Big).
\tag{13}
\]

Example 4.3. (Gaussian symmetric sum hyperkernel) Under the same conditions as Example 4.2, we can construct the Gaussian symmetric sum hyperkernel as

\[
k\big((x_1, x_1'), (x_2, x_2')\big) = k(x_1, x_2) + k(x_1', x_2')
= \exp\!\Big( -\frac{\|x_1 - x_2\|^2}{2\sigma^2} \Big) + \exp\!\Big( -\frac{\|x_1' - x_2'\|^2}{2\sigma^2} \Big).
\tag{14}
\]

Example 4.4. (Polynomial symmetric product hyperkernel) Let $k_a$ and $k_b$ be the same polynomial kernel $k(x, x') = (\langle x, x'\rangle + q)^p$, and let $\alpha = \beta = 1$. We can construct the polynomial symmetric product hyperkernel as

\[
k\big((x_1, x_1'), (x_2, x_2')\big) = \big(\langle x_1, x_2\rangle + q\big)^p \big(\langle x_1', x_2'\rangle + q\big)^p.
\tag{15}
\]

Example 4.5. (Polynomial symmetric sum hyperkernel) Under the same conditions as Example 4.4, we can construct the polynomial symmetric sum hyperkernel as

\[
k\big((x_1, x_1'), (x_2, x_2')\big) = \big(\langle x_1, x_2\rangle + q\big)^p + \big(\langle x_1', x_2'\rangle + q\big)^p.
\tag{16}
\]
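As an illustration of Example 4.2 and of how the expansion in Eq. (5) is evaluated on new pairs at test time, consider the following sketch (ours; the value of `sigma`, the toy data and the coefficient vector `c` are placeholders, not values from the paper):

```python
# Gaussian symmetric product hyperkernel of Example 4.2, the M x M hyperkernel
# matrix over the pair set D, and out-of-sample evaluation of Eq. (5). Sketch only.
import numpy as np

def hyper_k(pair1, pair2, sigma=1.0):
    # Eq. (13): k((x1, x1'), (x2, x2')) = exp(-(||x1-x2||^2 + ||x1'-x2'||^2) / (2 sigma^2))
    (x1, x1p), (x2, x2p) = pair1, pair2
    return np.exp(-(np.sum((x1 - x2) ** 2) + np.sum((x1p - x2p) ** 2)) / (2 * sigma ** 2))

X = np.random.randn(6, 3)                      # toy samples (placeholders)
D = [(0, 1), (2, 3), (4, 5)]                   # distinct dissimilarity pairs, M = |D|
pairs = [(X[a], X[b]) for a, b in D]

K_hyper = np.array([[hyper_k(p, q) for q in pairs] for p in pairs])   # M x M matrix

c = np.random.randn(len(D))                    # stands in for the learned c_p of Eq. (5)

def d_hat(xa, xb):
    # Eq. (5): d(x) = sum_p c_p K(x_p, x), evaluated on an unseen pair (xa, xb)
    return sum(coef * hyper_k(p, (xa, xb)) for coef, p in zip(c, pairs))

print(d_hat(X[0], X[2]) < d_hat(X[1], X[4]))   # test-time pairwise comparison
```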
5. Experiments

The proposed d-ranking methods of Section 3 and Section 4 are evaluated by several experiments.

5.1. Obtaining Ranks of Pairs from Data

To test our methods, we need to obtain pairwise distance ranks. This can be done in many ways. Generally speaking, for a problem with $n$ data samples, the total number of available distance pairs is $M = C_n^2 = n(n-1)/2$ (if we treat $d(x_i, x_j)$ and $d(x_j, x_i)$ as the same distance). The total number of pairwise distance ranks is $N = C_M^2 = M(M-1)/2 = n^4/8 - n^3/4 - n^2/8 + n/4$ (if we treat $d(x_i, x_j) < d(x_k, x_l)$ and $d(x_k, x_l) > d(x_i, x_j)$ as the same rank pair). Table 1 gives some examples of the relation between $n$ and $N$. As $n$ grows, the number of rank constraints increases dramatically; even a problem with $n > 100$ will be impossible for some optimization solvers.

Table 1. Some examples of $N$ vs. $n$.

    n:    2    3    4     10      20        50          100          1000
    N:    0    3   15    990   17955    749700    12248775    1.2475e+11

Here we reduce $N$ by exploiting order transitivity: if $A > B$ and $B > C$, then the rank pair $A > C$ can be ignored (it is automatically satisfied) in the optimization constraints. The method is very simple. First we sort the $M$ distances in decreasing order; then we take each pair of adjacent distances to form one distance rank pair. By doing this, $N$ can be reduced to $n^2/2 - n/2 - 1$. This is the largest $N$ needed to carry the full rank information of all the distances. Of course, in some applications the rank information is not fully given, and $N < n^2/2 - n/2 - 1$.

We test our methods on three data sets: 1) the 2D locations of the 109 largest cities in the continental US; 2) 100 images of the handwritten digits "3" and "5" from the USPS database, each of size 16 x 16; 3) 126 face images of 4 people from the UMist database, each of size 112 x 92. For the GNMDS and modified GNMDS methods, all ranks of distance pairs are fed to an SDP solver, and the recovered 2D embeddings are plotted; we used SeDuMi (Sturm, 1999) to obtain the results reported in the following subsections. For d-ranking-VM, LibSVM (Chang & Lin, 2001) is employed as our QP solver; it implements the sequential minimal optimization (SMO) technique. The learned dissimilarities are used as "pseudo-distances" and are fed to classical MDS, and the recovered 2D embeddings are then plotted.

5.2. Results of US Cities

In this data set, $n = 109$ and $N = 5885$. Each city's location is given by 2D coordinates. Figure 5 shows the true locations, and Figure 6 shows the results given by GNMDS. It can be observed that GNMDS cannot correctly recover the embedding from the distance ranks: most of the embedded samples are pushed onto a line, and only 50.3% of the distance ranks are preserved, which is essentially chance level. This supports the analysis in Section 3. Figure 7 shows the result given by the modified GNMDS. The recovered embedding roughly reflects the geometry of the cities, and 74.5% of the distance ranks are preserved. Since the distance values themselves are not provided, there is no hope of matching the true locations exactly. Figure 8 shows the results given by d-ranking-VM, where $\lambda = 10$ and the Gaussian symmetric product hyperkernel is used with $\sigma = 15$; 97.9% of the distance ranks are preserved. Table 2 shows the runtime of the above experiments.

Figure 5. Locations of the 109 largest cities in the continental United States.

Figure 6. Recovered locations of the 109 US cities given by GNMDS.

Figure 7. Recovered locations of the 109 US cities given by the modified GNMDS.

Figure 8. Recovered locations of the 109 US cities given by d-ranking-VM, using the Gaussian symmetric product hyperkernel.

Table 2. Runtime comparison for the three d-ranking methods.

    Method              Runtime (minutes)
    GNMDS                     124
    modified GNMDS             71
    d-ranking-VM                4.5

5.3. Results of USPS Handwritten Digits

In this data set, $n = 100$ and $N = 4949$. The dimension of every data sample is 16 x 16 = 256. Figure 9 shows the recovered 2D results given by d-ranking-VM.

Figure 9. Recovered locations of USPS handwritten digits "3" and "5" given by d-ranking-VM.

5.4. Results of UMist Human Faces

In this data set, $n = 126$ and $N = 7874$. The dimension of every data sample is 112 x 92 = 10304. Figure 10 shows the recovered 2D results given by d-ranking-VM.

Figure 10. Recovered locations of UMist human faces given by d-ranking-VM.

6. Discussions and Conclusions

We have presented three d-ranking formulations and have given numerical solutions for two of them, namely solving d-ranking by SDP and solving d-ranking by QP. Each of them has its advantages and shortcomings. We list some pros and cons from different perspectives:

· Pros for d-ranking by SDP (GNMDS and the modified version): It recovers a low-dimensional embedding directly. Only ordering information is needed; there is no need to know the values of the sample coordinates.

· Cons for d-ranking by SDP (GNMDS and the modified version): Solving the SDP is hard, especially for large-scale problems. Even sophisticated SDP solvers can only solve problems with $n < 10^3$ samples and fewer than $10^5$ constraints. It cannot be used to predict unseen samples.

· Pros for d-ranking by QP (d-ranking-VM): Solving the QP in our case is much easier than the SDP, since it can be converted to a form similar to the SVM. Using sequential methods (SMO), it can solve problems with $N > 10^5$. The learned dissimilarity function can be used to predict unseen samples.

· Cons for d-ranking by QP (d-ranking-VM): It cannot recover a low-dimensional embedding explicitly; one needs to use MDS or other embedding methods after learning the dissimilarities. Learning the dissimilarity measure requires the coordinates of the original samples. As with other kernel methods, choosing a good hyperkernel is crucial for solving a specific problem.
To our knowledge, this is the first work which brings out-of-sample prediction capability and large-scale scalability to d-ranking problems. Note that the technique of d-ranking-VM can also be employed to solve distance-preserving problems. We will investigate the regularization properties and evaluate the performance of different hyperkernels in future research. Finding a numerical solution for formulation F3 will also be future work.

References

Agarwal, S. (2007). Generalized non-metric multidimensional scaling. AISTATS '07.

Borg, I., & Groenen, P. J. (2005). Modern multidimensional scaling: Theory and applications. New York: Springer.

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cohen, W. W., Schapire, R. E., & Singer, Y. (1998). Learning to order things. NIPS.

Goldberger, J., Roweis, S., Hinton, G., & Salakhutdinov, R. (2004). Neighbourhood components analysis. NIPS.

Hastie, T., & Tibshirani, R. (1996). Discriminant adaptive nearest neighbor classification. IEEE Trans. PAMI, 18, 607-616.

Herbrich, R., Graepel, T., & Obermayer, K. (1999). Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, 115-132.

Joachims, T. (2002). Optimizing search engines using clickthrough data. Proc. of SIGKDD '02.

Kimeldorf, G., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33, 82-95.

Kondor, R., & Jebara, T. (2006). Gaussian and Wishart hyperkernels. NIPS.

Kwok, J. T., & Tsang, I. W. (2003). Learning with idealized kernels. ICML.

Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27-72.

Micchelli, C. A., & Pontil, M. (2005). Learning the kernel function via regularization. Journal of Machine Learning Research, 6, 1099-1125.

Ong, C. S., Smola, A. J., & Williamson, R. C. (2005). Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6, 1043-1071.

Schultz, M., & Joachims, T. (2003). Learning a distance metric from relative comparisons. NIPS.

Sturm, J. F. (1999). Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11.

Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319-2323.

Vandenberghe, L., & Boyd, S. (1996). Semidefinite programming. SIAM Review, 38, 49-95.

Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (2002). Distance metric learning with application to clustering with side-information. NIPS.

7. Appendix: Proof of Proposition 4.1

We need to prove that $k$ is a kernel on $\mathcal{X}^2$.

Denote by $\otimes$ the Kronecker product of two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$:

\[
A \otimes B =
\begin{bmatrix}
a_{11}B & \cdots & a_{1n}B \\
\vdots & \ddots & \vdots \\
a_{m1}B & \cdots & a_{mn}B
\end{bmatrix},
\tag{17}
\]

and define $\oplus$ as

\[
A \oplus B =
\begin{bmatrix}
a_{11}\mathbf{1} + B & \cdots & a_{1n}\mathbf{1} + B \\
\vdots & \ddots & \vdots \\
a_{m1}\mathbf{1} + B & \cdots & a_{mn}\mathbf{1} + B
\end{bmatrix},
\tag{18}
\]

where $\mathbf{1}$ is a matrix of all ones.

Denote the kernel matrix of $k_a(x_1, x_2)$ by $K_a$, that of $k_b(x_1', x_2')$ by $K_b$, and the hyperkernel matrix of $k$ by $K$. It is easy to verify that we can construct $K = K_a \otimes K_b$. Since $K_a$ and $K_b$ are positive definite, their eigenvalues $\mu_{a,i}$ and $\mu_{b,j}$ are positive.
Thus the eigenvalues of $K$, $\nu_{ij} = \mu_{a,i}\,\mu_{b,j}$, are also positive. A symmetric matrix with positive eigenvalues is positive definite, so $k = k_a(x_1, x_2)\,k_b(x_1', x_2')$ is a valid hyperkernel.

We can also verify that $K_a \oplus K_b = K_a \otimes \mathbf{1} + \mathbf{1} \otimes K_b$. Since $K_a$, $K_b$ and $\mathbf{1}$ are all positive semidefinite and $\alpha, \beta > 0$, the matrix $K = \alpha K_a \otimes \mathbf{1} + \beta\, \mathbf{1} \otimes K_b$ is positive semidefinite. Thus $k = \alpha\,k_a(x_1, x_2) + \beta\,k_b(x_1', x_2')$ is a valid hyperkernel.
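A quick numerical illustration of the two constructions in the proof (ours, not part of the paper): the Kronecker product of two positive semidefinite kernel matrices, and the weighted sum $\alpha K_a \otimes \mathbf{1} + \beta\,\mathbf{1} \otimes K_b$, both have non-negative spectra:

```python
# Numerical sanity check (illustration only) of the two PSD constructions above.
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((4, 2)), rng.standard_normal((5, 2))
Ka, Kb = A @ A.T, B @ B.T                      # positive semidefinite kernel matrices

alpha, beta = 1.0, 1.0
K_prod = np.kron(Ka, Kb)                                            # K_a (x) K_b
K_sum = alpha * np.kron(Ka, np.ones(Kb.shape)) + beta * np.kron(np.ones(Ka.shape), Kb)

print(np.linalg.eigvalsh(K_prod).min() >= -1e-9)   # non-negative spectrum
print(np.linalg.eigvalsh(K_sum).min() >= -1e-9)    # non-negative spectrum
```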