Active Kernel Learning

Steven C.H. Hoi, School of Computer Engineering, Nanyang Technological University, Singapore (chhoi@ntu.edu.sg)
Rong Jin, Department of Computer Science and Engineering, Michigan State University (rongjin@cse.msu.edu)

Abstract

Identifying the appropriate kernel function/matrix for a given dataset is essential to all kernel-based learning techniques. A number of kernel learning algorithms have been proposed to learn kernel functions or matrices from side information (e.g., either labeled examples or pairwise constraints). However, most previous studies are limited to "passive" kernel learning, in which the side information is provided beforehand. In this paper we present a framework of Active Kernel Learning (AKL) that actively identifies the most informative pairwise constraints for kernel learning. The key challenge of active kernel learning is how to measure the informativeness of an example pair given that its class label is unknown. To this end, we propose a min-max approach for active kernel learning that selects the example pair that results in a large classification margin regardless of its assigned class label. We furthermore approximate the related optimization problem by a convex programming problem. We evaluate the effectiveness of the proposed algorithm by comparing it to two other implementations of active kernel learning. An empirical study with nine datasets on semi-supervised data clustering shows that the proposed algorithm is more effective than its competitors.

(Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).)

1. Introduction

Kernel methods have attracted increasing attention from researchers in computer science and engineering due to their superior performance in data clustering, classification, and dimensionality reduction (Scholkopf & Smola, 2002; Vapnik, 1998). Kernel methods have been applied to many fields, such as data mining, pattern recognition, information retrieval, computer vision, and bioinformatics. Since the choice of kernel functions or matrices is often critical to the performance of many kernel-based learning techniques, automatically learning a kernel function/matrix for a given dataset has become an increasingly important research problem.

Recently, a number of kernel learning algorithms (Chapelle et al., 2003; Cristianini et al., 2002; Hoi et al., 2007; Kondor & Lafferty, 2002; Kulis et al., 2006; Lanckriet et al., 2004; Zhu et al., 2005) have been proposed to learn kernel functions or matrices from side information. The side information can be provided in two different forms: either labeled examples or pairwise constraints. In the latter case, two types of pairwise constraints are examined in previous studies: a must-link pair, where two examples should belong to the same class, and a cannot-link pair, where two examples should belong to different classes. In this study, we focus on kernel learning with pairwise constraints.

Most kernel learning methods, termed "passive kernel learning", assume that labeled data is provided beforehand. However, since labeled data may be expensive to acquire, it is more cost-effective to identify the most informative example pairs so that the kernel can be learned efficiently with only a small number of pairwise constraints. To this end, we focus on active kernel learning (AKL), whose goal is to identify the example pairs that are informative to the target kernels. We extend our previous work on non-parametric kernel learning (Hoi et al., 2007) to active kernel learning. As shown in (Hoi et al., 2007), parametric approaches for kernel learning are often limited by their capacity in fitting diverse patterns of real-world data, and therefore are not as effective as the non-parametric approach for kernel learning.

The simplest approach toward active kernel learning is to measure the informativeness of an example pair by its kernel similarity.
Given a pair of examples (x_i, x_j), we assume that K_{i,j}, the kernel similarity between x_i and x_j, is a large positive number when x_i and x_j are in the same class, and a large negative number when they are in different classes. Thus, by following the uncertainty principle of active learning (Tong & Koller, 2000; Hoi et al., 2006), the most informative example pairs should be the ones whose kernel similarities are closest to zero. In other words, the criterion is to select the example pair with the least absolute kernel similarity (i.e., |K_{i,j}|).

[Figure 1: three panels (a)-(c) on the double-spiral data, with accuracy bars reading 0.51, 0.58, and 0.86.]
Figure 1. Examples of active kernel learning: (a) double-spiral artificial data with some given pairwise constraints, (b) AKL with the least |K_{i,j}|, (c) the proposed AKL method. The right bars show the resulting clustering accuracies using kernel k-means clustering methods.

Unfortunately, this simple approach may not always be effective in obtaining informative pairwise constraints for kernel learning. Figure 1 illustrates an example of active kernel learning for data clustering. In this example, Figure 1(a) shows an artificial dataset of two classes together with a few pairwise constraints. Figure 1(b) shows the pairwise constraints with the least |K_{i,j}|. We observe that most of them are must-link pairs with two data points separated by a modest distance. Since such must-link constraints are not informative about the clustering boundary, only a relatively small improvement in clustering accuracy (from 51% to 58%) is observed when using the kernel learned by this simple approach. In contrast, as shown in Figure 1(c), the proposed approach for active kernel learning is able to identify a pool of diverse pairwise constraints, including both must-links and cannot-links. The clustering accuracy is increased significantly, from 51% to 86%, by using the proposed active kernel learning.

The rest of this paper is organized as follows. Section 2 presents the min-max framework for our active kernel learning method, in which the problem is formulated as a convex optimization problem. Section 3 describes the results of the experimental evaluation. Section 4 concludes this work.

2. Active Kernel Learning

Our work extends the previous work on non-parametric kernel learning (Hoi et al., 2007) by introducing the component of actively identifying the example pairs that are most informative to the target kernel. In this section, we first briefly review the non-parametric approach for kernel learning in (Hoi et al., 2007), followed by the description of the min-max framework for active kernel learning.

2.1. Non-parametric Kernel Learning

Let the entire data collection be denoted by U = (x_1, x_2, ..., x_N), where each data point x_i in R^d is a vector of d elements. Let S in R^{N x N} be a symmetric matrix where each S_{i,j} >= 0 represents the similarity between x_i and x_j. Unlike the kernel similarity matrix, S does not have to be positive semi-definite. For the convenience of presentation, we set S_{i,i} = 0 for all the examples.
Then, according to (Hoi et al., 2007), a normalized graph Laplacian L is constructed using the similarity matrix S as follows:

L = (1 + \delta) I - D^{-1/2} S D^{-1/2}

where D = diag(d_1, d_2, ..., d_N) is a diagonal matrix with d_i = \sum_{j=1}^{N} S_{i,j}. A small \delta > 0 is introduced to prevent L from being singular.

Let us denote by T the set of pairwise constraints. We construct a matrix T in R^{N x N} to represent the pairwise constraints in T, i.e.,

T_{i,j} = \begin{cases} +1 & (x_i, x_j) \text{ is a must-link pair in } \mathcal{T} \\ -1 & (x_i, x_j) \text{ is a cannot-link pair in } \mathcal{T} \\ 0 & \text{otherwise} \end{cases}

Given the similarity matrix S and the pairwise constraints in T, the goal of kernel learning is to identify a kernel matrix Z in R^{N x N} that is consistent with both T and S. Following (Hoi et al., 2007), we formulate it into the following convex optimization problem:

\min_{Z,\,\epsilon} \;\; \mathrm{tr}(LZ) + \frac{c}{2} \sum_{(i,j)\in\mathcal{T}} \epsilon_{i,j}^2 \qquad (1)
\mathrm{s.t.} \quad Z_{i,j} T_{i,j} \ge 1 - \epsilon_{i,j}, \;\; \epsilon_{i,j} \ge 0, \;\; \forall (i,j)\in\mathcal{T}, \qquad Z \succeq 0

The first term in the above objective function plays a similar role as the manifold regularization (Belkin & Niyogi, 2004), where the graph Laplacian is used to regularize the classification results. The second term measures the inconsistency between the learned kernel matrix Z and the given pairwise constraints. Note that unlike the formulation in (Hoi et al., 2007), we change \epsilon_{i,j} in the loss function to \epsilon_{i,j}^2. This modification is specifically designed for active kernel learning, and the reason will become clear later. It is not difficult to see that the problem in (1) is a semi-definite programming problem, and therefore can be solved by a standard software package such as SeDuMi (Sturm, 1999).

2.2. Min-max Framework for Active Kernel Learning

The simplest approach toward active kernel learning is to follow the uncertainty principle of active learning and to select the example pair (x_i, x_j) with the least |Z_{i,j}|.(1) However, as already discussed in the introduction, the key problem with this simple approach is that the example pairs with the least |Z_{i,j}| may not necessarily be the most informative ones, and therefore may not result in an efficient learning of the kernel matrix. To address this problem, we propose a min-max framework for active kernel learning that measures the informativeness of an example pair by how significantly the selected example pair will affect the target kernel matrix.

(1) Here we assume that Z_{i,j} > 0 when x_i and x_j are likely to share the same class, and Z_{i,j} < 0 when x_i and x_j are likely to be assigned to different classes.

Consider an unlabeled example pair (x_k, x_l) not in T. To measure how this example pair will affect the kernel matrix, we consider the kernel learning problem with the additional example pair (x_k, x_l) labeled by y in {-1, +1}, i.e.,

\min_{Z,\,\epsilon} \;\; \mathrm{tr}(LZ) + \frac{c}{2} \sum_{(i,j)\in\mathcal{T}} \epsilon_{i,j}^2 + \frac{c}{2} \epsilon_{k,l}^2 \qquad (2)
\mathrm{s.t.} \quad T_{i,j} Z_{i,j} \ge 1 - \epsilon_{i,j}, \;\; \forall (i,j)\in\mathcal{T}, \qquad y\, Z_{k,l} \ge 1 - \epsilon_{k,l}, \qquad Z \succeq 0

Let us denote by \omega((k,l), y) the optimal value of the above optimization problem. Intuitively, \omega((k,l), y) measures the overall classification accuracy with the additional example pair (x_k, x_l) labeled by y. To further measure the informativeness of example pair (x_k, x_l), we introduce the quantity \omega(k,l) as follows:

\omega(k,l) = \max_{y \in \{-1,+1\}} \omega((k,l), y) \qquad (3)

Clearly, \omega(k,l) measures the worst classification error with the addition of example pair (x_k, x_l). Overall, \omega(k,l) measures how the example pair (x_k, x_l) will affect the overall objective function, which indirectly measures the impact of the example pair on the target kernel matrix. To see this, consider an example pair (x_k, x_l) that is highly consistent with the current kernel Z for some label y (i.e., Z_{k,l} y >= 1).
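To make the definition concrete, the following is a small illustrative sketch (ours, not part of the paper) of how \omega((k,l), y) and \omega(k,l) in (2)-(3) could be evaluated with a generic convex solver; the function name, the cvxpy package, and the SCS solver choice are assumptions made purely for illustration.

```python
import cvxpy as cp
import numpy as np

def omega(L, labeled, candidate, c=1.0):
    """Evaluate omega(k,l) = max_y omega((k,l), y) by solving problem (2)
    for both labels of the candidate pair.

    L         : (N, N) normalized graph Laplacian (numpy array)
    labeled   : dict {(i, j): t} with t = +1 (must-link) or -1 (cannot-link)
    candidate : tuple (k, l) of an unlabeled pair
    """
    N = L.shape[0]
    k, l = candidate
    values = []
    for y in (-1.0, +1.0):
        Z = cp.Variable((N, N), symmetric=True)
        eps = cp.Variable(len(labeled), nonneg=True)   # slack for labeled pairs
        eps_kl = cp.Variable()                         # slack for the candidate pair
        cons = [Z >> 0, y * Z[k, l] >= 1 - eps_kl]
        for idx, ((i, j), t) in enumerate(labeled.items()):
            cons.append(t * Z[i, j] >= 1 - eps[idx])
        obj = cp.trace(L @ Z) + (c / 2) * (cp.sum_squares(eps) + cp.square(eps_kl))
        prob = cp.Problem(cp.Minimize(obj), cons)
        prob.solve(solver=cp.SCS)                      # any SDP-capable solver works
        values.append(prob.value)
    return max(values)   # omega(k, l), as defined in (3)
```

Selecting the most informative pair as in (4) below would require one such solve per candidate pair and per label, which is precisely the scalability issue addressed in the remainder of this section.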
According to the definition of \omega(k,l), we would expect a large \omega(k,l) for such a pair (x_k, x_l). This is because by assigning the label -y to example pair (x_k, x_l), we expect a large classification error and therefore a large \omega(k,l). Hence, we use \omega(k,l) to measure the uninformativeness of example pairs, i.e., the smaller \omega(k,l), the less informative the example pair is. Therefore, the most informative example pair is found by minimizing \omega(k,l), i.e.,

(k^*, l^*) = \arg\min_{(k,l)\notin\mathcal{T}} \; \max_{t \in \{-1,+1\}} \; \omega((k,l), t) \qquad (4)

Directly solving the min-max optimization problem in (4) is challenging because the function \omega((k,l), t) is defined implicitly by the optimization problem in (2). The following theorem allows us to significantly simplify the optimization problem in (4).

Theorem 1. The optimization problem in (4) is equivalent to the following optimization problem:

\min_{Z,\,\epsilon,\,(k,l)\notin\mathcal{T}} \;\; \mathrm{tr}(LZ) + \frac{c}{2} \sum_{(i,j)\in\mathcal{T}} \epsilon_{i,j}^2 + \frac{c}{2} \epsilon_{k,l}^2 \qquad (5)
\mathrm{s.t.} \quad T_{i,j} Z_{i,j} \ge 1 - \epsilon_{i,j}, \;\; \epsilon_{i,j} \ge 0, \;\; \forall (i,j)\in\mathcal{T}, \qquad \epsilon_{k,l} \ge 1 + |Z_{k,l}|, \qquad Z \succeq 0

Proof. The theorem follows from the fact that the label y in {-1, +1} maximizing \omega((k,l), y) is y = -sign(Z_{k,l}). This fact allows us to remove the inner maximization in (4) and obtain the result in the theorem.

The following corollary shows that the approach of selecting the example pair with the least |Z_{k,l}| indeed corresponds to a special solution of the problem in (5).

Corollary 2. The optimal solution to (5) with a fixed kernel matrix Z is the example pair with the least |Z_{k,l}|, i.e.,

(k^*, l^*) = \arg\min_{(k,l)\notin\mathcal{T}} |Z_{k,l}|

Proof. By fixing Z, the problem in (5) is simplified as

\min_{(k,l)\notin\mathcal{T}} \; \epsilon_{k,l}^2 \qquad \mathrm{s.t.} \quad \epsilon_{k,l} \ge 1 + |Z_{k,l}|

It is easy to see that the optimal solution to the above problem is the example pair with the least |Z_{k,l}|.

Note that a similar observation is described in the study (Chen & Jin, 2007) for standard active learning.
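As a side note, the selection rule of Corollary 2 (used later as the AKL-min-|Z| baseline) reduces to a simple scan over the unlabeled pairs; a minimal sketch, with the function name and data layout chosen by us for illustration:

```python
import numpy as np

def select_min_abs_z(Z, labeled_pairs, n_pairs=1):
    """Pick the unlabeled pair(s) (k, l), k < l, with the smallest |Z[k, l]|.

    Z             : (N, N) current kernel matrix (numpy array)
    labeled_pairs : set of (i, j) tuples already in the constraint set
    n_pairs       : how many pairs to return
    """
    N = Z.shape[0]
    candidates = [(abs(Z[k, l]), (k, l))
                  for k in range(N) for l in range(k + 1, N)
                  if (k, l) not in labeled_pairs and (l, k) not in labeled_pairs]
    candidates.sort(key=lambda t: t[0])
    return [pair for _, pair in candidates[:n_pairs]]
```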
2.3. Algorithm

The straightforward approach toward the optimization problem in (5) is to try out every example pair (x_k, x_l) not in T. Evidently, this approach will not scale well when the number of example pairs is large. Our first attempt toward solving the problem in (5) is to turn it into a continuous optimization problem. To this end, we introduce a variable p_{k,l} >= 0 to represent the probability of selecting the example pair (k,l) not in T. Using this notation, the optimization problem in (5) is rewritten as

\min_{Z \succeq 0,\, p,\, \epsilon} \;\; \mathrm{tr}(LZ) + \frac{c}{2} \sum_{(i,j)\in\mathcal{T}} \epsilon_{i,j}^2 + \frac{c}{2} \sum_{(k,l)\notin\mathcal{T}} p_{k,l}\, \epsilon_{k,l}^2 \qquad (6)
\mathrm{s.t.} \quad T_{i,j} Z_{i,j} \ge 1 - \epsilon_{i,j}, \;\; \forall (i,j)\in\mathcal{T}
\qquad \epsilon_{k,l} - 1 \ge Z_{k,l} \ge 1 - \epsilon_{k,l}, \;\; \forall (k,l)\notin\mathcal{T}
\qquad \sum_{(k,l)\notin\mathcal{T}} p_{k,l} \ge 1, \;\; p_{k,l} \ge 0, \;\; \forall (k,l)\notin\mathcal{T}

The following theorem shows the relationship between (6) and (5).

Theorem 3. Any global optimal solution to (5) is also a global optimal solution to (6).

The proof of the above theorem can be found in Appendix A. Unfortunately, the optimization problem in (6) is non-convex because of the term p_{k,l} \epsilon_{k,l}^2. It is therefore difficult to find the global optimal solution of (6). In order to turn (6) into a convex optimization problem, we view the constraint \sum_{(k,l)\notin\mathcal{T}} p_{k,l} \ge 1 as a bound on the arithmetic mean of p_{k,l}, i.e.,

\frac{1}{m} \sum_{(k,l)\notin\mathcal{T}} p_{k,l} \ge \frac{1}{m}

where m = |{(k,l) : (k,l) not in T}|. We then replace this constraint by the corresponding bound on the harmonic mean of p_{k,l}, i.e.,

\frac{m}{\sum_{(k,l)\notin\mathcal{T}} p_{k,l}^{-1}} \ge \frac{1}{m}, \quad \text{or} \quad \sum_{(k,l)\notin\mathcal{T}} p_{k,l}^{-1} \le m^2

This replacement is based on the property that a harmonic mean is no larger than an arithmetic mean, so any p satisfying the harmonic-mean bound also satisfies \sum_{(k,l)\notin\mathcal{T}} p_{k,l} \ge 1. By replacing the constraint \sum_{(k,l)\notin\mathcal{T}} p_{k,l} \ge 1 with the harmonic-mean bound, we approximate (6) by the following optimization problem:

\min_{Z \succeq 0,\, p,\, \epsilon} \;\; \mathrm{tr}(LZ) + \frac{c}{2} \sum_{(i,j)\in\mathcal{T}} \epsilon_{i,j}^2 + \frac{c}{2} \sum_{(k,l)\notin\mathcal{T}} p_{k,l}\, \epsilon_{k,l}^2 \qquad (7)
\mathrm{s.t.} \quad T_{i,j} Z_{i,j} \ge 1 - \epsilon_{i,j}, \;\; \epsilon_{i,j} \ge 0, \;\; \forall (i,j)\in\mathcal{T}
\qquad \epsilon_{k,l} - 1 \ge Z_{k,l} \ge 1 - \epsilon_{k,l}, \;\; \forall (k,l)\notin\mathcal{T}
\qquad \sum_{(k,l)\notin\mathcal{T}} p_{k,l}^{-1} \le m^2, \;\; 0 \le p_{k,l} \le 1, \;\; \forall (k,l)\notin\mathcal{T}

By defining the variable h_{k,l} = p_{k,l}^{-1}, we have

\min_{Z \succeq 0,\, h,\, \epsilon} \;\; \mathrm{tr}(LZ) + \frac{c}{2} \sum_{(i,j)\in\mathcal{T}} \epsilon_{i,j}^2 + \frac{c}{2} \sum_{(k,l)\notin\mathcal{T}} \frac{\epsilon_{k,l}^2}{h_{k,l}} \qquad (8)
\mathrm{s.t.} \quad T_{i,j} Z_{i,j} \ge 1 - \epsilon_{i,j}, \;\; \epsilon_{i,j} \ge 0, \;\; \forall (i,j)\in\mathcal{T}
\qquad \epsilon_{k,l} - 1 \ge Z_{k,l} \ge 1 - \epsilon_{k,l}, \;\; \forall (k,l)\notin\mathcal{T}
\qquad \sum_{(k,l)\notin\mathcal{T}} h_{k,l} \le m^2, \;\; h_{k,l} \ge 1, \;\; \forall (k,l)\notin\mathcal{T}

Notice that the constraint 0 <= p_{k,l} <= 1 is transferred into h_{k,l} >= 1. The following theorem shows the properties of the formulation in (8).

Theorem 4. We have the following properties for (8):
· (8) is a semi-definite programming (SDP) problem.
· Any feasible solution to (8) is also a feasible solution to (6) with p_{k,l} = h_{k,l}^{-1}, and the optimal value of (6) is upper bounded by that of (8).

The proof is provided in Appendix B. Note that using \epsilon_{i,j}^2 instead of \epsilon_{i,j} in the loss function is the key to turning (6) into a convex optimization problem. The second property stated in Theorem 4 indicates that by minimizing (8), we guarantee a small value for the objective function in (6). The following theorem gives the dual problem of (8), which is the key to efficient computation.

Theorem 5. The dual problem of (8) is

\max_{Q,\, W,\, \tau} \;\; -\sum_{(i,j)\in\mathcal{T}} \Big( \frac{Q_{i,j}^2}{2c} - Q_{i,j} \Big) - \sum_{(k,l)\notin\mathcal{T}} \Big( \frac{W_{k,l}^2}{2c} - |W_{k,l}| \Big) - (m^2 - m)\, \tau \qquad (9)
\mathrm{s.t.} \quad L \succeq Q \circ T + W \circ \bar{T}, \qquad Q_{i,j} \ge 0, \;\; \forall (i,j)\in\mathcal{T}, \qquad \tau \ge \frac{W_{k,l}^2}{2c}, \;\; \forall (k,l)\notin\mathcal{T}

where the matrix \bar{T} is defined as

\bar{T}_{i,j} = \begin{cases} 0 & (i,j) \in \mathcal{T} \\ 1 & \text{otherwise} \end{cases}

and \circ stands for the element-wise product of matrices. The proof can be found in Appendix C.

In the dual problem, the variables Q_{i,j} and W_{k,l} are the dual variables that indicate the importance of the labeled and unlabeled example pairs, respectively. We thus select the unlabeled example pair with the largest |W_{k,l}|. To speed up the computation, in our experiments we first select a subset of example pairs (fixed to 200) with the smallest |Z_{i,j}| using the current kernel matrix Z. We then set all W_{k,l} to zero if the corresponding pair is not selected. In this way, we significantly reduce the number of variables in the dual problem (9), thus simplifying the computation.
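To make the selection step concrete, here is a rough sketch (ours, not the authors' implementation, which works with the dual (9) and an SDP package) of how the primal form of (8), restricted to a pre-filtered candidate set as just described, could be set up with cvxpy; for simplicity m is taken to be the size of the candidate set.

```python
import cvxpy as cp
import numpy as np

def akl_select(L, labeled, candidates, c=1.0):
    """Solve the primal form of (8) over a candidate subset and return the
    candidate pair with the smallest h (i.e., the largest selection weight p = 1/h).

    L          : (N, N) normalized graph Laplacian
    labeled    : dict {(i, j): t}, t = +1 / -1 for must-link / cannot-link
    candidates : list of unlabeled pairs (k, l), e.g. the 200 pairs with smallest |Z|
    """
    N = L.shape[0]
    m = len(candidates)                     # simplification: m = candidate count
    Z = cp.Variable((N, N), symmetric=True)
    eps_lab = cp.Variable(len(labeled), nonneg=True)
    eps_unl = cp.Variable(m)
    h = cp.Variable(m)

    cons = [Z >> 0, cp.sum(h) <= m ** 2, h >= 1]
    for idx, ((i, j), t) in enumerate(labeled.items()):
        cons.append(t * Z[i, j] >= 1 - eps_lab[idx])
    for idx, (k, l) in enumerate(candidates):
        cons += [eps_unl[idx] - 1 >= Z[k, l], Z[k, l] >= 1 - eps_unl[idx]]

    # eps^2 / h is convex for h > 0; cvxpy expresses it with quad_over_lin
    unl_loss = sum(cp.quad_over_lin(eps_unl[idx], h[idx]) for idx in range(m))
    obj = cp.trace(L @ Z) + (c / 2) * (cp.sum_squares(eps_lab) + unl_loss)
    cp.Problem(cp.Minimize(obj), cons).solve(solver=cp.SCS)

    best = int(np.argmin(h.value))          # smallest h  <=>  largest p = 1/h
    return candidates[best]
```

The pre-filtering step itself is just the min-|Z| rule of Corollary 2 applied with a budget of 200 pairs.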
3. Experimental Results

In our experiments, we follow the work of (Hoi et al., 2007) and evaluate the proposed algorithm for active kernel learning through experiments on data clustering. More specifically, we first apply the active kernel learning algorithm to identify the most informative example pairs, and then solicit the class labels for the selected example pairs. A kernel matrix is learned from the labeled example pairs, and the learned kernel matrix is used by the clustering algorithm to find the right cluster structure.

3.1. Experimental Setup

We use the same datasets as the ones described in (Hoi et al., 2007). Table 1 summarizes the information about the nine datasets used in our study.

Table 1. The nine datasets used in our experiments. The first two are the artificial datasets from (Hoi et al., 2007) and the others are from the UCI machine learning repository.

Dataset        #Classes  #Instances  #Features
Chessboard            2         100          2
Double-Spiral         2         100          3
Glass                 6         214          9
Heart                 2         270         13
Iris                  3         150          4
Protein               6         116         20
Sonar                 2         208         60
Soybean               4          47         35
Wine                  3         178         12

We adopt the clustering accuracy defined in (Xing et al., 2002) as the evaluation metric. It is defined as follows:

\text{Accuracy} = \sum_{i > j} \frac{\mathbf{1}\{\mathbf{1}\{c_i = c_j\} = \mathbf{1}\{\hat{c}_i = \hat{c}_j\}\}}{0.5\, n (n-1)} \qquad (10)

where 1{.} is the indicator function that outputs 1 when the input argument is true and 0 otherwise, c_i and \hat{c}_i denote the true cluster membership and the predicted cluster membership of the i-th data point, respectively, and n is the number of examples in the dataset.

For the graph Laplacian L used by the non-parametric kernel learning, we apply the same standard procedure in all experiments: we calculate the distance matrix using the Euclidean distance, construct the adjacency matrix with five nearest neighbors, and finally normalize the graph to obtain the Laplacian matrix.

3.2. Performance Evaluation

To evaluate the quality of the learned kernels, we extend the proposed kernel learning algorithm to solve clustering problems with pairwise constraints. In the experiments, we employ kernel k-means as the clustering method, in which the kernel is learned by the proposed non-parametric kernel learning method. In addition to the proposed active kernel learning method, two baseline approaches are implemented to select informative example pairs for kernel learning. In total, we compare:

· Random: This baseline method randomly samples example pairs from the pool of unlabeled pairs.
· AKL-min-|Z|: This baseline method chooses the example pairs with the least |Z_{k,l}|, where the matrix Z is learned by the non-parametric kernel learning method. As already discussed in the introduction, this approach may not find the most informative example pairs.
· AKL-min-H: This is the proposed AKL algorithm. It selects the example pairs with the least h_{k,l}, which corresponds to the maximal selection probability p_{k,l}.

To examine the performance of the proposed AKL algorithm across a full spectrum, we evaluate the clustering results with respect to different sampling sizes. Specifically, for each experiment, we first randomly sample N_c pairwise constraints as the initially labeled example pairs. We then employ the non-parametric kernel learning method to learn a kernel from the given pairwise constraints. This learned kernel is used by the kernel k-means method for data clustering. Next, in each iteration, we apply the AKL method to sample 20 example pairs (i.e., 20 pairwise constraints) for labeling, and then examine the clustering results based on the kernel that is learned from the augmented set of example pairs.
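For concreteness, the evaluation protocol just described can be summarized as follows; this sketch is ours, and learn_kernel, kernel_kmeans, and select_pairs are placeholders for the non-parametric kernel learning step, the kernel k-means clustering step, and one of the three pair-selection strategies, respectively.

```python
import numpy as np

def pairwise_accuracy(true_labels, pred_labels):
    """Clustering accuracy of Eq. (10): fraction of pairs (i, j), i > j, on which
    the predicted partition agrees with the true partition."""
    n = len(true_labels)
    correct, total = 0, 0
    for i in range(n):
        for j in range(i):
            same_true = true_labels[i] == true_labels[j]
            same_pred = pred_labels[i] == pred_labels[j]
            correct += int(same_true == same_pred)
            total += 1
    return correct / total          # total = 0.5 * n * (n - 1)

def akl_experiment(true_labels, L, learn_kernel, kernel_kmeans, select_pairs,
                   n_init=20, n_iters=5, batch=20):
    """One run of the protocol: N_c initial constraints, then n_iters rounds of queries."""
    rng = np.random.default_rng(0)
    n = len(true_labels)
    all_pairs = [(i, j) for i in range(n) for j in range(i)]
    # initial random constraints, labeled according to the ground-truth partition
    init = rng.choice(len(all_pairs), size=n_init, replace=False)
    labeled = {all_pairs[t]: (+1 if true_labels[all_pairs[t][0]] == true_labels[all_pairs[t][1]] else -1)
               for t in init}
    accs = []
    for _ in range(n_iters + 1):
        Z = learn_kernel(L, labeled)                      # kernel learning, problem (1)
        pred = kernel_kmeans(Z, n_clusters=len(set(true_labels)))
        accs.append(pairwise_accuracy(true_labels, pred))
        new_pairs = select_pairs(Z, L, labeled, batch)    # Random / AKL-min-|Z| / AKL-min-H
        for (k, l) in new_pairs:
            labeled[(k, l)] = +1 if true_labels[k] == true_labels[l] else -1
    return accs
```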
Each experiment is repeated 50 times with multiple restarts for clustering. Figure 2 shows the experimental results on the nine datasets over five active kernel learning iterations. First of all, we observe that AKL-min-|Z|, i.e., the naive AKL approach that samples the example pairs with the least |Z|, does not always outperform the random sampling approach. In fact, it only outperforms the random sampling approach on five out of the nine datasets, and it performs noticeably worse than the random approach on the "Sonar" and "Heart" datasets. Compared with the two baseline approaches, the proposed AKL algorithm (i.e., AKL-min-H) achieves considerably better performance on most datasets. For example, for the "Double-Spiral" dataset, after three active kernel learning iterations the proposed algorithm is able to achieve a clustering accuracy of 99.6%, whereas the clustering accuracies of the other two methods are below 98.8%. These experimental results show the effectiveness of the proposed algorithm as a promising approach for active kernel learning.

[Figure 2: nine panels, one per dataset (Chessboard, N=100, C=2, D=2, Nc=20; Double-Spiral, N=100, C=2, D=3, Nc=20; Glass, N=214, C=6, D=9, Nc=100; Heart, N=270, C=2, D=13, Nc=100; Iris, N=150, C=3, D=4, Nc=100; Protein, N=116, C=6, D=20, Nc=100; Sonar, N=208, C=2, D=60, Nc=100; Soybean, N=47, C=4, D=35, Nc=20; Wine, N=178, C=3, D=12, Nc=100), each plotting Accuracy against the number of iterations (0-5) for the three methods Random, AKL-min-|Z|, and AKL-min-H.]
Figure 2. The clustering accuracy of different AKL methods for kernel k-means algorithms with non-parametric kernels learned from pairwise constraints. In each individual diagram, the three curves are, respectively, the random sampling method, the active kernel learning method that selects example pairs with the least |Z_{k,l}| (AKL-min-|Z|), and the active kernel learning method with minimal h values learned from our proposed algorithm (AKL-min-H). The details of the datasets are also shown in each diagram. In particular, N, C, D, and Nc respectively denote the dataset size, the number of classes, the number of features, and the number of initially sampled pairwise constraints. In each of the five iterations, 20 example pairs are sampled for labeling by the compared algorithms.

4. Conclusion

In this paper we proposed a min-max framework for active kernel learning that specifically addresses the problem of how to identify informative example pairs for efficient kernel learning. A promising algorithm is presented that approximates the original min-max optimization problem by a convex programming problem. Empirical evaluation based on the performance of data clustering showed that the proposed algorithm for active kernel learning is effective in identifying informative example pairs for the learning of the kernel matrix.

Acknowledgments

The work was supported in part by the National Science Foundation (IIS-0643494), the National Institutes of Health (1R01-GM079688-01), and the Singapore NTU AcRF Tier-1 Research Grant (RG67/07). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF and NIH.

Appendix A: Proof of Theorem 3

Proof. First, for any global optimal solution to (6), we have \sum_{(k,l)\notin\mathcal{T}} p_{k,l} = 1 even though the constraint in (6) is \sum_{(k,l)\notin\mathcal{T}} p_{k,l} \ge 1. This is because we can always scale down p_{k,l} if \sum_{(k,l)\notin\mathcal{T}} p_{k,l} > 1, which is guaranteed to reduce the objective function. Second, any extreme-point optimal solution to (6) (i.e., p_{k,l} = 1 for one example pair and zero for the other pairs) is also a global optimal solution to (5). This is because (6) is a relaxed version of (5). Third, one of the global optimal solutions to (6) is an extreme point. This is because the first-order condition of optimality requires p_{k,l}^* to be a solution to the following problem:

\min_{p} \; \frac{c}{2} \sum_{(k,l)\notin\mathcal{T}} p_{k,l} \, [\epsilon_{k,l}^*]^2 \qquad (11)
\mathrm{s.t.} \quad \sum_{(k,l)\notin\mathcal{T}} p_{k,l} \ge 1, \;\; p_{k,l} \ge 0, \;\; \forall (k,l)\notin\mathcal{T}

where \epsilon_{k,l}^* is the optimal solution for \epsilon_{k,l}. Since (11) is a linear optimization problem, it is well known that one of its global optimal solutions is an extreme point. Combining the above arguments, we conclude that there exists a global solution to (5), denoted by ((k,l)^*, Z^*, \epsilon_{i,j}^*), that is also a global solution to (6) with p^*_{(k,l)^*} = 1. We extend this conclusion to any other global solution ((k,l)', Z', \epsilon'_{i,j}) to (5), because ((k,l)', Z', \epsilon'_{i,j}) results in the same value for the problem in (6) as the solution ((k,l)^*, Z^*, \epsilon_{i,j}^*). This completes our proof.

Appendix B: Proof of Theorem 4

Proof. To show that (8) is an SDP problem, we introduce slack variables for both the labeled and the unlabeled example pairs, i.e., \lambda_{i,j} \ge \epsilon_{i,j}^2 and \gamma_{k,l} \ge \epsilon_{k,l}^2 / h_{k,l}. We can turn these two nonlinear constraints into LMI constraints, i.e.,

\begin{pmatrix} \lambda_{i,j} & \epsilon_{i,j} \\ \epsilon_{i,j} & 1 \end{pmatrix} \succeq 0, \qquad \begin{pmatrix} \gamma_{k,l} & \epsilon_{k,l} \\ \epsilon_{k,l} & h_{k,l} \end{pmatrix} \succeq 0

Using the slack variables, we rewrite (8) as

\min_{Z \succeq 0,\, h,\, \epsilon,\, \lambda,\, \gamma} \;\; \mathrm{tr}(LZ) + \frac{c}{2} \sum_{(i,j)\in\mathcal{T}} \lambda_{i,j} + \frac{c}{2} \sum_{(k,l)\notin\mathcal{T}} \gamma_{k,l} \qquad (12)
\mathrm{s.t.} \quad T_{i,j} Z_{i,j} \ge 1 - \epsilon_{i,j}, \;\; \epsilon_{i,j} \ge 0, \;\; \forall (i,j)\in\mathcal{T}
\qquad \epsilon_{k,l} - 1 \ge Z_{k,l} \ge 1 - \epsilon_{k,l}, \;\; \forall (k,l)\notin\mathcal{T}
\qquad \sum_{(k,l)\notin\mathcal{T}} h_{k,l} \le m^2, \;\; h_{k,l} \ge 1, \;\; \forall (k,l)\notin\mathcal{T}
\qquad \begin{pmatrix} \lambda_{i,j} & \epsilon_{i,j} \\ \epsilon_{i,j} & 1 \end{pmatrix} \succeq 0, \;\; \forall (i,j)\in\mathcal{T}, \qquad \begin{pmatrix} \gamma_{k,l} & \epsilon_{k,l} \\ \epsilon_{k,l} & h_{k,l} \end{pmatrix} \succeq 0, \;\; \forall (k,l)\notin\mathcal{T}

which is clearly an SDP problem.
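For the reader's convenience, we note the standard Schur-complement fact behind these two LMIs (it is not spelled out in the text): a symmetric 2x2 matrix is positive semi-definite if and only if its diagonal entries and its determinant are non-negative, i.e.,

\begin{pmatrix} a & b \\ b & d \end{pmatrix} \succeq 0 \;\; \Longleftrightarrow \;\; a \ge 0, \; d \ge 0, \; ad \ge b^2

Applied to the two matrices above, this recovers exactly \lambda_{i,j} \ge \epsilon_{i,j}^2 and, since h_{k,l} \ge 1 > 0, \gamma_{k,l} \ge \epsilon_{k,l}^2 / h_{k,l}.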
To show the second part of the theorem, we follow the inequality that a harmonic mean is upper bounded by an arithmetic mean: for p_{k,l} = h_{k,l}^{-1},

\frac{1}{m} \sum_{(k,l)\notin\mathcal{T}} p_{k,l} \;\ge\; \frac{m}{\sum_{(k,l)\notin\mathcal{T}} p_{k,l}^{-1}} \;=\; \frac{m}{\sum_{(k,l)\notin\mathcal{T}} h_{k,l}} \;\ge\; \frac{1}{m}

where the last step uses the constraint \sum_{(k,l)\notin\mathcal{T}} h_{k,l} \le m^2. Hence, any feasible solution to (8) is also a feasible solution to (6) with p_{k,l} = h_{k,l}^{-1}; in other words, (8) is a restricted version of (6), which leads to the conclusion that the optimal value of (8) provides an upper bound for that of (6).
Appendix C: Proof of Theorem 5

Proof. We first construct the Lagrangian function for the problem in (12):

\mathcal{L} = \mathrm{tr}(LZ) + \frac{c}{2} \sum_{(i,j)\in\mathcal{T}} \lambda_{i,j} + \frac{c}{2} \sum_{(k,l)\notin\mathcal{T}} \gamma_{k,l}
 - \sum_{(i,j)\in\mathcal{T}} Q_{i,j} \big( T_{i,j} Z_{i,j} + \epsilon_{i,j} - 1 \big)
 - \sum_{(i,j)\in\mathcal{T}} \big( \alpha_{i,j} \lambda_{i,j} + \beta_{i,j}/2 - 2 \nu_{i,j} \epsilon_{i,j} \big) - \mathrm{tr}(MZ)
 - \sum_{(k,l)\notin\mathcal{T}} s_{k,l} (h_{k,l} - 1) - \tau \Big( m^2 - \sum_{(k,l)\notin\mathcal{T}} h_{k,l} \Big)
 - \sum_{(k,l)\notin\mathcal{T}} \big( \theta_{k,l} \gamma_{k,l} + \phi_{k,l} h_{k,l}/2 - 2 \psi_{k,l} \epsilon_{k,l} \big)
 - \sum_{(k,l)\notin\mathcal{T}} \big( W_{k,l} Z_{k,l} + (\epsilon_{k,l} - 1) |W_{k,l}| \big)

In the above, we introduce the Lagrangian multipliers

\begin{pmatrix} \alpha_{i,j} & -\nu_{i,j} \\ -\nu_{i,j} & \beta_{i,j}/2 \end{pmatrix} \succeq 0 \quad \text{and} \quad \begin{pmatrix} \theta_{k,l} & -\psi_{k,l} \\ -\psi_{k,l} & \phi_{k,l}/2 \end{pmatrix} \succeq 0

for the constraints

\begin{pmatrix} \lambda_{i,j} & \epsilon_{i,j} \\ \epsilon_{i,j} & 1 \end{pmatrix} \succeq 0 \quad \text{and} \quad \begin{pmatrix} \gamma_{k,l} & \epsilon_{k,l} \\ \epsilon_{k,l} & h_{k,l} \end{pmatrix} \succeq 0,

respectively, together with M \succeq 0 for the constraint Z \succeq 0, Q_{i,j} \ge 0 for the labeled-pair constraints, W_{k,l} for the two linear constraints on each unlabeled pair, s_{k,l} \ge 0 for h_{k,l} \ge 1, and \tau \ge 0 for \sum_{(k,l)\notin\mathcal{T}} h_{k,l} \le m^2. By setting the derivatives of the Lagrangian with respect to the primal variables to zero, we have

\max \;\; \sum_{(i,j)\in\mathcal{T}} \Big( Q_{i,j} - \frac{\beta_{i,j}}{2} \Big) + \sum_{(k,l)\notin\mathcal{T}} \big( |W_{k,l}| + s_{k,l} \big) - m^2 \tau \qquad (13)
\mathrm{s.t.} \quad L \succeq Q \circ T + W \circ \bar{T}
\qquad \begin{pmatrix} c/2 & -Q_{i,j}/2 \\ -Q_{i,j}/2 & \beta_{i,j}/2 \end{pmatrix} \succeq 0, \;\; Q_{i,j} \ge 0, \;\; \forall (i,j)\in\mathcal{T}
\qquad \begin{pmatrix} c/2 & -|W_{k,l}|/2 \\ -|W_{k,l}|/2 & \tau - s_{k,l} \end{pmatrix} \succeq 0, \;\; s_{k,l} \ge 0, \;\; \forall (k,l)\notin\mathcal{T}

The two LMI constraints can be simplified as \beta_{i,j} \ge Q_{i,j}^2 / c and \tau - s_{k,l} \ge W_{k,l}^2 / (2c). Substituting \beta_{i,j} = Q_{i,j}^2 / c and s_{k,l} = \tau - W_{k,l}^2 / (2c) into (13), we obtain (9).

References

Belkin, M., & Niyogi, P. (2004). Regularization and semi-supervised learning on large graphs. Proceedings of the International Conference on Learning Theory (COLT).
Chapelle, O., Weston, J., & Scholkopf, B. (2003). Cluster kernels for semi-supervised learning. Advances in Neural Information Processing Systems (NIPS).
Chen, F., & Jin, R. (2007). Active algorithm selection. Proceedings of the Twenty-Second Conference on Artificial Intelligence (AAAI).
Cristianini, N., Shawe-Taylor, J., & Elisseeff, A. (2002). On kernel-target alignment. JMLR.
Hoi, S. C. H., Jin, R., & Lyu, M. R. (2007). Learning nonparametric kernel matrices from pairwise constraints. ICML 2007, Corvallis, OR, US.
Hoi, S. C. H., Jin, R., Zhu, J., & Lyu, M. R. (2006). Batch mode active learning and its application to medical image classification. ICML 2006 (pp. 417-424), Pittsburgh, Pennsylvania.
Kondor, R., & Lafferty, J. (2002). Diffusion kernels on graphs and other discrete structures. ICML 2002.
Kulis, B., Sustik, M., & Dhillon, I. S. (2006). Learning low-rank kernel matrices. ICML 2006 (pp. 505-512).
Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. (2004). Learning the kernel matrix with semi-definite programming. JMLR, 5, 27-72.
Scholkopf, B., & Smola, A. (2002). Learning with kernels. MIT Press.
Sturm, J. (1999). Using SeDuMi: a Matlab toolbox for optimization over symmetric cones. Optimization Methods and Software, 11-12, 625-653.
Tong, S., & Koller, D. (2000). Support vector machine active learning with applications to text classification. ICML 2000 (pp. 999-1006), Stanford, US.
Vapnik, V. N. (1998). Statistical learning theory. John Wiley & Sons.
Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (2002). Distance metric learning with application to clustering with side-information. NIPS 2002.
Zhu, X., Kandola, J., Ghahramani, Z., & Lafferty, J. (2005). Nonparametric transforms of graph kernels for semi-supervised learning. NIPS 2005.