Multi-label Multiple Kernel Learning

Shuiwang Ji, Arizona State University, Tempe, AZ 85287, shuiwang.ji@asu.edu
Rong Jin, Michigan State University, East Lansing, MI 48824, rongjin@cse.msu.edu
Liang Sun, Arizona State University, Tempe, AZ 85287, sun.liang@asu.edu
Jieping Ye, Arizona State University, Tempe, AZ 85287, jieping.ye@asu.edu

Abstract

We present a multi-label multiple kernel learning (MKL) formulation in which the data are embedded into a low-dimensional space directed by the instance-label correlations encoded into a hypergraph. We formulate the problem in the kernel-induced feature space and propose to learn the kernel matrix as a linear combination of a given collection of kernel matrices in the MKL framework. The proposed learning formulation leads to a non-smooth min-max problem, which can be cast into a semi-infinite linear program (SILP). We further propose an approximate formulation with a guaranteed error bound which involves an unconstrained convex optimization problem. In addition, we show that the objective function of the approximate formulation is differentiable with Lipschitz continuous gradient, and hence existing methods can be employed to compute the optimal solution efficiently. We apply the proposed formulation to the automated annotation of Drosophila gene expression pattern images, and promising results have been reported in comparison with representative algorithms.

1 Introduction

Spectral graph-theoretic methods have been used widely in unsupervised and semi-supervised learning recently. In this paradigm, a weighted graph is constructed for the data set, where the nodes represent the data points and the edge weights characterize the relationships between vertices. The structural and spectral properties of the graph can then be exploited to perform the learning task. One fundamental limitation of using traditional graphs for this task is that they can only represent pairwise relationships between data points, and hence higher-order information cannot be captured [1]. Hypergraphs [1, 2] generalize traditional graphs by allowing edges, called hyperedges, to connect more than two vertices, thereby being able to capture the relationships among multiple vertices.

In this paper, we propose to use a hypergraph to capture the correlation information for multi-label learning [3]. In particular, we propose to construct a hypergraph for multi-label data in which all data points annotated with a common label are included in a hyperedge, thereby capturing the similarity among data points with a common label. By exploiting the spectral properties of the constructed hypergraph, we propose to embed the multi-label data into a lower-dimensional space in which data points with a common label tend to be close to each other. We formulate the multi-label learning problem in the kernel-induced feature space, and show that the well-known kernel canonical correlation analysis (KCCA) [4] is a special case of the proposed framework. As the kernel plays an essential role in the formulation, we propose to learn the kernel matrix as a linear combination of a given collection of kernel matrices in the multiple kernel learning (MKL) framework. The resulting formulation involves a non-smooth min-max problem, and we show that it can be cast into a semi-infinite linear program (SILP). To further improve the efficiency and reduce the non-smoothness effect of the SILP formulation, we propose an approximate formulation by introducing a smoothing term into the original problem.
The resulting formulation is unconstrained and convex. In addition, the objective function of the approximate formulation is shown to be differentiable with Lipschitz continuous gradient. We can thus employ Nesterov's method [5, 6], which solves smooth convex problems with the optimal convergence rate, to compute the solution efficiently.

We apply the proposed formulation to the automated annotation of Drosophila gene expression pattern images, which document the spatial and temporal dynamics of gene expression during Drosophila embryogenesis [7]. Comparative analysis of such images can potentially reveal new genetic interactions and yield insights into the complex regulatory networks governing embryonic development. To facilitate pattern comparison and searching, groups of images are annotated with a variable number of labels by human curators in the Berkeley Drosophila Genome Project (BDGP) high-throughput study [7]. However, the number of available images produced by high-throughput in situ hybridization is rapidly increasing. It is therefore tempting to design computational methods to automate this task [8]. Since the labels are associated with groups containing a variable number of images, we propose to extract invariant features from each image and construct kernels between groups of images by employing the vocabulary-guided pyramid match algorithm [9]. By applying various local descriptors, we obtain multiple kernel matrices, and the proposed multi-label MKL formulation is applied to obtain an optimal kernel matrix for the low-dimensional embedding. Experimental results demonstrate the effectiveness of the kernel matrices obtained by the proposed formulation. Moreover, the approximate formulation is shown to yield results similar to those of the original formulation, while being much more efficient.

2 Multi-label Learning with Hypergraph

An essential issue in learning from multi-label data is how to exploit the correlation information among labels. We propose to capture such information through a hypergraph as described below.

2.1 Hypergraph Spectral Learning

Hypergraphs generalize traditional graphs by allowing hyperedges to connect more than two vertices, thus capturing the joint relationships among multiple vertices. We propose to construct a hypergraph for multi-label data in which each data point is represented as a vertex. To capture the joint similarity among data points annotated with a common label, we propose to construct a hyperedge for each label and include all data points annotated with that label in the hyperedge. Following the spectral graph embedding theory [10], we propose to compute the low-dimensional embedding through a linear transformation W by solving the following optimization problem:

\min_W \; \mathrm{tr}\left( W^T \phi(X) L \phi(X)^T W \right)    (1)
subject to  W^T \left( \phi(X)\phi(X)^T + \lambda I \right) W = I,

where \phi(X) = [\phi(x_1), \cdots, \phi(x_n)] is the data matrix consisting of the n data points in the feature space, \phi is the feature mapping, L is the normalized Laplacian matrix derived from the hypergraph, and \lambda > 0 is the regularization parameter. In this formulation, the instance-label correlations are encoded into L through the hypergraph, and data points sharing a common label tend to be close to each other in the embedded space. It follows from the representer theorem [11] that W = \phi(X) B for some matrix B \in \mathbb{R}^{n \times k}, where k is the number of labels.
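To make the construction above concrete, the following sketch builds the incidence matrix of the label hypergraph from a binary label-indicator matrix Y \in \mathbb{R}^{k \times n} (one hyperedge per label) and forms a normalized hypergraph Laplacian of the form L = I - C, following the definition in [2]; the experiments in this paper instead use the star- and clique-expansion Laplacians, which are known to be closely related [1]. The function and variable names here are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def hypergraph_laplacian(Y, edge_weights=None):
    """Normalized hypergraph Laplacian L = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}.

    Y            : (k, n) binary label-indicator matrix; each of the k labels
                   defines one hyperedge containing the data points it annotates.
    edge_weights : optional length-k hyperedge weights (all ones by default).
    Returns (L, C) with L = I - C, as used in Eq. (1).
    """
    k, n = Y.shape
    H = Y.T.astype(float)                        # (n, k) vertex-by-hyperedge incidence matrix
    w = np.ones(k) if edge_weights is None else np.asarray(edge_weights, dtype=float)
    d_e = H.sum(axis=0)                          # hyperedge degrees (points per label)
    d_v = H @ w                                  # vertex degrees (weighted labels per point)
    dv_isqrt = np.where(d_v > 0, 1.0 / np.sqrt(d_v), 0.0)
    C = (dv_isqrt[:, None] * H) @ np.diag(w / np.maximum(d_e, 1.0)) @ (H.T * dv_isqrt[None, :])
    L = np.eye(n) - C
    return L, C
```

Note that C constructed this way has rank at most k, which is consistent with the rank(C) = k condition invoked in Section 2.2 when defining H.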
By noting that L = I - C for some matrix C, the problem in Eq. (1) can be reformulated as

\max_B \; \mathrm{tr}\left( B^T K C K B \right)    (2)
subject to  B^T \left( K^2 + \lambda K \right) B = I,

where K = \phi(X)^T \phi(X) is the kernel matrix. Kernel canonical correlation analysis (KCCA) [4] is a widely-used method for dimensionality reduction. It can be shown [4] that KCCA is obtained by substituting C = Y^T (Y Y^T)^{-1} Y in Eq. (2), where Y \in \mathbb{R}^{k \times n} is the label indicator matrix. Thus, KCCA is a special case of the proposed formulation.

2.2 A Semi-infinite Linear Program Formulation

It follows from the theory of kernel methods [11] that the kernel K in Eq. (2) uniquely determines the feature mapping \phi. Thus, kernel selection (learning) is one of the central issues in kernel methods. Following the MKL framework [12], we propose to learn an optimal kernel matrix by integrating multiple candidate kernel matrices, that is,

K \in \mathcal{K} = \left\{ K = \sum_{j=1}^p \theta_j K_j \;:\; \theta^T e = 1, \; \theta \geq 0 \right\},    (3)

where \{K_j\}_{j=1}^p are the p candidate kernel matrices, \{\theta_j\}_{j=1}^p are the weights for the linear combination, and e is the vector of all ones of length p. We have assumed in Eq. (3) that all the candidate kernel matrices are normalized to have a unit trace value. It has been shown [8] that the optimal weights maximizing the objective function in Eq. (2) can be obtained by solving a semi-infinite linear program (SILP) [13], in which a linear objective is optimized subject to an infinite number of linear constraints, as summarized in the following theorem:

Theorem 2.1. Given a set of p kernel matrices \{K_j\}_{j=1}^p, the optimal kernel matrix in \mathcal{K} that maximizes the objective function in Eq. (2) can be obtained by solving the following SILP problem:

\max_{\gamma, \theta} \; \gamma, \quad \theta \geq 0, \; \theta^T e = 1,    (4)
subject to  \sum_{j=1}^p \theta_j S_j(Z) \geq \gamma, \quad \text{for all } Z \in \mathbb{R}^{n \times k},    (5)

where S_j(Z), for j = 1, \cdots, p, is defined as

S_j(Z) = \sum_{i=1}^k \left( \frac{1}{4\lambda} z_i^T z_i + \frac{1}{4} z_i^T K_j z_i - z_i^T h_i \right),    (6)

Z = [z_1, \cdots, z_k], H is obtained from C such that H H^T = C, and H = [h_1, \cdots, h_k].

Note that the matrix C is symmetric and positive semidefinite. Moreover, for the L considered in this paper, we have rank(C) = k. Hence, H \in \mathbb{R}^{n \times k} is always well-defined. The SILP formulation in Theorem 2.1 can be solved by the column generation technique as in [14].
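Before turning to the smoothed formulation, the following sketch illustrates two building blocks of the pipeline for a fixed weight vector \theta: forming the combined kernel of Eq. (3) and solving the embedding problem of Eq. (2). This is a minimal sketch under the assumption that the stationary conditions of Eq. (2) reduce to the generalized eigenproblem K C K b = \sigma (K^2 + \lambda K) b; the names combine_kernels and embed are ours, not the authors'.

```python
import numpy as np
from scipy.linalg import eigh

def combine_kernels(kernels, theta):
    """Convex combination K = sum_j theta_j K_j of trace-normalized kernels (Eq. (3))."""
    K = np.zeros_like(kernels[0])
    for K_j, t_j in zip(kernels, theta):
        K = K + t_j * K_j
    return K

def embed(K, C, lam, k):
    """Solve Eq. (2): max_B tr(B^T K C K B)  s.t.  B^T (K^2 + lam K) B = I.

    The stationary points satisfy K C K b = sigma (K^2 + lam K) b, so the top-k
    generalized eigenvectors give the coefficient matrix B of the embedding.
    """
    n = K.shape[0]
    A = K @ C @ K                                  # objective matrix (symmetric, PSD)
    M = K @ K + lam * K                            # constraint matrix
    M = M + 1e-8 * (np.trace(M) / n) * np.eye(n)   # small ridge so M is positive definite
    vals, vecs = eigh(A, M, subset_by_index=(n - k, n - 1))
    return vecs[:, ::-1]                           # columns ordered by decreasing eigenvalue
```

The small ridge added to K^2 + \lambda K guards against rank deficiency of K, since scipy.linalg.eigh requires the second matrix of the generalized problem to be positive definite.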
3 The Approximate Formulation

The multi-label kernel learning formulation proposed in Theorem 2.1 involves optimizing a linear objective subject to an infinite number of constraints. The column generation technique used to solve this problem adds constraints to the problem successively until all the constraints are satisfied. Since the convergence rate of this algorithm is slow, the problem solved at each iteration may involve a large number of constraints, and hence is computationally expensive. In this section, we propose an approximate formulation by introducing a smoothing term into the original problem. This results in an unconstrained and smooth convex problem. We propose to employ existing methods to solve the smooth convex optimization problem efficiently in the next section.

By rewriting the formulation in Theorem 2.1 as

\max_{\theta:\, \theta^T e = 1,\, \theta \geq 0} \; \min_Z \; \sum_{j=1}^p \theta_j S_j(Z)

and exchanging the minimization and the maximization, the SILP formulation can be expressed as

\min_Z f(Z),    (7)

where f(Z) is defined as

f(Z) = \max_{\theta:\, \theta^T e = 1,\, \theta \geq 0} \; \sum_{j=1}^p \theta_j S_j(Z).    (8)

The maximization problem in Eq. (8) with respect to \theta leads to a non-smooth objective function for f(Z). To reduce this effect, we introduce a smoothing term and modify the objective to f_\mu(Z) as

f_\mu(Z) = \max_{\theta:\, \theta^T e = 1,\, \theta \geq 0} \; \left( \sum_{j=1}^p \theta_j S_j(Z) - \mu \sum_{j=1}^p \theta_j \log \theta_j \right),    (9)

where \mu is a positive constant controlling the approximation. The following lemma shows that the problem in Eq. (9) can be solved analytically:

Lemma 3.1. The optimization problem in Eq. (9) can be solved analytically, and the optimal value can be expressed as

f_\mu(Z) = \mu \log \left( \sum_{j=1}^p \exp\left( \frac{S_j(Z)}{\mu} \right) \right).    (10)

Proof. Define the Lagrangian function for the optimization problem in Eq. (9) as

L = \sum_{j=1}^p \theta_j S_j(Z) - \mu \sum_{j=1}^p \theta_j \log \theta_j + \sum_{j=1}^p \eta_j \theta_j + \gamma \left( \sum_{j=1}^p \theta_j - 1 \right),    (11)

where \{\eta_j\}_{j=1}^p and \gamma are Lagrangian dual variables. Taking the derivative of the Lagrangian function with respect to \theta_j and setting it to zero, we obtain \theta_j = \exp\left( (S_j(Z) + \eta_j + \gamma - \mu)/\mu \right). It follows from the complementarity condition that \eta_j \theta_j = 0 for j = 1, \cdots, p. Since \theta_j \neq 0, we have \eta_j = 0 for j = 1, \cdots, p. By removing \{\eta_j\}_{j=1}^p and substituting \theta_j into the objective function in Eq. (9), we obtain f_\mu(Z) = \mu - \gamma. Since \mu - \gamma = S_j(Z) - \mu \log \theta_j, we have

\theta_j = \exp\left( (S_j(Z) - f_\mu(Z))/\mu \right).    (12)

Following 1 = \sum_{j=1}^p \theta_j = \sum_{j=1}^p \exp\left( (S_j(Z) - f_\mu(Z))/\mu \right), we obtain Eq. (10).

The above discussion shows that we can approximate the original non-smooth constrained min-max problem in Eq. (7) by the following smooth unconstrained minimization problem:

\min_Z f_\mu(Z),    (13)

where f_\mu(Z) is defined in Eq. (10). We show in the following two lemmas that the approximate formulation in Eq. (13) is convex and has a guaranteed approximation bound controlled by \mu.

Lemma 3.2. The problem in Eq. (13) is a convex optimization problem.

Proof. The optimization problem in Eq. (13) can be expressed equivalently as

\min_{Z, \{u_j\}_{j=1}^p, \{v_j\}_{j=1}^p} \; \mu \log \left( \sum_{j=1}^p \exp\left( u_j + v_j - \frac{1}{\mu} \sum_{i=1}^k z_i^T h_i \right) \right)    (14)
subject to  \mu u_j \geq \frac{1}{4\lambda} \sum_{i=1}^k z_i^T z_i, \quad \mu v_j \geq \frac{1}{4} \sum_{i=1}^k z_i^T K_j z_i, \quad j = 1, \cdots, p.

Since the log-sum-of-exponentials function is convex and the two constraints are second-order cone constraints, the problem in Eq. (13) is a convex optimization problem.

Lemma 3.3. Let f(Z) and f_\mu(Z) be defined as above. Then we have f_\mu(Z) \geq f(Z) and |f_\mu(Z) - f(Z)| \leq \mu \log p.

Proof. The term -\sum_{j=1}^p \theta_j \log \theta_j is the entropy of \{\theta_j\}_{j=1}^p when it is considered as a probability distribution, since \theta \geq 0 and \theta^T e = 1. Hence, this term is non-negative and f_\mu(Z) \geq f(Z). It is known from the properties of entropy that -\sum_{j=1}^p \theta_j \log \theta_j is maximized by the uniform distribution, i.e., \theta_j = 1/p for j = 1, \cdots, p. Thus, we have -\sum_{j=1}^p \theta_j \log \theta_j \leq \log p and |f_\mu(Z) - f(Z)| \leq -\mu \sum_{j=1}^p \theta_j \log \theta_j \leq \mu \log p. This completes the proof of the lemma.
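The closed form in Eq. (10) and the weight recovery in Eq. (12) translate directly into code. The sketch below evaluates the scores S_j(Z) of Eq. (6), the smoothed objective f_\mu(Z), and the kernel weights \theta, using a log-sum-exp evaluation for numerical stability. The factorization H H^T = C is obtained here from an eigendecomposition of C, which is one convenient choice (any factorization of the rank-k matrix C would do); all names are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def factorize_C(C, k):
    """Return H with H H^T ~ C, using the top-k eigenpairs of the rank-k PSD matrix C."""
    vals, vecs = np.linalg.eigh(C)
    return vecs[:, -k:] * np.sqrt(np.maximum(vals[-k:], 0.0))

def scores(Z, kernels, H, lam):
    """Per-kernel scores S_j(Z) of Eq. (6)."""
    base = np.trace(Z.T @ Z) / (4.0 * lam) - np.trace(Z.T @ H)
    return np.array([base + np.trace(Z.T @ K_j @ Z) / 4.0 for K_j in kernels])

def f_mu(Z, kernels, H, lam, mu):
    """Smoothed objective of Eq. (10): f_mu(Z) = mu * log(sum_j exp(S_j(Z)/mu))."""
    return mu * logsumexp(scores(Z, kernels, H, lam) / mu)

def kernel_weights(Z, kernels, H, lam, mu):
    """Optimal kernel weights of Eq. (12): theta_j = exp((S_j(Z) - f_mu(Z)) / mu)."""
    S = scores(Z, kernels, H, lam)
    theta = np.exp(S / mu - logsumexp(S / mu))   # softmax of S/mu, evaluated stably
    return theta / theta.sum()                   # renormalize against round-off
```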
4 Solving the Approximate Formulation Using Nesterov's Method

Nesterov's method (known as "the optimal method" in [5]) is an algorithm for solving smooth convex problems with the optimal rate of convergence. In this method, the objective function needs to be differentiable with Lipschitz continuous gradient. In order to apply this method to the proposed approximate formulation, we first compute the Lipschitz constant for the gradient of the function f_\mu(Z), as summarized in the following lemma:

Lemma 4.1. Let f_\mu(Z) be defined as in Eq. (10). Then the Lipschitz constant L of the gradient of f_\mu(Z) can be bounded from above as

L \leq L_\mu,    (15)

where L_\mu is defined as

L_\mu = \frac{1}{2\lambda} + \frac{1}{2} \max_{1 \leq j \leq p} \lambda_{\max}(K_j) + \frac{1}{8\mu} \mathrm{tr}(Z^T Z) \max_{1 \leq i,j \leq p} \lambda_{\max}\left( (K_i - K_j)(K_i - K_j)^T \right),    (16)

and \lambda_{\max}(\cdot) denotes the maximum eigenvalue. Moreover, the distance from the origin to the optimal set of Z can be bounded as \mathrm{tr}(Z^T Z) \leq R_\mu^2, where R_\mu^2 is defined as

R_\mu^2 = \sum_{i=1}^k \left( \|[C_j]_i\|_2 + 2\sqrt{ \lambda \left( \mu \log p + \frac{1}{4} \mathrm{tr}\left( C_j^T \left( \frac{1}{\lambda} I + K_j \right) C_j \right) \right) } \right)^2,    (17)

where C_j = 2\left( \frac{1}{\lambda} I + K_j \right)^{-1} H, [C_j]_i denotes the ith column of C_j, and j may be taken as any fixed index in \{1, \cdots, p\}.

Proof. To compute the Lipschitz constant for the gradient of f_\mu(Z), we first compute the first- and second-order derivatives as follows:

\nabla f_\mu(Z) = \sum_{j=1}^p g_j \left( \frac{\mathrm{vec}(Z)}{2\lambda} + \frac{\mathrm{vec}(K_j Z)}{2} \right) - \mathrm{vec}(H),    (18)

\nabla^2 f_\mu(Z) = \sum_{j=1}^p g_j \left( \frac{1}{2\lambda} I + \frac{1}{2} D_k(K_j) \right) + \frac{1}{8\mu} \sum_{i,j=1}^p g_i g_j \left( \mathrm{vec}(K_i Z) - \mathrm{vec}(K_j Z) \right) \left( \mathrm{vec}(K_i Z) - \mathrm{vec}(K_j Z) \right)^T,    (19)

where \mathrm{vec}(\cdot) converts a matrix into a vector, D_k(K_j) \in \mathbb{R}^{nk \times nk} is a block diagonal matrix whose k diagonal blocks are all equal to K_j, and g_j = \exp(S_j(Z)/\mu) / \sum_{i=1}^p \exp(S_i(Z)/\mu). Then we have

L \leq \frac{1}{2\lambda} + \frac{1}{2} \max_{1 \leq j \leq p} \lambda_{\max}(K_j) + \frac{1}{8\mu} \max_{1 \leq i,j \leq p} \mathrm{tr}\left( Z^T (K_i - K_j)(K_i - K_j)^T Z \right) \leq L_\mu,

where L_\mu is defined in Eq. (16). We next derive an upper bound for \mathrm{tr}(Z^T Z). To this end, we first rewrite S_j(Z) as

S_j(Z) = \frac{1}{4} \mathrm{tr}\left( (Z - C_j)^T \left( \frac{1}{\lambda} I + K_j \right) (Z - C_j) \right) - \frac{1}{4} \mathrm{tr}\left( C_j^T \left( \frac{1}{\lambda} I + K_j \right) C_j \right).

Since \min_Z f_\mu(Z) \leq f_\mu(0) = \mu \log p, and f_\mu(Z) \geq S_j(Z), we have S_j(Z) \leq \mu \log p for j = 1, \cdots, p at the optimal solution. It follows that

\frac{1}{4} \mathrm{tr}\left( (Z - C_j)^T \left( \frac{1}{\lambda} I + K_j \right) (Z - C_j) \right) \leq \mu \log p + \frac{1}{4} \mathrm{tr}\left( C_j^T \left( \frac{1}{\lambda} I + K_j \right) C_j \right).

By using this inequality, it can be verified that \mathrm{tr}(Z^T Z) \leq R_\mu^2, where R_\mu^2 is defined in Eq. (17).

Nesterov's method for solving the proposed approximate formulation is presented in Table 1.

Table 1: Nesterov's method for solving the proposed multi-label MKL formulation.
· Initialize X^0 = Z^1 = Q^0 = 0 \in \mathbb{R}^{n \times k}, t_0 = 1, L_0 = \frac{1}{2\lambda} + \frac{1}{2} \max_{1 \leq j \leq p} \lambda_{\max}(K_j), and \mu = 1/N, where N is the predefined number of iterations
· for i = 1, \cdots, N do
·   Set X^i = Z^i - \frac{1}{t_{i-1}} (Z^i + Q^{i-1})
·   Compute f_\mu(X^i) and \nabla f_\mu(X^i)
·   Set L = L_{i-1}
·   while f_\mu\left( X^i - \nabla f_\mu(X^i)/L \right) > f_\mu(X^i) - \frac{1}{2L} \mathrm{tr}\left( (\nabla f_\mu(X^i))^T \nabla f_\mu(X^i) \right) do
·     L = L \times 2
·   end while
·   Set L_i = L
·   Set Z^{i+1} = X^i - \frac{1}{L_i} \nabla f_\mu(X^i), \quad Q^i = Q^{i-1} + \frac{t_{i-1}}{L_i} \nabla f_\mu(X^i)
·   Set t_i = \frac{1}{2}\left( 1 + \sqrt{1 + 4 t_{i-1}^2} \right)
· end for

After the optimal Z is obtained from Nesterov's method, the optimal weights \{\theta_j\}_{j=1}^p can be computed from Eq. (12). It follows from the convergence proof in [5] that after N iterations, as long as f_\mu(X^i) \leq f_\mu(X^0) for i = 1, \cdots, N, we have

f_\mu(Z^{N+1}) - f_\mu(Z^*) \leq \frac{4 L_\mu R_\mu^2}{(N+1)^2},    (20)

where Z^* = \arg\min_Z f_\mu(Z). Furthermore, since f_\mu(Z^{N+1}) \geq f(Z^{N+1}) and f_\mu(Z^*) \leq f(Z^*) + \mu \log p, we have

f(Z^{N+1}) - f(Z^*) \leq \mu \log p + \frac{4 L_\mu R_\mu^2}{(N+1)^2}.    (21)

By setting \mu = O(1/N), we have L_\mu = O(1/\mu) = O(N). Hence, the convergence rate of Nesterov's method is on the order of O(1/N). This is significantly better than the convergence rates of O(1/N^{1/3}) and O(1/N^{1/2}) for the SILP and the gradient descent methods, respectively.
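For illustration, the following is a minimal sketch of the procedure in Table 1, reusing scores and f_mu from the previous sketch and computing the gradient of Eq. (18) in matrix form, \nabla f_\mu(Z) = Z/(2\lambda) + \sum_j g_j K_j Z / 2 - H. It follows the reconstructed pseudo-code above under the assumptions stated there and is not the authors' implementation.

```python
import numpy as np
from scipy.special import logsumexp
# reuses scores(...) and f_mu(...) from the previous sketch

def grad_f_mu(Z, kernels, H, lam, mu):
    """Gradient of the smoothed objective in matrix form (cf. Eq. (18))."""
    S = scores(Z, kernels, H, lam)
    g = np.exp(S / mu - logsumexp(S / mu))       # g_j = exp(S_j/mu) / sum_i exp(S_i/mu)
    G = Z / (2.0 * lam) - H
    for g_j, K_j in zip(g, kernels):
        G = G + 0.5 * g_j * (K_j @ Z)
    return G

def nesterov_mkl(kernels, H, lam, N, k):
    """Sketch of the procedure in Table 1 (illustrative names, not the authors' code)."""
    n = kernels[0].shape[0]
    mu = 1.0 / N
    L = 1.0 / (2.0 * lam) + 0.5 * max(np.linalg.eigvalsh(K_j)[-1] for K_j in kernels)  # L_0
    X = np.zeros((n, k))
    Z = np.zeros((n, k))
    Q = np.zeros((n, k))
    t_prev = 1.0
    for _ in range(N):
        X = Z - (Z + Q) / t_prev                                  # X^i
        fX = f_mu(X, kernels, H, lam, mu)
        G = grad_f_mu(X, kernels, H, lam, mu)
        gsq = float(np.sum(G * G))
        while f_mu(X - G / L, kernels, H, lam, mu) > fX - gsq / (2.0 * L):
            L *= 2.0                                              # backtracking on L_i
        Z = X - G / L                                             # Z^{i+1}
        Q = Q + (t_prev / L) * G                                  # Q^i
        t_prev = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2))   # t_i
    return Z
```

After the loop, the kernel weights can be recovered by calling kernel_weights(Z, kernels, H, lam, mu) from the previous sketch, per Eq. (12).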
5 Experiments

In this section, we evaluate the proposed formulation on the automated annotation of gene expression pattern images. The performance of the approximate formulation is also validated.

Experimental Setup The experiments use a collection of gene expression pattern images retrieved from the FlyExpress database (http://www.flyexpress.net). We apply nine local descriptors (SIFT, shape context, PCA-SIFT, spin image, steerable filters, differential invariants, complex filters, moment invariants, and cross correlation) on regular grids of 16 and 32 pixels in radius and spacing on each image. These local descriptors are commonly used in computer vision problems [15]. We also apply Gabor filters with different wavelet scales and filter orientations on each image to obtain global features of 384 and 2592 dimensions. Moreover, we sample the pixel values of each image to obtain features of 10240, 2560, and 640 dimensions. After generating the features, we apply the vocabulary-guided pyramid match algorithm [9] to construct kernels between the image sets. A total of 23 kernel matrices (2 grid sizes × 9 local descriptors + 2 Gabor + 3 pixel) are constructed. The proposed MKL formulation is then employed to obtain the optimal integrated kernel matrix, based on which the low-dimensional embedding is computed. We use the expansion-based approach (star and clique) to construct the hypergraph Laplacian, since it has been shown [1] that the Laplacians constructed in this way are similar to those obtained directly from a hypergraph. The performance of kernel matrices (either single or integrated) is evaluated by training a support vector machine (SVM) for each term using the one-against-rest scheme. The F1 score is used as the performance measure, and both macro-averaged and micro-averaged F1 scores across labels are reported. In each case, the entire data set is randomly partitioned into training and test sets with a ratio of 1:1. This process is repeated ten times, and the averaged performance is reported.
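As a reference for the evaluation protocol just described, the sketch below trains one binary SVM per label on a precomputed kernel matrix (single or integrated) and reports macro- and micro-averaged F1 scores for one random split. scikit-learn is assumed here purely for illustration, and the regularization parameter C is a placeholder rather than the value used in the experiments.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def evaluate_kernel(K, Y, train_idx, test_idx, C=1.0):
    """One-against-rest SVMs on a precomputed kernel, scored by macro/micro F1.

    K : (n, n) kernel matrix between image sets (a single kernel or the integrated one).
    Y : (n, k) binary label matrix; one binary SVM is trained per label (term).
    """
    K_tr = K[np.ix_(train_idx, train_idx)]
    K_te = K[np.ix_(test_idx, train_idx)]
    pred = np.zeros((len(test_idx), Y.shape[1]), dtype=int)
    for j in range(Y.shape[1]):
        y_tr = Y[train_idx, j]
        if y_tr.min() == y_tr.max():             # label constant in this training split
            pred[:, j] = y_tr[0]
            continue
        clf = SVC(kernel="precomputed", C=C)
        clf.fit(K_tr, y_tr)
        pred[:, j] = clf.predict(K_te)
    y_te = Y[test_idx]
    macro = f1_score(y_te, pred, average="macro", zero_division=0)
    micro = f1_score(y_te, pred, average="micro", zero_division=0)
    return macro, micro
```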
Performance Evaluation It can be observed from Tables 2 and 3 that, in terms of both macro and micro F1 scores, the kernels integrated by either star or clique expansion achieve the highest performance on almost all of the data sets. In particular, the integrated kernels outperform the best individual kernel significantly on all data sets. This shows that the proposed formulation is effective in combining multiple kernels and exploiting the complementary information contained in different kernels constructed from various features. Moreover, the proposed formulation based on a hypergraph outperforms the classical KCCA consistently.

Table 2: Performance of integrated kernels and the best individual kernel (denoted as BIK) in terms of macro F1 score. The numbers of terms used are 20, 30, and 40, and the numbers of image sets used are 1000, 1500, and 2000. "SILP", "APP", "SVM1", and "Uniform" denote the performance of kernels combined with the SILP formulation, the approximate formulation, the 1-norm SVM formulation proposed in [12] applied to each label separately, and the case where all kernels are given the same weight, respectively. The subscripts "star" and "clique" denote the way the Laplacian is constructed, and "KCCA" denotes the case where C = Y^T (Y Y^T)^{-1} Y.

# of labels            20                        30                        40
# of sets        1000   1500   2000       1000   1500   2000       1000   1500   2000
SILP_star       0.4396 0.4903 0.4575     0.3852 0.4437 0.4162     0.3768 0.4019 0.3927
SILP_clique     0.4536 0.5125 0.4926     0.4065 0.4747 0.4563     0.4145 0.4346 0.4283
SILP_KCCA       0.3987 0.4635 0.4477     0.3497 0.4240 0.4063     0.3538 0.3872 0.3759
APP_star        0.4404 0.4930 0.4703     0.3896 0.4494 0.4267     0.3900 0.4100 0.3983
APP_clique      0.4510 0.5125 0.4917     0.4060 0.4741 0.4563     0.4180 0.4338 0.4281
APP_KCCA        0.4029 0.4805 0.4586     0.3571 0.4313 0.4146     0.3642 0.3914 0.3841
SVM1            0.3780 0.4640 0.4356     0.3523 0.4352 0.4200     0.3741 0.4048 0.3955
Uniform         0.3727 0.4703 0.4480     0.3513 0.4410 0.4191     0.3719 0.4111 0.3986
BIK             0.4241 0.4515 0.4344     0.3782 0.4312 0.3996     0.3914 0.3954 0.3827

Table 3: Performance in terms of micro F1 score. See the caption of Table 2 for explanations.

# of labels            20                        30                        40
# of sets        1000   1500   2000       1000   1500   2000       1000   1500   2000
SILP_star       0.4861 0.5199 0.4847     0.4472 0.4837 0.4473     0.4277 0.4470 0.4305
SILP_clique     0.5039 0.5422 0.5247     0.4682 0.5127 0.4894     0.4610 0.4796 0.4660
SILP_KCCA       0.4581 0.4994 0.4887     0.4209 0.4737 0.4532     0.4095 0.4420 0.4271
APP_star        0.4852 0.5211 0.4973     0.4484 0.4875 0.4582     0.4355 0.4541 0.4346
APP_clique      0.5013 0.5421 0.5239     0.4673 0.5124 0.4894     0.4633 0.4793 0.4658
APP_KCCA        0.4612 0.5174 0.5018     0.4299 0.4828 0.4605     0.4194 0.4488 0.4350
SVM1            0.4361 0.5024 0.4844     0.4239 0.4844 0.4632     0.3947 0.4234 0.4188
Uniform         0.4390 0.5096 0.4975     0.4242 0.4939 0.4683     0.3999 0.4358 0.4226
BIK             0.4614 0.4735 0.4562     0.4189 0.4484 0.4178     0.3869 0.3905 0.3781

SILP versus the Approximate Formulation In terms of classification performance, we can observe from Tables 2 and 3 that the SILP and the approximate formulations are similar. More precisely, the approximate formulation performs slightly better than SILP in almost all cases. This may be due to the smooth nature of the approximate formulation and the simplicity of the computational procedure employed in Nesterov's method, which make it less prone to numerical problems. Figure 1 compares the computation time and the kernel weights of SILP_star and APP_star. It can be observed that, in general, the approximate formulation is significantly faster than SILP, especially when the number of labels and the number of image sets are large, while both yield very similar kernel weights.

[Figure 1: two panels; (a) computation time in seconds versus the number of labels and image sets, (b) weights assigned to each of the 23 kernels.]
Figure 1: Comparison of computation time and kernel weights for SILP_star and APP_star. The left panel plots the computation time of the two formulations on one partition of the data set as the number of labels and image sets increases gradually, and the right panel plots the weights assigned to each of the 23 kernels by SILP_star and APP_star on a data set of 40 labels and 1000 image sets.

6 Conclusions and Future Work

We present a multi-label learning formulation that incorporates instance-label correlations through a hypergraph. We formulate the problem in the kernel-induced feature space and propose to learn the kernel matrix in the MKL framework. The resulting formulation leads to a non-smooth min-max problem, which can be cast as an SILP. We propose an approximate formulation by introducing a smoothing term and show that the resulting formulation is an unconstrained convex problem that can be solved by Nesterov's method. We demonstrate the effectiveness and efficiency of the method on the task of automated annotation of gene expression pattern images.

The experiments in this paper focus on the annotation of gene expression pattern images. The proposed formulation can also be applied to the task of multiple object recognition in computer vision. We plan to pursue other applications in the future.
Experimental results indicate that the best individual kernel may not be assigned a large weight by the proposed MKL formulation. We plan to perform a detailed analysis of the weights in the future.

Acknowledgements

This work is supported in part by research grants from the National Institutes of Health (HG002516 and 1R01-GM079688-01) and the National Science Foundation (IIS-0612069 and IIS-0643494).

References

[1] S. Agarwal, K. Branson, and S. Belongie. Higher order learning with graphs. In ICML, pages 17-24, 2006.
[2] D. Zhou, J. Huang, and B. Schölkopf. Learning with hypergraphs: Clustering, classification, and embedding. In NIPS, pages 1601-1608, 2007.
[3] Z. H. Zhou and M. L. Zhang. Multi-instance multi-label learning with application to scene classification. In NIPS, pages 1609-1616, 2007.
[4] D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639-2664, 2004.
[5] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2003.
[6] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.
[7] P. Tomancak et al. Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biology, 3(12), 2002.
[8] S. Ji, L. Sun, R. Jin, S. Kumar, and J. Ye. Automated annotation of Drosophila gene expression patterns using a controlled vocabulary. Bioinformatics, 24(17):1881-1888, 2008.
[9] K. Grauman and T. Darrell. Approximate correspondences in high dimensions. In NIPS, pages 505-512, 2006.
[10] F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
[11] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.
[12] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.
[13] R. Hettich and K. O. Kortanek. Semi-infinite programming: Theory, methods, and applications. SIAM Review, 35(3):380-429, 1993.
[14] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531-1565, 2006.
[15] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615-1630, 2005.