Robust Matching and Recognition using Context-Dependent Kernels

Hichem Sahbi (1), sahbi@telecom-paristech.fr
Jean-Yves Audibert (2,3), audibert@certis.enpc.fr
Jaonary Rabarisoa (2) and Renaud Keriven (2), {rabariso,keriven}@certis.enpc.fr

(1) CNRS - LTCI, UMR 5141, Telecom ParisTech, 46 rue Barrault, 75013 Paris, France.
(2) Certis - Ecole des Ponts, 6 avenue Blaise Pascal, Cité Descartes, 77455 Marne-la-Vallée, France.
(3) Willow - ENS / INRIA, 45 rue d'Ulm, 75005 Paris, France.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

The success of kernel methods, including support vector machines (SVMs), strongly depends on the design of appropriate kernels. While kernels were initially designed to handle fixed-length data, their extension to unordered, variable-length data has become more than necessary for real pattern recognition problems such as object recognition and bioinformatics. In this paper we focus on object recognition using a new type of kernel referred to as "context-dependent". Objects, seen as constellations of local features (interest points, regions, etc.), are matched by minimizing an energy function mixing (1) a fidelity term which measures the quality of feature matching, (2) a neighborhood criterion which captures the object geometry and (3) a regularization term. We will show that the fixed point of this energy is a "context-dependent" kernel ("CDK") which also satisfies the Mercer condition. Experiments conducted on object recognition show that when plugging our kernel into SVMs, we clearly outperform SVMs with "context-free" kernels.

1. Introduction

Object recognition is one of the biggest challenges in vision and interest in it is still growing (Everingham et al., 2007). Among existing methods, those based on machine learning (ML) are of particular interest as they perform well and are theoretically well grounded (Bishop, 2007). ML approaches, such as the popular support vector networks (Boser et al., 1992), basically require the design of similarity measures, also referred to as kernels, which should provide high values when two objects share similar structures/appearances and should be invariant, as much as possible, to linear and non-linear transformations. Kernel-based object recognition methods were initially holistic, i.e., each object is mapped into one or multiple fixed-length vectors and a similarity, based on color, texture or shape (Swain & Ballard, 1991; Chapelle et al., 1999), is then defined. Local kernels, i.e., those based on bags or local sets, were introduced in order to represent data which cannot be represented by ordered and fixed-length feature vectors, such as graphs, trees, interest points, etc. (Gartner, 2003). It is well known that both holistic and local kernels should satisfy certain properties, among them positive definiteness, low evaluation complexity, flexibility in order to handle variable-length data, and invariance. Holistic kernels have the advantage of being simple to evaluate and discriminating, but they are less flexible than local kernels when it comes to handling invariance (in object recognition, invariance means robustness to occlusion, geometric transformations and illumination). While the design of kernels gathering flexibility, invariance and low complexity is a challenging task, the proof of their positive definiteness is sometimes harder (Cuturi, 2005).
This property, also known as the Mercer condition, ensures, according to Vapnik's SVM theory (Vapnik, 1998), optimal generalization performance and also the uniqueness of the SVM solution.

Consider a database of objects (images), each one seen as a constellation of local features, for instance interest points (Schmid & Mohr, 1997; Lowe, 2004; Lazebnik et al., 2004) extracted using any suitable detector (Harris & Stephens, 1988). Original holistic kernels explicitly (or implicitly) map objects into fixed-length feature vectors and take the similarity as a decreasing function of any well-defined distance (Barla et al., 2002). In contrast to holistic kernels, local ones are designed to handle variable-length and unordered data. Two families of local kernels can be found in the literature: those based on statistical "length-insensitive" measures such as the Kullback-Leibler divergence, and those which require a preliminary alignment step.

In the first family, the authors in (Kondor & Jebara, 2003; Moreno et al., 2003) estimate for each object (constellation of local features) a probability distribution and compute the similarity between two objects (two distributions) using the "Kullback-Leibler divergence" in (Moreno et al., 2003) and the "Bhattacharyya affinity" in (Kondor & Jebara, 2003). Only the function in (Kondor & Jebara, 2003) satisfies the Mercer condition, and both kernels were applied to image recognition tasks. In (Wolf & Shashua, 2003), the authors discuss a new type of kernel referred to as "principal angles", which is positive definite. Its definition is based on the computation of the principal angles between two linear subspaces under an orthogonality constraint. The authors demonstrate the validity of their method on visual recognition tasks including motion-trajectory classification and face recognition. An extension to subsets of varying cardinality is proposed in (Shashua & Hazan, 2004). The main drawback of some methods in this first family resides in the strong assumptions made about the probabilistic models used to approximate the set of local features, which may not hold in practice.

In the second family, the "max" kernel (Wallraven et al., 2003) takes the similarity function between two feature sets to be the sum of their matching scores and, contrary to what is discussed in (Wallraven et al., 2003), this kernel is actually not Mercer (Bahlmann et al., 2002). In (Lyu, 2005), the authors introduced the "circular-shift" kernel, defined as a weighted combination of Mercer kernels using an exponent. The latter is chosen in order to give more prominence to the largest terms, so the resulting similarity function approximates the "max" while satisfying the Mercer condition. The authors combined local features and their relative angles in order to make their kernel rotation invariant, and they show its performance for the particular task of object recognition. In (Boughorbel, 2005), the authors introduced the "intermediate" matching kernel for object recognition, which uses virtual local features in order to approximate the "max" while satisfying the Mercer condition.

                Naive matching          Context-dependent
           'S'     'i'     'r'        'S'     'i'     'r'
  'H'       0       0       0          0       0       0
  'i'       0       1       0          0      .36      0
  '-'       0       0       0          0       0       0
  'S'       1       0       0         .38      0       0
  'i'       0       1       0          0      .39      0
  'r'       0       0       1          0       0      .38

Table 1. This table shows a simple comparison between similarity measures when using naive matching (left) and context-dependent matching (right).
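The following minimal Python sketch (ours, not from the paper's experiments) reproduces the naive half of Table 1 by scoring every character of "Hi Sir" against "Sir" with the similarity 1_{c1=c2}; it makes the ambiguity explicit: both occurrences of 'i' get exactly the same score, so a context-free alignment cannot tell them apart.

```python
# Naive, context-free character matching, as in the left part of Table 1:
# the similarity between two characters is 1 if they are equal, 0 otherwise.
query, target = "Hi Sir", "Sir"

for c1 in query:
    row = [1 if c1 == c2 else 0 for c2 in target]
    print(repr(c1), row)

# Output: both 'i' characters of "Hi Sir" score [0, 1, 0] against "Sir",
# so, without context, the alignment cannot decide which one matches.
```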
Recently, (Grauman & Darrell, 2007) introduced the "pyramid-match" kernel for object recognition and document analysis, which maps feature sets into a multi-resolution histogram representation and computes the similarity using a weighted histogram intersection. The authors showed that their function is positive definite and can be computed in time linear in the number of local features. Other matching kernels include the "dynamic programming" function, which provides an effective matching strategy for handwritten character recognition in (Bahlmann et al., 2002); nevertheless, the Mercer condition is not guaranteed.

1.1. Motivation and Contribution

The success of the second family of local kernels strongly depends on the quality of the alignments, which are difficult to obtain, mainly when images contain redundant and repeatable structures. Regardless of the Mercer condition, a naive matching kernel (such as the "max"), which looks for all the possible alignments and sums the best ones, will certainly fail and result in many false matches (see Figure 1, left). The same argument is made in (Schmid & Mohr, 1997), for the general problem of visual feature matching, about the strong spatial correlation between interest points and the corresponding close local features in the image space. This limitation also appears in closely related areas such as text analysis, and particularly string alignment. A simple example, aligning the two strings "Sir" and "Hi Sir" using the similarity measure 1_{c1=c2} between any two characters c1 and c2, shows that without any extra information about the context (i.e., the sub-string) surrounding each character, the alignment process results in false matches (see Table 1). Hence, it is necessary to make the context part of the alignment process when designing kernels.

In this paper, we introduce a new kernel, called "context-dependent" ("CDK"), defined as the fixed point of an energy function which balances an "alignment quality" term and a "neighborhood" criterion. The alignment quality is inversely proportional to the expectation of the Euclidean distance between the most likely aligned features (see Section 2), while the neighborhood criterion measures the spatial coherence of the alignments: given a pair of features (fp, fq) with a high alignment quality, the neighborhood criterion is proportional to the alignment quality of all the pairs close to (fp, fq). The general form of "CDK" captures the similarity between any two features by also incorporating their context, i.e., the similarity of the surrounding features. Our proposed kernel can be viewed as a variant of the "dynamic programming" kernel (Bahlmann et al., 2002) where, instead of the ordering assumption, we consider a neighborhood assumption which states that two points match if they have similar features and if they satisfy a neighborhood criterion, i.e., their neighbors match too. Such dependency also appears in other well-studied kernels such as the Fisher kernel (Jaakkola et al., 1999), which implements the conditional dependency between data using the Markov assumption.
"CDK" also implements such dependency with an extra advantage of being the fixed-point and the (sub)optimal solution of an energy function closely related to the goal of our application. This goal is to gather the properties of flexibility, invariance and mainly discrimination by allowing each local feature to consider its context in the matching process. Notice that the goal of this paper is not to extend local features to be global and doing so (as in (Mortensen et al., 2005; Amores et al., 2005)) makes local features less invariant, but rather to design a similarity kernel ("CDK") which captures the context while being invariant. Even though we investigate "CDK" in the particular task of ob ject recognition, we can easily extend it to handle closely related areas in machine learning such as text alignment for document retrieval (Nie et al., 1999), machine translation (Sim et al., 2007) and bioinformatics (Scholkopf et al., 2004). In the remainder of this paper we consider the following terminology and notation. A feature refers to a local interest point xp = (g (xp ), f (xp ), yp ), i i i here i stands for the ith sample of the subset Sp = {xp , . . . , xp } and yp N+ is a unique indicator n 1 which provides the class or the subset including xp . i g (xp ) R2 stands for the 2D coordinates of the i interest-point xp while f (xp ) Rs corresponds to i i the descriptor of xp (for instance the 128 coefficients i of the SIFT(Lowe, 2004)). We define X as the set of all possible features taken from all the possible images in the world and X is a random variable standing for a sample in X . We also consider kt : X × X R as a symmetric function which, given two samples (xp , xq ), j i provides a similarity measure. Other notations will be introduced as we go along through different sections of this paper which is organized as follows. We first introduce in Section 2, our energy function which makes it possible to design our context-dependent kernel and we show that this kernel satisfies the Mercer condition so we can use it for support vector machine training and other kernel methods. In Section 3 we show the application of this kernel in ob ject recognition. We discuss in Section 4 the advantages and weaknesses of this kernel and the possible extensions in order to handle other tasks such as string matching and machine translation. We conclude in Section 5 and we provide some future research directions. 2. Kernel Design Define X = pN+ Sp as the set of all possible interest points taken from all the possible ob jects in the world. We assume that all the ob jects are sampled with a given cardinality i.e., |Sp | = n, |Sq | = m, p, q N+ (n and m might be different). Our goal is to design a kernel K which provides the similarity between any two ob jects (subsets) Sp , Sq in X . Definition 1 (Subset Kernels) let X be an input space, and consider Sp , Sq X as two finite subsets of X . We define the similarity function or kernel K between Sp = {xp } and Sq = {xq } as K (Sp , Sq ) = j n m xp q i. i , xj i jk here k is symmetric and continuous on X × X , so K will also be continuous and symmetric. Since K is defined as the cross-similarity k between all the possible sample pairs taken from Sp × Sq , it is obvious that K has the big advantage of not requiring any (hard) alignment between the samples of Sp and Sq . Nevertheless, for a given Sp , Sq , the vx lue of a , i K (Sp , Sq ) should be dominated by maxj k p , xq i j so k should be appropriately designed (see Section 2.1). 
Let X be a random variable standing for samples taken from S_p, and let X' be defined in a similar way for the subset S_q. We design our kernel k(x_i^p, x_j^q) = P(X = x_i^p, X' = x_j^q) as the joint probability that x_j^q matches x_i^p. Again, it is clear enough (see Figure 1 and Table 1) that when this joint probability is estimated using only the sample descriptors (without their contexts), this may result in many false matches and a wrong estimate of P(X = x_i^p, X' = x_j^q).

Before describing the whole design of k, we start with our definition of context-dependent kernels.

Definition 2 (Context-Dependent Kernels) We define a context-dependent kernel k as any symmetric, continuous and recursive function k : X x X -> R such that k(x_i^p, x_j^q) is equal to

    c(x_i^p, x_j^q) \times h\Big( \sum_{k,\ell} V(x_i^p, x_k^p, x_j^q, x_\ell^q) \, k(x_k^p, x_\ell^q) \Big),

where c is a positive (semi-)definite and context-free (non-recursive) kernel, V(x, x', y, y') is a monotonically decreasing function of any (pseudo-)distance involving (x, x', y, y'), and h is monotonically increasing.

2.1. Approach

We consider the issue of designing k using a variational framework. Let I_p = {1, ..., n}, I_q = {1, ..., m}, mu = {k(x_i^p, x_j^q)}, d(x_i^p, x_j^q) = ||f(x_i^p) - f(x_j^q)||^2 and N_p(x_i^p) = {x_k^p in S_p : k != i, ||g(x_k^p) - g(x_i^p)||^2 <= eps_p} (eps_p defines a neighborhood, and N_q is defined in the same way for S_q). Consider alpha, beta >= 0. For (i, j) in I_p x I_q, mu = {k(x_i^p, x_j^q)} is found by solving

    \min_{\mu} \; \sum_{i \in I_p, j \in I_q} k(x_i^p, x_j^q) \, d(x_i^p, x_j^q)
    \; + \; \beta \sum_{i \in I_p, j \in I_q} k(x_i^p, x_j^q) \log k(x_i^p, x_j^q)
    \; - \; \alpha \sum_{i \in I_p, j \in I_q} k(x_i^p, x_j^q) \sum_{x_k^p \in N_p(x_i^p)} \sum_{x_\ell^q \in N_q(x_j^q)} k(x_k^p, x_\ell^q)

    \text{s.t.} \quad k(x_i^p, x_j^q) \in [0, 1], \qquad \sum_{i,j} k(x_i^p, x_j^q) = 1.            (1)

The first term measures the quality of matching two descriptors f(x_i^p), f(x_j^q). In the case of SIFT, this is the distance d(x_i^p, x_j^q) between the 128 SIFT coefficients of x_i^p and x_j^q. A high value of d(x_i^p, x_j^q) should result in a small value of k(x_i^p, x_j^q) and vice versa. The second term is a regularization criterion which considers that, without any a priori knowledge about the aligned samples, the probability distribution {k(x_i^p, x_j^q)} should be flat, so the negative of the entropy is minimized. This term also helps defining a simple solution and solving the constrained minimization problem easily. The third term is a neighborhood criterion which considers that a high value of k(x_i^p, x_j^q) should imply high kernel values in the neighborhoods N_p(x_i^p) and N_q(x_j^q). This criterion makes it possible to consider the context (spatial configuration) of each sample in the matching process. We formulate the minimization problem by adding an equality constraint and bounds which ensure that {k(x_i^p, x_j^q)} is a probability distribution.

Proposition 1 Problem (1) admits a solution in the form of a context-dependent kernel k_t(x_i^p, x_j^q) = v_t(x_i^p, x_j^q) / Z_t, with t in N+, Z_t = \sum_{i,j} v_t(x_i^p, x_j^q) and v_t(x_i^p, x_j^q) defined as

    \exp\Big(-\frac{d(x_i^p, x_j^q)}{\beta} - 1\Big) \times \exp\Big(\frac{\alpha}{\beta} \sum_{k,\ell} V(x_i^p, x_k^p, x_j^q, x_\ell^q) \, k_{t-1}(x_k^p, x_\ell^q)\Big),            (2)

which is also a Gibbs distribution.

Proof. The proof, lengthy, is omitted; it is available in a research report (Sahbi et al., 2007).

In (2), we set v_0 to any positive definite kernel (see Proposition 3) and we define V(x_i^p, x_k^p, x_j^q, x_\ell^q) as g(x_i^p, x_k^p) x g(x_j^q, x_\ell^q), where g is a decreasing function of any (pseudo-)distance involving (x_i^p, x_k^r), not necessarily symmetric. In practice, we consider g(x_i^p, x_k^r) = 1_{r = p} x 1_{x_k^r in N_p(x_i^p)}.

It is easy to see that k_t is a P-kernel on any S_p x S_q (Haussler, 1999), as the joint probability over sample pairs taken from any S_p and S_q sums to one, so the value of the subset kernel K(S_p, S_q) of Definition 1 is constant and useless. To make k_t (up to a factor) a P-kernel on X x X (and not on S_p x S_q), we cancel the equality constraint in (1), and we can prove in a similar way that k_t(x_i^p, x_j^q) is then equal to v_t(x_i^p, x_j^q), which is still a context-dependent kernel.
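To make the objective concrete, here is a short sketch (ours, under the reconstruction of (1) above) that evaluates the three terms for a candidate matching distribution mu; it assumes the matrix notation introduced in Section 2.3 below: D for descriptor distances, P and Q for the neighborhood adjacency matrices.

```python
import numpy as np

def energy(mu, D, P, Q, alpha, beta, eps=1e-12):
    """Objective (1). mu: (n, m) matching distribution; D: (n, m) descriptor
    distances; P: (n, n) and Q: (m, m) neighborhood adjacency matrices."""
    fidelity = (mu * D).sum()                       # alignment quality term
    entropy = beta * (mu * np.log(mu + eps)).sum()  # regularization term
    # (P @ mu @ Q.T)[i, j] sums mu over the neighbors (k, l) of the pair (i, j)
    context = alpha * (mu * (P @ mu @ Q.T)).sum()   # neighborhood criterion
    return fidelity + entropy - context
```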
2.2. Mercer Condition

Let X be an input space and let k_t : X x X -> R be symmetric and continuous. k_t is Mercer, i.e., positive (semi-)definite, if and only if any Gram (kernel) matrix built by restricting k_t to any finite subset of X is positive (semi-)definite. A Mercer kernel k_t guarantees the existence of a reproducing kernel Hilbert space H where k_t can be written as a dot product, i.e., there exists Phi_t : X -> H such that, for all x, x' in X, k_t(x, x') = <Phi_t(x), Phi_t(x')>.

Proposition 2 (see (S-Taylor & Cristianini, 2000)) The sum and the product of any two Mercer kernels is a Mercer kernel. The exponential of any Mercer kernel is also a Mercer kernel.

Proof. See, for instance, (S-Taylor & Cristianini, 2000).

Now, let us state our result about the positive definiteness of the "CDK" kernel.

Proposition 3 Consider g : X x X -> R, let V(x_i^p, x_k^p, x_j^q, x_\ell^q) = g(x_i^p, x_k^p) g(x_j^q, x_\ell^q), and let k_0 be positive definite. The kernel k_t is then positive definite.

Proof. Initially (t = 0), k_0 is by definition a positive definite kernel. By induction, assume k_{t-1} is a Mercer kernel, i.e., there exists Phi_{t-1} such that k_{t-1}(x, x') = <Phi_{t-1}(x), Phi_{t-1}(x')> for all x, x' in X. It is then sufficient to show that (x, x') -> \sum_{y,y'} V(x, y, x', y') k_{t-1}(y, y') is also a Mercer kernel; by the closure of the product and the exponential (see Proposition 2), k_t will then be Mercer. We need to show that, for all x_1, ..., x_d in X and c_1, ..., c_d in R,

    (*) = \sum_{i,j} c_i c_j \sum_{y,y'} V(x_i, y, x_j, y') \, k_{t-1}(y, y') \geq 0.

We have

    (*) = \sum_{y,y'} \Big( \sum_i c_i \, g(x_i, y) \Big) \Big( \sum_j c_j \, g(x_j, y') \Big) k_{t-1}(y, y')
        = \sum_{y,y'} C_y \, C_{y'} \, \langle \Phi_{t-1}(y), \Phi_{t-1}(y') \rangle
        = \Big\| \sum_y C_y \, \Phi_{t-1}(y) \Big\|^2 \geq 0,

with C_y = \sum_i c_i g(x_i, y).

Corollary 1 The subset kernel K of Definition 1 is also a Mercer kernel.

Proof. The proof is straightforward for the particular case n = m. As k_t(x_i^p, x_j^q) = <Phi_t(x_i^p), Phi_t(x_j^q)>, we can write K(S_p, S_q) = \sum_{i,j} <Phi_t(x_i^p), Phi_t(x_j^q)> = <\sum_i Phi_t(x_i^p), \sum_j Phi_t(x_j^q)>, which corresponds to a dot product in some Hilbert space. The proof for the general case of finite subsets of any length can be found in (S-Taylor & Cristianini, 2000).

2.3. Algorithm and Setting

The factor beta in k_t acts as a scale parameter and is selected as the expected distance between close pairs of samples:

    \beta = \mathbb{E}_r \, \mathbb{E}_{\{X_1^r, X_2^r \,:\, d(X_1^r, X_2^r) \leq \epsilon\}} \big[ d(X_1^r, X_2^r) \big],            (3)

where E denotes the expectation, X_1^r (also X_2^r) denotes a random variable standing for samples in S_r, and epsilon is a fixed distance threshold.

The coefficient alpha controls the tradeoff between the alignment quality and the neighborhood criterion. It is selected by cross-validation and should guarantee k_t(x_i^p, x_j^q) in [0, 1]. If A = sup_{i,j} \sum_{k,\ell} g(x_i^p, x_k^p) x g(x_j^q, x_\ell^q), then alpha should be selected in [0, beta/(2A)].
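A possible way of setting these two parameters in code (our sketch, under the reconstruction of (3) above; it uses the adjacency matrices P and Q defined just below, and the sum defining A factorizes exactly into a product of maximal row sums):

```python
import numpy as np

def select_beta(D, eps):
    """In the spirit of (3): the mean of the pairwise distances in D that
    fall below the threshold eps (0.1 in the experiments of Section 3)."""
    close = D[D <= eps]
    return close.mean() if close.size else D.mean()

def select_alpha(P, Q, beta):
    """Upper end of the admissible range [0, beta/(2A)], with
    A = sup_{i,j} sum_{k,l} g(x_i^p, x_k^p) g(x_j^q, x_l^q)."""
    A = P.sum(axis=1).max() * Q.sum(axis=1).max()
    return beta / (2.0 * A)
```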
Figure 1. This figure shows a comparison of the matching results when using a naive matching strategy without geometry and our "context-dependent" kernel matching. The top figures show the distribution of the kernel values k(x_i, x_j), j in I_q, using a context-free kernel (left) and our "CDK" kernel (right). We can clearly see that the highest value changes its location, so the matching results are now corrected (as shown for one particular match and for multiple matches in the bottom figures).

Consider P, Q as the intrinsic adjacency matrices of S_p and S_q respectively, defined as P_{i,k} = g(x_i^p, x_k^p) and Q_{j,\ell} = g(x_j^q, x_\ell^q). Let U denote the unit matrix and consider D_{i,j} = d(x_i^p, x_j^q) and mu_{i,j}^(t) = k_t(x_i^p, x_j^q). Now, mu_{i,j}^(t) is iteratively found using Algorithm (CDK) (see Table 2) and converges to a fixed point (see Section 2.4).

Figure 2. This figure shows the evolution of context-dependent silhouette matching on the Swedish set for different, increasing values of alpha. We clearly see that when alpha increases the matching results get better. We set epsilon = 0.1 and t = 1.

Algorithm (CDK)
  Initialization: set beta using (3) and alpha in [0, beta/(2A)];
                  set mu^(0) <- k_0, t <- 0.
  Repeat until t >= T_max or ||mu^(t) - mu^(t-1)||_2 ~ 0:
      mu^(t) <- exp( -D/beta + (alpha/beta) P mu^(t-1) Q^T - U )
      t <- t + 1

Table 2. The "CDK" kernel evaluation.

2.4. Convergence

Let us assume 0 <= g <= 1, and recall that mu^(t) in R^{n x m} is the vector of components mu_{i,j}^(t) = k_t(x_i^p, x_j^q). Introduce the mapping f : R^{n x m} -> R^{n x m} defined by its components

    f_{i,j}(v) = \exp\Big(-\frac{d(x_i^p, x_j^q)}{\beta} - 1 + \frac{\alpha}{\beta} \sum_{k,\ell} g(x_i^p, x_k^p) \, g(x_j^q, x_\ell^q) \, v_{k,\ell}\Big).

By construction of the kernel k_t, we have mu^(t) = f(mu^(t-1)). Let A and B satisfy

    \sup_{1 \leq i \leq n, \, 1 \leq j \leq m} \sum_{k,\ell} g(x_i^p, x_k^p) \, g(x_j^q, x_\ell^q) \leq A,            (4)

    \sum_{i,j} \exp\Big(1 - \frac{d(x_i^p, x_j^q)}{\beta}\Big) \leq B.            (5)

Consider L = (2 alpha B / beta) exp(alpha A / beta), and let B_inf = {v in R^{n x m} : |v_{i,j}| <= 1, 1 <= i <= n, 1 <= j <= m} be the ||.||_inf-ball of radius 1. Finally, let ||.||_1 denote the 1-norm on R^{n x m}: ||u||_1 = \sum_{1 <= i <= n, 1 <= j <= m} |u_{i,j}|.

Proposition 4 If ||mu^(0)||_inf <= 1 and alpha <= beta/(2A), then we have f(B_inf) contained in B_inf and, on B_inf, f is L-Lipschitz for the norm ||.||_1. In particular, if L < 1, then there exists a unique v~ in B_inf such that f(v~) = v~, and the sequence (mu^(t)) satisfies

    \|\mu^{(t)} - \tilde{v}\|_1 \leq L^t \, \|\mu^{(0)} - \tilde{v}\|_1 \longrightarrow 0 \quad (t \to +\infty).            (6)

Proof. The first assertion is proved by induction, by checking that for ||v||_inf <= 1 we have

    f_{i,j}(v) \leq \exp\Big(-1 + \frac{\alpha}{\beta} \sum_{k,\ell} g(x_i^p, x_k^p) \, g(x_j^q, x_\ell^q)\Big) \leq \exp\Big(-1 + \frac{\alpha A}{\beta}\Big) \leq 1.

For the second assertion, note that for any v, v' in B_inf, |df_{i,j}/dv_{k,\ell}(v)| <= (alpha/beta) exp(alpha A / beta) g(x_i^p, x_k^p) g(x_j^q, x_\ell^q) exp(-1 - d(x_i^p, x_j^q)/beta), so that

    |f_{i,j}(v) - f_{i,j}(v')| \leq \frac{\alpha}{\beta} \exp\Big(\frac{\alpha A}{\beta}\Big) \exp\Big(-1 - \frac{d(x_i^p, x_j^q)}{\beta}\Big) \sum_{k,\ell} g(x_i^p, x_k^p) \, g(x_j^q, x_\ell^q) \, |v_{k,\ell} - v'_{k,\ell}|.

Summing over (i, j) and using (4), (5) and 0 <= g <= 1 gives ||f(v) - f(v')||_1 <= L ||v - v'||_1, which proves the second assertion. The last assertion directly comes from the fixed-point theorem.
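A compact NumPy sketch of Algorithm (CDK) as reconstructed in Table 2 (our code; k0 stands for any positive definite initialization, e.g. the Gram matrix of a context-free kernel):

```python
import numpy as np

def cdk(D, P, Q, alpha, beta, k0, t_max=10, tol=1e-6):
    """Iterate mu^(t) = exp(-D/beta + (alpha/beta) P mu^(t-1) Q^T - U)
    until t >= T_max or ||mu^(t) - mu^(t-1)||_2 is (close to) 0.
    D: (n, m) descriptor distances; P: (n, n), Q: (m, m) adjacency
    matrices; U, the unit matrix, enters as the constant 1."""
    mu = k0                                   # positive definite initialization
    for _ in range(t_max):
        mu_next = np.exp(-D / beta + (alpha / beta) * (P @ mu @ Q.T) - 1.0)
        if np.linalg.norm(mu_next - mu) <= tol:
            return mu_next                    # fixed point reached
        mu = mu_next
    return mu
```

Note that with alpha = 0 the update reduces, up to the constant factor exp(-1), to the context-free Gaussian similarity exp(-D/beta), i.e., the t = 0 baseline used in Section 3.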
For this purpose, we trained a "one-versus-all" SVM classifier for each class in both MNIST and Swedish using x ) the subset kernel K (Sp , Sq ) = Sp ,x Sq kt (x, x . The performance are measured, on different test sets, using n-fold cross-validation error (n = 5). We remind that is set using (3) as the left-hand side of kt corresponds to the Gaussian kernel with scale . In practice, = 0.1. The influence (and the performance) of the right-hand side of kt increases as increases (see. Figure 2), nevertheless the convergence of kt to a fixed point is guaranteed only if [0, 2A ]. Therefore, it becomes obvious that should be set k q q p p to 2A where A = supi,j , g (xi , xk ) × g (xj , x ) (in practice, 0 g 1 and A = 1). Table 3 shows the 5-fold cross validation errors on MNIST and Swedish for different iterations; we clearly see the out-performance and the improvement of the "CDK" kernel (kt , t N+ ) with respect to the contextfree kernels used for initialization (k0 = kl or kp .) Initialization Iterations (MNIST) k0 k1 k2 k3 k4 Iterations (Swedish) k0 k1 k2 Linear Polynomial 11.4± 8.80± 6.90± 6.90± 6.90± 4.42 4.77 3.55 3.41 3.41 9.15 ± 4.63 5.6 ± 2.72 5.8 ± 2.36 5.2 ± 2.07 5.2 ± 2.07 6.53± 6.34 3.33± 2.73 3.33± 2.73 11.7± 2.88 6.00± 2.30 3.06± 1.88 Table 3. This table shows the mean and the standard deviation of the 5-fold error on the MNIST (top) and Swedish (bottom) databases. We can see a clear and a consistent gain through different iterations and also the convergence of the errors. & Jones, 2001)) will reduce this complexity to O (s). Finally, the out-performance of our kernel comes essentially from the inclusion of the context. This strongly improves the precision and helps including the intrinsic properties (geometry) of ob jects. Even though tested only on visual ob ject recognition, our kernel can be extended to many other pattern analysis problems such as bioinformatics, speech and text. For instance, in text analysis and particularity machine translation (Sim et al., 2007), the design of a similarity kernel between words in two different languages, can be achieved using any standard dictionary. Of course, the latter defines similarity between any two words (we , wf ) independently from their bilingual training text (or bitext), i.e., the phrases where (we , wf ) might appear and this results into bad translation performances. A better estimate of similarity between two words (we , wf ), can be achieved using their context i.e., the set of words which cooccur frequently with (we , wf ) (Koehn et al., 2003). 4. Remarks and Discussion The adjacency matrix P , in kt , provides the intrinsic properties and also characterizes the geometry of an ob ject Sp . Let us remind Np (xp ) = {xp Sp : k = i, g (xp ) - g (xp ) 2 p} i i k k and Pi,j = 1{xq Np (xp )} . It is easy to see that P is j i translation and rotation invariant and can also be made scale invariant when p is adapted to the scale of g (xp ). It follows that the right-hand side of our keri nel is invariant to any 2D similarity transformation. Notice, also, that the left-hand side of kt involves similarity invariant descriptors f (xp ), f (xq ) so kt j i (and K ) is similarity invariant. One current limitation of our kernel kt resides in its evaluation complexity. Assuming kt-1 known, for a given pair xp , xq , this complexity i j m , is O ax(N 2 , s) where s is the dimension of f (xp ) i and N = maxi,p #{Np (xp )}. 
4. Remarks and Discussion

The adjacency matrix P in k_t provides the intrinsic properties, and also characterizes the geometry, of an object S_p. Recall that N_p(x_i^p) = {x_k^p in S_p : k != i, ||g(x_k^p) - g(x_i^p)||^2 <= eps_p} and P_{i,j} = 1_{x_j^p in N_p(x_i^p)}. It is easy to see that P is translation and rotation invariant, and it can also be made scale invariant when eps_p is adapted to the scale of g(x_i^p). It follows that the right-hand side of our kernel is invariant to any 2D similarity transformation. Notice also that the left-hand side of k_t involves similarity-invariant descriptors f(x_i^p), f(x_j^q), so k_t (and K) is similarity invariant.

One current limitation of our kernel k_t resides in its evaluation complexity. Assuming k_{t-1} known, for a given pair (x_i^p, x_j^q) this complexity is O(max(N^2, s)), where s is the dimension of f(x_i^p) and N = max_{i,p} #N_p(x_i^p). It is clear enough that when N < s, the complexity of evaluating our kernel is strictly equivalent to that of usual kernels such as the linear one. Nevertheless, the worst case (N >= s) makes our kernel evaluation prohibitive, and this is mainly due to the right-hand side of k_t(x_i^p, x_j^q), which requires the evaluation of kernel sums in a hypercube of dimension 4. A simple and straightforward generalization of the integral image (see for instance (Viola & Jones, 2001)) will reduce this complexity to O(s).

Finally, the out-performance of our kernel comes essentially from the inclusion of the context. This strongly improves the precision and helps including the intrinsic properties (geometry) of objects. Even though it was tested only on visual object recognition, our kernel can be extended to many other pattern analysis problems such as bioinformatics, speech and text. For instance, in text analysis, and particularly machine translation (Sim et al., 2007), the design of a similarity kernel between words in two different languages can be achieved using any standard dictionary. Of course, the latter defines the similarity between any two words (w_e, w_f) independently of their bilingual training text (or bitext), i.e., the phrases where (w_e, w_f) might appear, and this results in bad translation performance. A better estimate of the similarity between two words (w_e, w_f) can be achieved using their context, i.e., the set of words which co-occur frequently with (w_e, w_f) (Koehn et al., 2003).

5. Conclusion

We introduced in this paper a new type of kernel referred to as context-dependent. Its strength resides in the improvement of the alignments between interest points, which is considered a preliminary step in order to increase the robustness and the precision of object recognition. We have also shown that our kernel is Mercer and applicable to SVM learning. The latter, carried out on shape recognition problems, shows better performance than SVMs with context-free kernels. Future work includes the comparison of our kernel with other context-free kernels and its application to scene and object understanding using other standard databases.
Acknowledgements

This work was supported in part by the Agence Nationale de la Recherche, project "Modèles Graphiques et Applications", and the project "Infom@gic".

References

Amores, J., Sebe, N., & Radeva, P. (2005). Fast spatial pattern discovery integrating boosting with constellations of contextual descriptors. In CVPR.

Bahlmann, C., Haasdonk, B., & Burkhardt, H. (2002). Online handwriting recognition with support vector machines, a kernel approach. IWFHR (pp. 49-54).

Barla, A., Odone, F., & Verri, A. (2002). Hausdorff kernel for 3D object acquisition and detection. ECCV, LNCS 2353 (pp. 20-33).

Belongie, S., Malik, J., & Puzicha, J. (2000). Shape context: A new descriptor for shape matching and object recognition. In NIPS.

Bishop, C. (2007). Pattern recognition and machine learning. Springer.

Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Fifth Annual ACM Workshop on Computational Learning Theory, Pittsburgh, 144-152.

Boughorbel, S. (2005). Kernels for image classification with support vector machines. Doctoral dissertation, Faculté d'Orsay.

Chapelle, O., Haffner, P., & Vapnik, V. (1999). SVMs for histogram-based image classification. Transactions on Neural Networks, 10.

Cuturi, M. (2005). Etude de noyaux de semigroupe pour objets structurés dans le cadre de l'apprentissage statistique. Doctoral dissertation, ENSMP.

Everingham, M., Gool, L. V., Williams, C., Winn, J., & Zisserman, A. (2007). The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/.

Gartner, T. (2003). A survey of kernels for structured data. Multi Relational Data Mining, 5, 49-58.

Grauman, K., & Darrell, T. (2007). The pyramid match kernel: Efficient learning with sets of features. In JMLR, 8, 725-760.

Harris, C., & Stephens, M. (1988). A combined corner and edge detector. Alvey Vision Conference (pp. 147-151).

Haussler, D. (1999). Convolution kernels on discrete structures. Technical Report UCS-CRL-99-10, UC Santa Cruz.

Jaakkola, T., Diekhans, M., & Haussler, D. (1999). Using the Fisher kernel method to detect remote protein homologies. ISMB (pp. 149-158).

Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of HLT/NAACL.

Kondor, R., & Jebara, T. (2003). A kernel between sets of vectors. In Proceedings of ICML.

Lazebnik, S., Schmid, C., & Ponce, J. (2004). Semi-local affine parts for object recognition. In BMVC.

Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. In IJCV, 60, 91-110.

Lyu, S. (2005). Mercer kernels for object recognition with local features. In Proceedings of CVPR.

Moreno, P., Ho, P., & Vasconcelos, N. (2003). A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. In NIPS.

Mortensen, E., Deng, H., & Shapiro, L. (2005). A SIFT descriptor with global context. In CVPR (pp. 184-190).

Nie, J.-Y., Simard, M., Isabelle, P., & Durand, R. (1999). Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the web. Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval.

S-Taylor, J., & Cristianini, N. (2000). Support vector machines and other kernel-based learning methods. Cambridge University Press.

Sahbi, H., Audibert, J.-Y., Rabarisoa, J., & Keriven, R. (2007). Context-dependent kernel design for object matching and recognition. Research Report N 2007D018, ENST Paris, ParisTech.

Schmid, C., & Mohr, R. (1997). Local greyvalue invariants for image retrieval. In PAMI, 19, 530-535.

Scholkopf, B., Tsuda, K., & Vert, J.-P. (2004). Kernel methods in computational biology. MIT Press.

Shashua, A., & Hazan, T. (2004). Algebraic set kernels with application to inference over local image representations. In NIPS.

Sim, K., Byrne, W., Gales, M., Sahbi, H., & Woodland, P. (2007). Consensus network decoding for statistical machine translation system combination. In ICASSP.

Swain, M., & Ballard, D. (1991). Color indexing. International Journal of Computer Vision, 7, 11-32.

Vapnik, V. N. (1998). Statistical learning theory. A Wiley-Interscience Publication.

Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In CVPR.

Wallraven, C., Caputo, B., & Graf, A. (2003). Recognition with local features: the kernel recipe. ICCV (pp. 257-264).

Wolf, L., & Shashua, A. (2003). Learning over sets using kernel principal angles. In JMLR, 4, 913-931.