A Generalization of Haussler's Convolution Kernel -- Mapping Kernel

Kilho Shin, Carnegie Mellon CyLab Japan, 3-3 Higashi-kawasaki-cho, Chuo-ku, Kobe, Hyogo, Japan. yshin@cmuj.jp
Tetsuji Kuboyama, Gakushuin University, Mejiro, Toshima-ku, Tokyo, Japan. kuboyama@gakushuin.ac.jp

Abstract

Haussler's convolution kernel provides a successful framework for engineering new positive semidefinite kernels, and has been applied to a wide range of data types and applications. In the framework, each data object represents a finite set of finer-grained components. Then, Haussler's convolution kernel takes a pair of data objects as input, and returns the sum of the return values of the predetermined primitive positive semidefinite kernel calculated for all the possible pairs of the components of the input data objects. On the other hand, the mapping kernel that we introduce in this paper is a natural generalization of Haussler's convolution kernel, in that the input to the primitive kernel moves over a predetermined subset rather than the entire cross product. Although we have plural instances of the mapping kernel in the literature, their positive semidefiniteness was investigated in case-by-case manners, and worse yet, was sometimes incorrectly concluded. In fact, there exists a simple and easily checkable necessary and sufficient condition, which is generic in the sense that it enables us to investigate the positive semidefiniteness of an arbitrary instance of the mapping kernel. This is the first paper that presents and proves the validity of the condition. In addition, we introduce two important instances of the mapping kernel, which we refer to as the size-of-index-structure-distribution kernel and the edit-cost-distribution kernel. Both of them are naturally derived from well-known (dis)similarity measurements in the literature (e.g. the maximum agreement subtree, the edit distance), and are reasonably expected to improve the performance of the existing measures by evaluating their distributional features rather than their peak (maximum/minimum) features.

1. Introduction

Haussler's convolution kernel (Haussler, 1999) has been used as a general framework to tailor known primitive kernels to the context of specific applications. In this section, we first review a degenerate form of Haussler's convolution kernel, which proves in fact to be equivalent to the general form of Haussler's convolution kernel (see 2.2). Let each data point x in a space X be associated with a finite subset x̂ of a common space X′. Furthermore, we assume that a kernel k : X′ × X′ → R is given. Then, Haussler's convolution kernel K : X × X → R is defined as follows (see 2.2).

K(x, y) = \sum_{(x', y') \in \hat{x} \times \hat{y}} k(x', y')    (1)

Haussler proved that, if k(x′, y′) is positive semidefinite, then so is K(x, y). Haussler's convolution kernel is known to have a wide range of applications (Lodhi et al., 2001; Collins & Duffy, 2001; Suzuki et al., 2004). On the other hand, the mapping kernel is a natural generalization of Haussler's convolution kernel, and is defined by Eq. (2) for a family {M_{x,y} ⊆ x̂ × ŷ | (x, y) ∈ X²}. The problem that the present paper addresses is to determine whether the mapping kernel is positive semidefinite.

K(x, y) = \sum_{(x', y') \in M_{x,y}} k(x', y')    (2)
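To make the contrast concrete, the following minimal sketch (not from the paper; the component sets and the exact-match primitive kernel are assumptions chosen only for illustration) computes Eq. (1) over the full cross product and Eq. (2) over a designated subset M_{x,y}.

```python
# A minimal illustration (not from the paper) of Eqs. (1) and (2):
# components are plain strings, and the primitive kernel k is a toy
# exact-match kernel; both choices are assumptions for the example only.

def k(a, b):
    return 1.0 if a == b else 0.0

def convolution_kernel(x_hat, y_hat):
    # Eq. (1): sum k over the full cross product of the component sets.
    return sum(k(a, b) for a in x_hat for b in y_hat)

def mapping_kernel(x_hat, y_hat, M):
    # Eq. (2): sum k only over the designated subset M of x_hat x y_hat.
    return sum(k(a, b) for (a, b) in M)

x_hat = {"a", "b", "c"}
y_hat = {"b", "c", "d"}
M = {("b", "b"), ("c", "d")}              # some subset of x_hat x y_hat

print(convolution_kernel(x_hat, y_hat))   # 2.0
print(mapping_kernel(x_hat, y_hat, M))    # 1.0
```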
The main contribution of the present paper is to present a necessary and sufficient condition for the mapping kernel K(x, y) defined by Eq. (2) to be positive semidefinite for all possible choices of positive semidefinite k(x′, y′). More specifically, we prove that the condition is that the mapping system {M_{x,y} | (x, y) ∈ X²} is transitive, i.e., (x′, y′) ∈ M_{x,y} ∧ (y′, z′) ∈ M_{y,z} ⇒ (x′, z′) ∈ M_{x,z}. Haussler's convolution kernel is indeed the special case of the mapping kernel for {M_{x,y} = x̂ × ŷ}, which is obviously transitive.

We see plural instances of the mapping kernel in the literature, and some of them were mistreated.
· Although the elastic tree kernel (Kashima & Koyanagi, 2002) was treated as an instance of Haussler's convolution kernel, it is, in fact, an instance of the mapping kernel. Therefore, the positive semidefiniteness of the kernel should not have been determined based on Haussler's theorem.
· The codon-improved kernel (Zien et al., 2000) was claimed to be unconditionally positive semidefinite, since it was viewed as an instance of the polynomial kernel. The kernel, in fact, is an instance of the mapping kernel under certain settings of weights.
That is to say, the positive semidefiniteness of the aforementioned kernels was concluded on wrong grounds, and in fact, the conclusion regarding the codon-improved kernel is wrong -- in reality, it is not necessarily positive semidefinite. The kernels introduced in (Menchetti et al., 2005) and (Kuboyama et al., 2006) are also instances of the mapping kernel. In contrast to the elastic and codon-improved kernels, their positive semidefiniteness was properly investigated, albeit in specific manners.

This is the first paper that recognizes the mapping kernel as a generic class of kernels, and presents a necessary and sufficient condition for a mapping kernel to be positive semidefinite. Furthermore, the condition is simple, intuitive and easy to check, and therefore would make the engineering of new instances of the mapping kernel easier, more efficient and more effective to a large extent.

As the second contribution of the present paper, we take advantage of the mapping kernel, and present a way to augment a couple of well-known frameworks for engineering similarity functions for discretely structured objects (e.g. strings, trees, general graphs). It is known that the maximum sizes of shared substructures of the objects can be used as a good measure of similarity of the objects; the maximum agreement subtree is a good example. Also, the edit distance has been applied to various types of objects. An edit distance between two data objects is generally defined as the minimum cost of edit scripts that transform one object into the other. These two frameworks are common in that they only focus on the maximum/minimum values of the similarity measures (i.e. the sizes of shared substructures and the costs of edit scripts), and therefore, only those substructures with the maximum sizes or those edit scripts with the minimum costs can contribute to the similarity functions. It is, however, reasonably presumable that distributional features of the measurements may carry useful information with regard to similarities of objects, and more accurate similarity functions can be engineered by evaluating the distributional features. Based on the aforementioned consideration, we introduce two novel classes of kernels (similarity functions), each evaluating the distributional features of the sizes of shared substructures or the costs of edit scripts. Also, we show a general way to view them as mapping kernels.
By virtue of our simple criterion for positive semidefinite mapping kernels, we can easily determine whether instances of the new kernel classes are positive semidefinite, and, if they are, we can take advantage of sophisticated classifiers such as support vector machines (SVM). In 3.1 and 3.2, we see that the examples of distribution-based similarity functions derived from maximum agreement subtrees and general tree edit distances are positive semidefinite, while those derived from maximum refinement trees (Hein et al., 1996) and the less-constrained tree edit distance (Lu et al., 2001) are not.

2. The Mapping Kernel

In this section, as a preliminary, we quickly review the positive semidefinite kernel (2.1) and Haussler's convolution kernel (2.2). Then, we describe our main theorem with regard to the mapping kernel (2.3).

2.1. The Positive Semidefinite Kernel

A kernel K : X × X → R is said to be positive semidefinite, if, and only if, for arbitrary x_1, ..., x_n ∈ X, the corresponding Gram matrix G = [K(x_i, x_j)]_{i,j=1,...,n} is a positive semidefinite matrix. Positive semidefiniteness of kernels is a critical condition for reproducing kernel Hilbert spaces to exist. In simpler cases where the data point space X is finite, this condition is equivalent to the property that there exists a mapping Φ : X → R^N such that K(x, y) = Φ(x) Φ(y)^T.

In this paper, by a positive semidefinite matrix, we mean a real symmetric matrix (i.e. A^T = A) that satisfies one of, hence all of, the mutually equivalent conditions stated below, where dim A = n.
· (c_1, ..., c_n) A (c_1, ..., c_n)^T ≥ 0 for any (c_1, ..., c_n) ∈ R^n.
· A has only non-negative real eigenvalues.
· There exists an n-dimensional orthogonal matrix P (i.e. P^T P = E_n) such that P^T A P is a diagonal matrix with non-negative elements.
· A = B^T B for some m × n real matrix B.

2.2. Haussler's Convolution Kernel

Haussler's theorem (Haussler, 1999, Theorem 1) asserts the positive semidefiniteness of Haussler's R-convolution kernel, and Theorem 1 presents its special case for D = 1.

Theorem 1. Let k : X′ × X′ → R be a positive semidefinite kernel. Given a relation R ⊆ X′ × X, the kernel K : X × X → R defined by Eq. (3) is also positive semidefinite.

K(x, y) = \sum_{(x', x) \in R} \; \sum_{(y', y) \in R} k(x', y')    (3)

It is interesting to note that Haussler's theorem for D > 1 is obtained as a corollary to Theorem 1.

Corollary 1. (Haussler, 1999) Let k_d : X_d × X_d → R be positive semidefinite kernels for d = 1, ..., D. Given a relation R ⊆ X_1 × ··· × X_D × X, the kernel K : X × X → R defined below is also positive semidefinite.

K(x, y) = \sum_{(x_1, ..., x_D, x) \in R} \; \sum_{(y_1, ..., y_D, y) \in R} \; \prod_{d=1}^{D} k_d(x_d, y_d)

2.3. Definition and Main Theorem

Letting x̂ denote {x′ | (x′, x) ∈ R}, Eq. (1) gives an equivalent form of Eq. (3). On the other hand, the mapping kernel is defined so that (x′, y′) moves over a subset M_{x,y} of x̂ × ŷ rather than the entire cross product x̂ × ŷ (Eq. (2)). The present paper shows that the mapping kernel is positive semidefinite for all possible choices of positive semidefinite underlying kernels k, if, and only if, {M_{x,y} | x, y ∈ X} is transitive (Definition 2). Therefore, for an arbitrary non-transitive {M_{x,y}}, a positive semidefinite underlying kernel k(x′, y′) exists such that the resulting K(x, y) is not positive semidefinite (4.3). On the other hand, K(x, y) may be positive semidefinite even for a non-transitive {M_{x,y}} and a positive semidefinite k(x′, y′) (Example 1).

Example 1. The (k, m)-mismatch kernel K_{(k,m)}(x, y) is positive semidefinite (Leslie et al., 2004). When x̂ and ŷ denote the sets of k-mers in x and y, K_{(k,m)}(x, y) can be regarded as a mapping kernel for the non-transitive {M_{x,y}} defined as follows.

M_{x,y} = \{ (x', y') \in \hat{x} \times \hat{y} \mid K_{(k,m)}(x', y') \neq 0 \}

K_{(k,m)}(x, y) = \sum_{(x', y') \in M_{x,y}} K_{(k,m)}(x', y')

The result is formalized as follows.

Definition 1. A mapping system M is a triplet (X, {x̂ | x ∈ X}, {M_{x,y} ⊆ x̂ × ŷ | (x, y) ∈ X²}) such that |M_{x,y}| < ∞ and (y′, x′) ∈ M_{y,x} if (x′, y′) ∈ M_{x,y}.

Definition 2. A mapping system (X, {x̂}, {M_{x,y}}) is said to be transitive, if, and only if, (x′_1, x′_2) ∈ M_{x_1,x_2} ∧ (x′_2, x′_3) ∈ M_{x_2,x_3} ⇒ (x′_1, x′_3) ∈ M_{x_1,x_3} holds for arbitrary x_i ∈ X and x′_i ∈ x̂_i (i = 1, 2, 3).

Definition 3. An evaluating system E for a mapping system (X, {x̂}, {M_{x,y}}) is a triplet (X″, k, {π_x | x ∈ X}) with a positive semidefinite underlying kernel k : X″ × X″ → R and projections π_x : x̂ → X″.

Definition 4. For a mapping system M = (X, {x̂}, {M_{x,y}}) and an evaluating system E = (X″, k, {π_x}) for M, the mapping kernel with respect to M and E is defined by Eq. (4).

K(x, y) = \sum_{(x', y') \in M_{x,y}} k(\pi_x(x'), \pi_y(y'))    (4)

Now, our main theorem is described as follows, and its proof is given in Section 4.

Theorem 2. For a mapping system M, the following are equivalent to each other.
1. M is transitive.
2. For an arbitrary evaluating system E for M, the mapping kernel with respect to M and E is positive semidefinite.
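As an illustration of Theorem 2, the following sketch builds a small, admittedly artificial mapping system (the components of a string are its characters, paired whenever they are equal, with identity projections), checks the transitivity condition of Definition 2, computes the Gram matrix of the mapping kernel of Eq. (4), and confirms that its eigenvalues are non-negative. The data and the underlying kernel are made-up assumptions, not taken from the paper.

```python
# A minimal numerical illustration of Theorem 2 (toy data, toy kernel).
import numpy as np
from itertools import product

objects = ["ab", "bc", "abc"]                 # the data space X
hat = {x: set(x) for x in objects}            # x_hat = set of characters of x

def k(a, b):                                  # a toy PSD underlying kernel
    return 1.0 if a == b else 0.0

# A transitive mapping system: pair components only when they are equal.
M = {(x, y): {(a, b) for a, b in product(hat[x], hat[y]) if a == b}
     for x, y in product(objects, objects)}

def is_transitive(M):
    # Definition 2, checked by brute force over all object/component triples.
    for x, y, z in product(objects, repeat=3):
        for (a, b) in M[(x, y)]:
            for (b2, c) in M[(y, z)]:
                if b == b2 and (a, c) not in M[(x, z)]:
                    return False
    return True

def mapping_kernel(x, y):                     # Eq. (4), with identity projections
    return sum(k(a, b) for (a, b) in M[(x, y)])

G = np.array([[mapping_kernel(x, y) for y in objects] for x in objects])
print(is_transitive(M))                       # True
print(np.linalg.eigvalsh(G) >= -1e-9)         # all eigenvalues non-negative
```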
It is possible to prove (1) ⇒ (2) of Theorem 2 as a corollary to Theorem 1. Nevertheless, our direction in the present paper is the opposite -- we would like to view Theorem 1 as a trivial corollary to Theorem 2. In fact, we will prove Theorem 2 without assuming Theorem 1 in Section 4.

3. Similarity Functions Based on Distributions

In this section, we introduce two new classes of the mapping kernel. The kernels are expected to improve the classification performance of known similarity measurements by evaluating their distributional features.

3.1. Size-of-index-structure-distribution Kernels

When some structures are commonly derived from two data objects, the structures may carry information with regard to similarities between the data objects. In this paper, we call such structures index structures. The agreement subtree is a good example of an index structure, when data objects are represented as trees. An agreement subtree between plural input trees is usually defined as a subtree homeomorphically included in all the input trees (Berry & Nicolas, 2004). In the present paper, we assume that the input trees are a pair of trees. Even when we fix the input tree pair, there may exist more than one agreement subtree, and the maximum size of the agreement subtrees can be naturally viewed as a measure of similarity between the input trees. The maximum agreement subtree (MAST) problem is the problem of determining at least one agreement subtree with the maximum size among the possible agreement subtrees for the input trees. The MAST problem has been extensively studied from the application point of view (e.g. evolutionary trees (Hein et al., 1996; Berry & Nicolas, 2004), shape-axis trees (Pelillo, 2002)) as well as from the algorithm efficiency point of view. When using the size of the maximum agreement subtrees as a similarity measurement between trees, we discard those agreement subtrees smaller in size than the maximum ones, and therefore, they do not contribute to the final evaluation at all.
It is, however, reasonable to think that distributional features of the sizes of agreement subtrees may carry useful information with regard to similarities of the trees. Based on the aforesaid consideration, we introduce the kernel of Eq. (5), which evaluates distributional features of the sizes of agreement subtrees. In Eq. (5), we let AST(x, y) denote the set of the agreement subtrees between x and y, and f : N → R_+ = {y ∈ R | y ≥ 0} be an increasing function.

K(x, y) = \sum_{t \in \mathrm{AST}(x, y)} f(\mathrm{sizeof}(t))    (5)

If x and y are rooted trees of bounded degree, and if f(n) = n or f(n) = α^n, for example, there exist efficient polynomial-time algorithms to calculate K(x, y). Beside the advantages due to the distributional features, the kernel could provide the advantage of using sophisticated classifiers such as SVM (Cristianini & Shawe-Taylor, 2000). In fact, our contribution asserts that K(x, y) is positive semidefinite, as follows. First, K(x, y) can be viewed as a mapping kernel under the following notation.
· x̂ is the set of the subtrees of x.
· M_{x,y} = {(x′, y′) ∈ x̂ × ŷ | x′ ≅ y′}, where x′ ≅ y′ means that they are homeomorphic as trees.
· k(x′, y′) = f(sizeof(x′)) if sizeof(x′) = sizeof(y′), and k(x′, y′) = 0 otherwise.

It is easy to see that {M_{x,y}} is transitive and k(x′, y′) is positive semidefinite. Hence,

K(x, y) = \sum_{t \in \mathrm{AST}(x, y)} f(\mathrm{sizeof}(t)) = \sum_{(x', y') \in M_{x,y}} k(x', y')

is positive semidefinite by Theorem 2.

Besides the maximum agreement subtree, the maximum refinement subtree (Hein et al., 1996; Berry & Nicolas, 2004), maximum subtree isomorphism (Pelillo, 2002; Aoki et al., 2003) and maximum agreement supertree (Jansson et al., 2005) are also used as index structures for trees. As for general graphs, the maximal common clique included in an input pair of graphs is also studied in association with MAST in (Pelillo, 2002). For each of those index structures, we can define kernels in the same way as for MAST: we have only to replace AST(x, y) in Eq. (5) with the set of the respective index structures. Moreover, except for the maximum refinement subtree, through the same discussion as for MAST, the kernels prove to be positive semidefinite.

Interestingly, Theorem 2 also implies that the kernels defined based on refinement subtrees are not necessarily positive semidefinite. The minimum refinement subtree for x and y is defined as the minimum tree t such that both x and y can be derived from t through a sequence of edge contractions, and the maximum refinement subtree problem (a.k.a. the maximum compatible tree problem) is the problem of finding a minimum refinement subtree with the largest size. Different from the agreement subtree, the relation of having a refinement is not an equivalence relation -- even if x and y, and y and z, have refinement subtrees, x and z do not necessarily have a refinement subtree. This implies that the corresponding M_{x,y} is not necessarily transitive. Therefore, Theorem 2 asserts that the corresponding K(x, y) is not necessarily positive semidefinite.
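The following sketch indicates how Eq. (5) differs from the plain MAST score. The helper agreement_subtree_sizes is hypothetical -- enumerating agreement subtrees efficiently is a separate algorithmic question -- and the toy size list is made up for illustration only.

```python
# A sketch of Eq. (5); `agreement_subtree_sizes` is an assumed helper that
# would return the multiset of sizes of all agreement subtrees of x and y.
def size_distribution_kernel(x, y, agreement_subtree_sizes, alpha=2.0):
    # f(n) = alpha ** n is one possible increasing function f : N -> R+.
    return sum(alpha ** n for n in agreement_subtree_sizes(x, y))

def mast_similarity(x, y, agreement_subtree_sizes):
    # The classical MAST score keeps only the peak of the same distribution.
    return max(agreement_subtree_sizes(x, y), default=0)

# Purely illustrative stand-in: pretend these are the sizes of the agreement
# subtrees of a fixed pair of trees (made-up numbers).
def toy_sizes(x, y):
    return [1, 1, 2, 2, 3]

print(size_distribution_kernel("t1", "t2", toy_sizes))  # 2+2+4+4+8 = 20.0
print(mast_similarity("t1", "t2", toy_sizes))           # 3
```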
3.2. Edit-cost-distribution Kernels

The edit distance is also used as an effective measure of similarity between discrete data structures (e.g. (Wagner & Fischer, 1974) for strings, (Barnard et al., 1995) for trees, (Bunke, 1997) for general graphs). Let x be an object consisting of one or more components. For example, a string consists of one or more characters which are laid out on a line. For another example, a graph consists of one or more vertices and edges, and each edge connects a vertex to another. We first give a general definition of edit operations, edit scripts, edit costs and edit distances for such objects.

An edit operation is an operation on a component of x, and is one of (i) substituting a component b for a component a of x (denoted by ⟨a → b⟩), (ii) deleting a component a of x (denoted by ⟨a → ·⟩), and (iii) inserting a component a into x (denoted by ⟨· → a⟩). An edit script is a sequence of zero or more edit operations which transforms an object into another. When a cost γ(a → b) ∈ R is given for each edit operation ⟨a → b⟩¹, the cost γ(σ) of an edit script σ is the sum of the costs of the edit operations that comprise σ. Finally, an edit distance d(x, y) between objects x and y is defined by

d(x, y) = \min \{ \gamma(\sigma) \mid \sigma \text{ transforms } x \text{ into } y \}.

Therefore, those edit scripts with larger costs than the minimum cost do not contribute to the final edit distance. In contrast, by introducing kernels by Eq. (6) with a decreasing function f : R_+ → R_+, we try to take advantage of the information that those discarded edit scripts potentially carry.

K(x, y) = \sum_{\sigma \text{ transforms } x \text{ into } y} f(\gamma(\sigma))    (6)

¹ Usually, components are labeled with elements of an alphabet, and costs of edit operations are defined on the labels rather than on the components. However, for simplicity, we assume that the cost function is defined over the space of objects in the present paper. In addition, to make the resulting edit distance a distance metric, the costs are often assumed to be a distance metric.

It is important to note that there exists a natural interpretation of Eq. (6). In a natural setting where the cost γ(a → b) is defined as the negative logarithm of the probability that the substitution of b for a (a or b could be ·) would occur (e.g. (Li & Jiang, 2005; Salzberg, 1997)), we let f(x) = e^{-x}. For an edit script σ = ⟨x_1 → y_1⟩ ··· ⟨x_n → y_n⟩ transforming x into y, f(γ(σ)) is evaluated as follows.

f(\gamma(\sigma)) = e^{-\gamma(\sigma)} = e^{-\sum_{i=1}^{n} \gamma(x_i \to y_i)} = e^{-\sum_{i=1}^{n} (-\log \Pr(x_i \to y_i))} = \prod_{i=1}^{n} \Pr(x_i \to y_i)

Hence, K(x, y) by Eq. (6) equals the total probability that x would be transformed into y.
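As a concrete, hedged illustration of Eq. (6) under this probabilistic reading, the sketch below treats strings as objects and enumerates the cores of irreducible edit scripts as order-preserving one-to-one matchings between positions of x and y; the substitution, deletion and insertion probabilities are made-up constants, not values from the paper.

```python
import math
from itertools import combinations

# Hypothetical example costs: gamma(a -> b) = -log Pr(a -> b).
# These probabilities are invented purely for illustration.
P_SUB = 0.5   # probability of a substitution (or of keeping a character)
P_DEL = 0.2   # probability of deleting a character
P_INS = 0.2   # probability of inserting a character

def gamma_sub(a, b): return -math.log(P_SUB)
def gamma_del(a):    return -math.log(P_DEL)
def gamma_ins(b):    return -math.log(P_INS)

def cores(m, n):
    """Cores of irreducible string edit scripts between strings of lengths m
    and n, modeled as order-preserving one-to-one matchings of positions."""
    result = []
    for size in range(min(m, n) + 1):
        for xs in combinations(range(m), size):
            for ys in combinations(range(n), size):
                result.append(list(zip(xs, ys)))   # monotone by construction
    return result

def edit_cost_distribution_kernel(x, y):
    """Eq. (6) with f(c) = exp(-c), summed over irreducible edit scripts
    (identified up to reordering, i.e. summed over cores)."""
    total = 0.0
    for core in cores(len(x), len(y)):
        matched_x = {i for i, _ in core}
        matched_y = {j for _, j in core}
        cost = sum(gamma_sub(x[i], y[j]) for i, j in core)
        cost += sum(gamma_del(x[i]) for i in range(len(x)) if i not in matched_x)
        cost += sum(gamma_ins(y[j]) for j in range(len(y)) if j not in matched_y)
        total += math.exp(-cost)   # = product of the corresponding probabilities
    return total

print(edit_cost_distribution_kernel("ab", "ac"))
```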
Usage of sophisticated classifiers such as SVM is another potential advantage of the kernels of the form of Eq. (6). In fact, as shown below, the kernels can be viewed as mapping kernels, if we can pose the following four assumptions.
1. The cost function is symmetric (i.e. γ(a → b) = γ(b → a)).
2. We let f(x) = e^{-cx} for some positive constant c.
3. In order to avoid calculating infinite sums, we take only irreducible edit scripts into consideration in calculating Eq. (6): assume that σ = ⟨x_1 → y_1⟩ ··· ⟨x_n → y_n⟩ transforms x into y; σ is irreducible, if, and only if, (1) x_i (resp. y_i) is either a component of x (resp. y) or ·, and (2) exactly one edit operation ⟨x_i → y_i⟩ is applied to each component of x and y.
4. If two irreducible edit scripts differ from each other only in the order of the included edit operations, they are identified in calculating Eq. (6), that is, they are evaluated only once.

For σ = ⟨x_1 → y_1⟩ ··· ⟨x_n → y_n⟩, we assume that x_i and y_i are respectively components of x and y, if, and only if, i ∈ {1, ..., m(σ)}, and call ⟨x_1 → y_1⟩ ··· ⟨x_{m(σ)} → y_{m(σ)}⟩ the core of σ. Then, γ(σ) and K(x, y) are evaluated as follows.

\gamma(\sigma) = \sum_{i=1}^{m(\sigma)} \left( \gamma(x_i \to y_i) - \gamma(x_i \to \cdot) - \gamma(\cdot \to y_i) \right) + \sum_{x' \in x} \gamma(x' \to \cdot) + \sum_{y' \in y} \gamma(\cdot \to y')

K(x, y) = \left( \prod_{x' \in x} f(\gamma(x' \to \cdot)) \right) \left( \prod_{y' \in y} f(\gamma(\cdot \to y')) \right) \sum_{\sigma} \prod_{i=1}^{m(\sigma)} \frac{f(\gamma(x_i \to y_i))}{f(\gamma(x_i \to \cdot)) \, f(\gamma(\cdot \to y_i))}    (7)

In Eq. (7), the first two factors of the right-hand side are functions of x and y, and therefore, we denote them by g(x) and g(y), respectively. On the other hand, the factor inside the sum is a function of x′ = (x_1, ..., x_{m(σ)}) and y′ = (y_1, ..., y_{m(σ)}), and is denoted by k(x′, y′). We define M_{x,y} as follows.

M_{x,y} = \{ ((x_1, ..., x_m), (y_1, ..., y_m)) \mid \exists \sigma \ [\langle x_1 \to y_1 \rangle \cdots \langle x_m \to y_m \rangle \text{ is the core of } \sigma] \}

Then, the following holds.

K(x, y) = g(x) \cdot g(y) \cdot \sum_{(x', y') \in M_{x,y}} k(x', y') = g(x) \cdot g(y) \cdot \bar{K}(x, y)

In particular, K̄(x, y) is a mapping kernel, and K(x, y) is positive semidefinite, if, and only if, so is K̄(x, y), since g(x) cannot take the value 0. The kernel K̄(x, y), however, is not necessarily positive semidefinite, even if k(x′, y′) is positive semidefinite, since {M_{x,y}} is not necessarily transitive. We will investigate this problem taking the tree edit distance as an example.

For the tree edit distance, the edit operations act on vertices of trees. For a pair (x′, y′) to be the core of some irreducible tree edit script, it is necessary and sufficient that the mapping φ defined by φ(x_i) = y_i preserves the ancestor-descendant relation and the sibling (left-to-right) relation (Tai, 1979). Therefore, M_{x,y} for the general tree edit distance is defined as follows, where x_i < x_j means that x_j is an ancestor of x_i, and x_i ≺ x_j means that x_j is located on the right side of x_i.

M_{x,y} = \{ ((x_1, ..., x_m), (y_1, ..., y_m)) \mid [x_i < x_j \Leftrightarrow y_i < y_j] \wedge [x_i \prec x_j \Leftrightarrow y_i \prec y_j] \}    (8)

It is straightforward to verify that this {M_{x,y}} is transitive. Therefore, Theorem 2 asserts that, if k(x′, y′) is positive semidefinite, so is K̄(x, y) for this {M_{x,y}}. On the other hand, two subclasses of the general tree edit distance have been proposed: the constrained (a.k.a. structure-preserving) tree edit distance (Zhang, 1995) and the less-constrained (a.k.a. alignable) tree edit distance (Lu et al., 2001). Those subclasses of the general tree edit distance determine respective M_{x,y}, which are generally proper subsets of those defined by (8). Since {M_{x,y}} for the constrained tree edit distance is easily verified to be transitive, the resulting K̄(x, y) turns out positive semidefinite by virtue of Theorem 2. In contrast to the constrained edit distance, {M_{x,y}} for the less-constrained tree edit distance is not transitive. Therefore, Theorem 2 implies that K̄(x, y) is not necessarily positive semidefinite.

4. Proof of Theorem 2

4.1. Key Lemma

Let X^{ij} be m-dimensional square matrices parameterized by (i, j) ∈ {1, ..., n}², and let X denote the derived mn-dimensional square matrix [X^{ij}]_{i,j=1,...,n} -- the (m(i-1)+k, m(j-1)+l)-element of X, denoted by X^{ij}_{kl}, is defined to be the (k, l)-element of X^{ij}. Furthermore, for an m-dimensional square matrix A, smry_A(X) denotes the n-dimensional square matrix [tr(A^T X^{ij})]_{i,j=1,...,n}. Note that the (i, j)-element of smry_A(X) is given by Eq. (9).

\mathrm{tr}(A^T X^{ij}) = \sum_{k=1}^{m} \sum_{l=1}^{m} A_{kl} X^{ij}_{kl}    (9)

Proposition 1. For an m-dimensional square matrix A, the following are equivalent to each other.
1. A is positive semidefinite.
2. smry_A(X) is positive semidefinite for an arbitrary mn-dimensional positive semidefinite matrix X.

Proof. First, we prove the assertion assuming that A is diagonal, whose I-th diagonal element is λ_I. Condition 2 implies 1, since we see λ_I ≥ 0 for any I by letting X be the sparse matrix such that X_{kl} is 1, if k = l = I, and 0, otherwise. On the other hand, the converse follows from Eq. (10), since smry_A(X) = Z^T Z holds for the m²n × n matrix Z such that Z_{mn(I-1)+m(k-1)+l, j} = \sqrt{\lambda_I} Y^{kj}_{lI}, where Y is an mn-dimensional matrix such that X = Y^T Y.

\mathrm{tr}(A^T X^{ij}) = \sum_{I=1}^{m} \lambda_I \sum_{k=1}^{n} \sum_{l=1}^{m} Y^{ki}_{lI} Y^{kj}_{lI} = \sum_{I=1}^{m} \sum_{k=1}^{n} \sum_{l=1}^{m} \left( \sqrt{\lambda_I} Y^{ki}_{lI} \right) \left( \sqrt{\lambda_I} Y^{kj}_{lI} \right)    (10)

The general case for non-diagonal A reduces to the diagonal case, since, for P such that P^T A P is diagonal, smry_A(X) = smry_{P^T A P}(X̃) holds for X̃ = [P^T X^{ij} P]_{i,j=1,...,n}.
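The following small sketch (random sizes and matrices are arbitrary choices, not part of the paper) implements smry_A(X) block-wise and checks numerically the direction (1) ⇒ (2) of Proposition 1: a positive semidefinite A summarizes every positive semidefinite X into a positive semidefinite smry_A(X).

```python
# A small numerical check of Proposition 1 (direction 1 => 2) on random
# matrices; sizes and the random construction are arbitrary assumptions.
import numpy as np

def smry(A, X, m, n):
    """smry_A(X): the n x n matrix of tr(A^T X^{ij}) over the m x m blocks of X."""
    S = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            block = X[i*m:(i+1)*m, j*m:(j+1)*m]
            S[i, j] = np.trace(A.T @ block)
    return S

rng = np.random.default_rng(0)
m, n = 3, 4
B = rng.standard_normal((m, m)); A = B.T @ B        # a PSD m x m matrix
C = rng.standard_normal((m*n, m*n)); X = C.T @ C    # a PSD mn x mn matrix

S = smry(A, X, m, n)
print(np.linalg.eigvalsh(S).min() >= -1e-9)          # True: smry_A(X) is PSD
```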
4.2. (1) Implies (2)

Investigating whether K is positive semidefinite is equivalent to investigating whether the Gram matrices for finite subsets of X are positive semidefinite. Therefore, without any loss of generality, we may assume that X is a finite set {x_1, ..., x_n}. Since the M_{x_i, x_j} are finite, we may also assume that the x̂_i are finite. We slightly extend the definition of (X″, k, {π_x}) by adding a new element · such that k(·, ·) = k(·, x′) = k(x′, ·) = 0 hold for an arbitrary x′ ∈ X″. Even after the extension, (X″, k, {π_x}) still remains an evaluating system for M.

Next, we define X̄, M̄ and {π̄_x} as follows: X̄ is the disjoint union ∐_{i=1}^{n} x̂_i; x̄′ denotes the image of x′ in X̄; M̄ = {(x̄′, ȳ′) | (x′, y′) ∈ M_{x,y}, x, y ∈ X}; π̄_x : X̄ → X″ satisfies that π̄_x(x̄′) = π_x(x′), if x′ ∈ x̂, and π̄_x(x̄′) = ·, otherwise. Then, the mapping kernel K with respect to M and E is rewritten as follows.

K(x, y) = \sum_{(\bar{x}', \bar{y}') \in \bar{M}} k(\bar{\pi}_x(\bar{x}'), \bar{\pi}_y(\bar{y}'))

Furthermore, K(x_i, x_j) = tr(A^T X^{ij}) holds, when we define m-dimensional matrices A and X^{ij} for X̄ = {x̄′_1, ..., x̄′_m} as follows: A_{kl} = 1 if (x̄′_k, x̄′_l) ∈ M̄, and A_{kl} = 0 otherwise; X^{ij}_{kl} = k(π̄_{x_i}(x̄′_k), π̄_{x_j}(x̄′_l)).

To show the assertion, it suffices to prove that A is positive semidefinite, by Proposition 1 (X = [X^{ij}]_{i,j=1,...,n} is positive semidefinite by definition). A is symmetric, since (x′, y′) ∈ M_{x,y} ⇔ (y′, x′) ∈ M_{y,x} holds. The hypothesis that {M_{x,y}} is transitive implies that {1, ..., m} is decomposed into U_1 ∐ ··· ∐ U_M such that: U_a ∩ U_b = ∅, if a ≠ b; (x̄′_k, x̄′_l) ∈ M̄, if, and only if, k, l ∈ U_a for some a ∈ {1, ..., M}. Therefore, A = Σ_{a=1}^{M} A[U_a] holds, and therefore, A is positive semidefinite, since so are the A[U_a].
4.3. (2) Implies (1)

We prove the contraposition of the assertion. If M is not transitive, A includes at least one of the following sub-matrices (without any loss of generality, we may assume k < l < b), where A[i_1, ..., i_n] denotes the n-dimensional matrix whose (α, β)-element is A_{i_α, i_β}.

A[k, l] = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}    (11)

A[k, l] = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}    (12)

A[k, l, b] = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \end{pmatrix}    (13)

Note that any of them has a negative eigenvalue, since det A < 0 holds. We will see that there exists an instance of E = (X″, k, {π_x}) such that smry_A(X), which is the Gram matrix for X, is not positive semidefinite, if any of the above three cases occurs. In the remainder of this section, we will give a proof only for the case where Eq. (13) holds. The assertion for the simpler cases, that is, where either Eq. (11) or (12) holds, can be proved in almost the same way.

Let i, j and a denote the indices such that x̄′_k ∈ x̂_i, x̄′_l ∈ x̂_j and x̄′_b ∈ x̂_a (be reminded that X̄ is defined as the disjoint union of x̂ for x ∈ X). The indices are not necessarily different from each other. Further, let column vectors e_1, e_2 and e_3 be an orthogonal basis of R³ such that the following holds.

[e_1, e_2, e_3]^T A[k, l, b] \, [e_1, e_2, e_3] = \begin{pmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{pmatrix}

We assume λ_1 < 0 without any loss of generality, and define a positive semidefinite K as follows.

K = [e_1, e_2, e_3] \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} [e_1, e_2, e_3]^T

\mathrm{tr}(A[k, l, b]^T K) = \mathrm{tr}\left( \begin{pmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \right) = \lambda_1 < 0

Now, we define E = (X″, k, {π_x}) as follows.

X″ = {·, u, v, w}

\begin{pmatrix} k(u, u) & k(u, v) & k(u, w) \\ k(v, u) & k(v, v) & k(v, w) \\ k(w, u) & k(w, v) & k(w, w) \end{pmatrix} = K

π_x(x̄′) = u, if x = x_i and x̄′ = x̄′_k; v, if x = x_j and x̄′ = x̄′_l; w, if x = x_a and x̄′ = x̄′_b; ·, otherwise.

Below, we investigate three cases: the indices take the same value, that is, i = j = a; two of the indices coincide with each other, where we can assume i = j ≠ a without loss of generality; the indices are different from one another, that is, i ≠ j, j ≠ a, a ≠ i. For each case, we see that some diagonally located submatrix of smry_A(X) is not positive semidefinite. This implies that smry_A(X) itself is not positive semidefinite.

Case i = j = a: The submatrix smry_A(X)[i] is not positive semidefinite, since

\mathrm{smry}_A(X)[i] = \mathrm{tr}(A[k, l, b]^T K) < 0.

Case i = j ≠ a: We will show that smry_A(X)[i, a] is not positive semidefinite.

\mathrm{smry}_A(X)[i, a] = \begin{pmatrix} \mathrm{tr}(A^T X^{ii}) & \mathrm{tr}(A^T X^{ia}) \\ \mathrm{tr}(A^T X^{ai}) & \mathrm{tr}(A^T X^{aa}) \end{pmatrix}

\mathrm{tr}\left( \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \mathrm{smry}_A(X)[i, a] \right) = \mathrm{tr}(A[k, l, b][1, 2]^T K[1, 2]) + A[k, l, b]^T_{1,3} K_{1,3} + A[k, l, b]^T_{2,3} K_{2,3} + A[k, l, b]^T_{3,1} K_{3,1} + A[k, l, b]^T_{3,2} K_{3,2} + A[k, l, b]^T_{3,3} K_{3,3} = \mathrm{tr}(A[k, l, b]^T K) < 0

By Proposition 1, smry_A(X)[i, a] turns out not to be positive semidefinite.

Case i ≠ j ≠ a ≠ i: For α, β = 1, 2, 3, the (α, β)-element of smry_A(X)[i, j, a] coincides with A[k, l, b]^T_{α,β} K_{α,β}. Hence

\mathrm{tr}\left( \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \mathrm{smry}_A(X)[i, j, a] \right) = \mathrm{tr}(A[k, l, b]^T K) < 0.

By Proposition 1, smry_A(X)[i, j, a] turns out not to be positive semidefinite.
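A tiny numerical companion to this section (a made-up three-object example, not taken from the paper): the simplest non-transitive mapping system, paired with the constant underlying kernel k = 1, already produces the sub-matrix of Eq. (13) as its Gram matrix, which has a negative eigenvalue.

```python
# A made-up three-object example: a non-transitive mapping system together
# with the constant PSD kernel k = 1 yields a Gram matrix that is not PSD.
import numpy as np

objects = ["x1", "x2", "x3"]        # each object has a single component, itself
M = {
    ("x1", "x1"): {("x1", "x1")}, ("x2", "x2"): {("x2", "x2")}, ("x3", "x3"): {("x3", "x3")},
    ("x1", "x2"): {("x1", "x2")}, ("x2", "x1"): {("x2", "x1")},
    ("x2", "x3"): {("x2", "x3")}, ("x3", "x2"): {("x3", "x2")},
    ("x1", "x3"): set(),          ("x3", "x1"): set(),   # breaks transitivity
}

def K(x, y):
    return float(len(M[(x, y)]))    # underlying kernel k = 1 (trivially PSD)

G = np.array([[K(x, y) for y in objects] for x in objects])
print(G)                            # [[1,1,0],[1,1,1],[0,1,1]] -- matrix (13)
print(np.linalg.eigvalsh(G).min())  # about -0.41 < 0: not positive semidefinite
```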
References

Aoki, K. F., Yamaguchi, A., Okuno, Y., Akutsu, T., Ueda, N., Kanehisa, M., & Mamitsuka, H. (2003). Efficient tree-matching methods for accurate carbohydrate database queries. Genome Informatics, 14, 134-143.

Barnard, D., Clarke, G., & Duncan, N. (1995). Tree-to-tree correction for document trees (Technical Report 95-375). Queen's University, Kingston, Ontario K7L 3N6, Canada.

Berry, V., & Nicolas, F. (2004). Maximum agreement and compatible supertrees (extended abstract). Proc. of Combinatorial Pattern Matching (CPM) (pp. 205-219).

Bunke, H. (1997). On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters, 18, 689-694.

Collins, M., & Duffy, N. (2001). Convolution kernels for natural language. Advances in Neural Information Processing Systems 14 (NIPS 2001) (pp. 625-632). MIT Press.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press.

Haussler, D. (1999). Convolution kernels on discrete structures (Technical Report UCSC-CRL-99-10). Dept. of Computer Science, University of California at Santa Cruz.

Hein, J., Jiang, T., Wang, L., & Zhang, K. (1996). On the complexity of comparing evolutionary trees. Discrete Applied Mathematics, 71, 153-169.

Jansson, J., Ng, J. H. K., Sadakane, K., & Sung, W. K. (2005). Rooted maximum agreement supertrees. Algorithmica, 293-307.

Kashima, H., & Koyanagi, T. (2002). Kernels for semi-structured data. Proc. of the 19th International Conference on Machine Learning (ICML 2002) (pp. 291-298).

Kuboyama, T., Shin, K., & Kashima, H. (2006). Flexible tree kernels based on counting the number of tree mappings. Proc. of Machine Learning with Graphs.

Leslie, C. S., Eskin, E., Cohen, A., Weston, J., & Noble, W. S. (2004). Mismatch string kernels for discriminative protein classification. Bioinformatics, 20.

Li, H., & Jiang, T. (2005). A class of edit kernels for SVMs to predict translation initiation sites in eukaryotic mRNAs. Trans. on Comput. Syst. Bio. II, LNBI 3680, 48-58.

Lodhi, H., Shawe-Taylor, J., Cristianini, N., & Watkins, C. J. C. H. (2001). Text classification using string kernels. Advances in Neural Information Processing Systems, 13.

Lu, C. L., Su, Z.-Y., & Tang, G. Y. (2001). A new measure of edit distance between labeled trees. LNCS (pp. 338-348). Springer-Verlag, Heidelberg.

Menchetti, S., Costa, F., & Frasconi, P. (2005). Weighted decomposition kernels. Proc. of the 22nd International Conference on Machine Learning.

Pelillo, M. (2002). Matching free trees, maximal cliques, and monotone game dynamics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 1535-1541.

Salzberg, S. L. (1997). A method for identifying splice sites and translational start sites in eukaryotic mRNA. Computer Applications in the Biosciences, 13, 365-376.

Suzuki, J., Isozaki, H., & Maeda, E. (2004). Convolution kernels with feature selection for natural language processing tasks. Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 119-126).

Tai, K. C. (1979). The tree-to-tree correction problem. Journal of the ACM, 26, 422-433.

Wagner, R., & Fischer, M. (1974). The string-to-string correction problem. Journal of the ACM, 21, 168-173.

Zhang, K. (1995). Algorithms for the constrained editing distance between ordered labeled trees and related problems. Pattern Recognition, 28, 463-474.

Zien, A., Rätsch, G., Mika, S., Schölkopf, B., Lengauer, T., & Müller, K. R. (2000). Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, 16, 799-807.