Nonextensive Entropic Kernels

André F. T. Martins (afm@cs.cmu.edu)
Mário A. T. Figueiredo (mario.figueiredo@lx.it.pt)
Pedro M. Q. Aguiar (aguiar@isr.ist.utl.pt)
Noah A. Smith (nasmith@cs.cmu.edu)
Eric P. Xing (epxing@cs.cmu.edu)

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Instituto de Telecomunicações / Instituto de Sistemas e Robótica, Instituto Superior Técnico, Lisboa, Portugal

Abstract

Positive definite kernels on probability measures have recently been applied to structured data classification problems. Some of these kernels are related to classic information-theoretic quantities, such as mutual information and the Jensen-Shannon divergence. Meanwhile, driven by recent advances in Tsallis statistics, nonextensive generalizations of Shannon's information theory have been proposed. This paper bridges these two trends. We introduce the Jensen-Tsallis q-difference, a generalization of the Jensen-Shannon divergence. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, Jensen-Shannon, and linear kernels as particular cases. We illustrate the performance of these kernels on text categorization tasks.

1. Introduction

There has been recent interest in kernels on probability distributions, to tackle several classification problems (Moreno et al., 2003; Jebara et al., 2004; Hein & Bousquet, 2005; Lafferty & Lebanon, 2005; Cuturi et al., 2005). By mapping data points to fitted distributions in a parametric family where a kernel is defined, a kernel is automatically induced on the original input space. In text categorization, this appears as an alternative to the Euclidean geometry inherent to the usual bag-of-words vector representations. In fact, approaches that map data to a statistical manifold, where well-motivated non-Euclidean metrics may be defined (Lafferty & Lebanon, 2005), outperform SVM classifiers with linear kernels (Joachims, 1997). Some of these kernels have a natural information-theoretic interpretation, creating a bridge between kernel methods and information theory (Cuturi et al., 2005; Hein & Bousquet, 2005). We reinforce that bridge by introducing a new class of kernels rooted in nonextensive (NE) information theory.

The Shannon and Rényi entropies (Rényi, 1961) share the extensivity property: the joint entropy of a pair of independent random variables equals the sum of the individual entropies. Abandoning this property yields the so-called NE entropies (Havrda & Charvát, 1967; Tsallis, 1988), which have raised great interest among physicists in modeling certain phenomena (e.g., long-range interactions and multifractals) and as generalizations of Boltzmann-Gibbs statistical mechanics (Abe, 2006). NE entropies have also recently been used in signal/image processing (Li et al., 2006) and other areas (Gell-Mann & Tsallis, 2004).

The main contributions of this paper are:

· Based on the new concept of q-convexity and a related q-Jensen inequality, we introduce the Jensen-Tsallis q-difference, a NE generalization of the Jensen-Shannon (JS) divergence.

· We propose a broad family of positive definite (pd) kernels, which are interpretable as NE mutual information (MI) kernels. This family ranges from the Boolean to the linear kernels, and also includes the JS kernel (Hein & Bousquet, 2005).

· We extend results of Hein and Bousquet (2005) by proving positive definiteness of kernels based on the unbalanced JS divergence. As a side note, we show that the parametrix approximation of the multinomial diffusion kernel introduced by Lafferty and Lebanon (2005) is not pd in general.
Our main purpose is to present new theoretical insights about kernels on measures by unifying some well-known instances into a common parameterized family. This family allows reinterpreting these kernels in light of NE information theory, a connection that, to our knowledge, had not been presented before. The fact that some members of this family are novel pd kernels leads us to include a set of text categorization experiments that illustrates their effectiveness.

The paper is organized as follows. Sec. 2 reviews NE entropies, while Jensen differences and divergences are discussed in Sec. 3. In Sec. 4, the concepts of q-differences and q-convexity are introduced and used to define the Jensen-Tsallis q-difference. Sec. 5 presents the new family of entropic kernels. Sec. 6 reports experiments on text categorization and Sec. 7 presents concluding remarks and future research directions. Although, for simplicity, we focus on discrete distributions on finite sets, most results are valid in arbitrary measured spaces, as shown by Martins et al. (2008).

2. Nonextensive Information Theory

Let X denote a random variable (rv) taking values in a finite set 𝒳 = {x_1, ..., x_n} according to a probability distribution P_X. An entropy function is said to be extensive if it is additive over independent variables. For example, the Shannon entropy (Cover & Thomas, 1991), H(X) ≡ −E[ln P_X], is extensive: if X and Y are independent, then H(X, Y) = H(X) + H(Y). Another example is the family of Rényi entropies (Rényi, 1961), parameterized by q ≥ 0,

  R_q(X) ≡ (1/(1−q)) ln Σ_{i=1}^n P_X(x_i)^q,   (1)

which includes Shannon's entropy as a special case when q → 1.
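To make this definition concrete, here is a small numerical sketch (ours, not code from the paper; NumPy is assumed and the function names are illustrative) that evaluates the Rényi entropy (1) on a toy distribution and checks that it approaches the Shannon entropy as q → 1.

```python
import numpy as np

def shannon_entropy(p):
    """H(p) = -sum_i p_i ln p_i (natural log), ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def renyi_entropy(p, q):
    """Renyi entropy R_q(p) = (1/(1-q)) ln sum_i p_i^q, for q >= 0, q != 1."""
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** q)) / (1.0 - q)

p = np.array([0.5, 0.3, 0.2])
for q in [0.5, 0.9, 0.999]:
    print(q, renyi_entropy(p, q))   # approaches H(p) as q -> 1
print("Shannon:", shannon_entropy(p))
```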
In classic information theory, extensivity is considered desirable, and is enforced axiomatically (Khinchin, 1957), to express the idea borrowed from thermodynamics that "independent systems add their entropies." In contrast, the Tsallis entropies abandon the extensivity requirement (Tsallis, 1988). These NE entropies, denoted S_q(X), are defined as follows:

  S_q(X) ≡ −E_q(ln_q P_X) = (1/(q−1)) (1 − Σ_{i=1}^n P_X(x_i)^q),   (2)

where E_q(f) ≡ Σ_{i=1}^n P(x_i)^q f(x_i) is the unnormalized q-expectation, and ln_q(y) ≡ (y^{1−q} − 1)/(1−q) is the so-called q-logarithm. It is noteworthy that when q → 1, we get E_q → E, ln_q → ln, and S_q → H; i.e., the family of Tsallis entropies also includes Shannon's entropy. For the Tsallis family, when X and Y are independent, extensivity no longer holds; instead, we have

  S_q(X, Y) = S_q(X) + S_q(Y) − (q − 1) S_q(X) S_q(Y),   (3)

where the parameter q ≥ 0 is called the entropic index.

While statistical physics has been the main application of Tsallis entropies, some attempts have been made to produce NE generalizations of classic information theory results (Furuichi, 2006). As for the Shannon entropy, the Tsallis joint and conditional entropies are defined as S_q(X, Y) ≡ −E_q[ln_q P_{XY}] and S_q(X|Y) ≡ −E_q[ln_q P_{X|Y}], respectively, and follow a chain rule S_q(X, Y) = S_q(X) + S_q(Y|X). Similarly, Furuichi (2006) defines the Tsallis MI as

  I_q(X; Y) ≡ S_q(X) − S_q(X|Y) = I_q(Y; X),   (4)

generalizing (for q > 1) Shannon's MI. This NE version of the MI underlies one of the central contributions of this paper: the Jensen-Tsallis q-difference (Sec. 4).

For reasons that will become clear in Sec. 5, it is convenient to extend the domain of Tsallis entropies to unnormalized measures, i.e., in R_+^n ≡ {µ ∈ R^n | ∀i, µ_i ≥ 0}, but not necessarily in the probability simplex P^{n−1} ≡ {p ∈ R^n | Σ_{i=1}^n p_i = 1, ∀i, p_i ≥ 0}. The Tsallis entropy of a measure µ ∈ R_+^n is¹

  S_q(µ) ≡ −Σ_{i=1}^n µ_i^q ln_q µ_i = Σ_{i=1}^n φ_q(µ_i),   (5)

where φ_q : R_+ → R is given by

  φ_q(y) = −y^q ln_q y = { −y ln y, if q = 1;  (y − y^q)/(q − 1), if q ≠ 1 }.   (6)

¹ In the following, we represent normalized and unnormalized measures as vectors in R^n, and we use those as arguments of entropy functions; e.g., we write H(π) to denote H(X) where X ∼ P(X), with π_i = P(x_i).

This extension does not add expressive power, since the function (5) is completely determined by its values on P^{n−1}, as shown by the following proposition (the proof is straightforward).

Proposition 1. The following denormalization formula holds for any c ≥ 0 and µ ∈ R_+^n:

  S_q(cµ) = c^q S_q(µ) + φ_q(c) ‖µ‖_1,   (7)

where ‖µ‖_1 ≡ Σ_{i=1}^n µ_i is the 1-norm of µ.

This fact will be used in a constructive way in Sec. 5 to devise a family of pd NE entropic kernels.

3. Jensen Differences and Divergences

Jensen's inequality is at the heart of many important results in information theory. Let the rv Z take values in a finite set 𝒵. Jensen's inequality states that if f is a convex function defined on the convex hull of 𝒵, then f(E[Z]) ≤ E[f(Z)]. The nonnegative quantity E[f(Z)] − f(E[Z]) is known as the Jensen difference and has been studied by Burbea and Rao (1982) when −f is some form of generalized entropy. Here, we are interested in the case where Z ∈ {µ_1, ..., µ_m} is a random measure, where each µ_j ∈ R_+^n, with probabilities π = (π_1, ..., π_m) ∈ P^{m−1}. The Jensen difference induced by a (concave) generalized entropy Ψ is

  J_Ψ^π(µ_1, ..., µ_m) ≡ Ψ(Σ_{j=1}^m π_j µ_j) − Σ_{j=1}^m π_j Ψ(µ_j) = Ψ(E[Z]) − E[Ψ(Z)].   (8)

Below, we show examples of Jensen differences that have been applied in machine learning. In Sec. 4, we provide a NE generalization of the Jensen difference.

Jensen-Shannon (JS) Divergence. Consider a classification problem with m classes, Y ∈ 𝒴 = {1, ..., m}, with a priori probabilities π = (π_1, ..., π_m) ∈ P^{m−1}. Let p_j = (p_{j1}, ..., p_{jn}) ∈ P^{n−1}, for j = 1, ..., m, where p_{ji} ≡ P(X = x_i | Y = j), be the corresponding class-conditional distributions. Letting Ψ in (8) be H, the Shannon entropy, the resulting Jensen difference J_H^π(p_1, ..., p_m) is known as the JS divergence of p_1, ..., p_m, with weights π_1, ..., π_m (Burbea & Rao, 1982; Lin, 1991). In this instance of the Jensen difference,

  J_H^π(p_1, ..., p_m) = I(X; Y),   (9)

where I(X; Y) = H(X) − H(X|Y) is the MI between X and Y (Banerjee et al., 2005). For m = 2 and π = (1/2, 1/2), we denote the ensuing J_H^{(1/2,1/2)}(p_1, p_2) as JS(p_1, p_2):

  JS(p_1, p_2) = H((p_1 + p_2)/2) − (H(p_1) + H(p_2))/2.   (10)

It can be shown that √JS satisfies the triangle inequality and is a Hilbertian metric² (Endres & Schindelin, 2003; Topsøe, 2000), which has motivated its use in kernel-based machine learning.

² A metric d : 𝒳 × 𝒳 → R is Hilbertian if there is some Hilbert space H and an isometry f : 𝒳 → H such that d²(x, y) = ⟨f(x) − f(y), f(x) − f(y)⟩_H holds for any x, y ∈ 𝒳 (Hein & Bousquet, 2005).

Jensen-Rényi (JR) Divergence. Let Ψ = R_q, which is concave for q ∈ [0, 1); then, (8) becomes

  JR_q^π(p_1, ..., p_m) ≡ R_q(E[p]) − E[R_q(p)].   (11)

We call JR_q^π the JR divergence. When m = 2 and π = (1/2, 1/2), we write JR_q(p_1, p_2), where

  JR_q(p_1, p_2) = R_q((p_1 + p_2)/2) − (R_q(p_1) + R_q(p_2))/2.

The JR divergence has been used in signal processing applications (Karakos et al., 2007). We show in Sec. 5.3 that √(JR_q) is also a Hilbertian metric.

Jensen-Tsallis (JT) Divergence. Divergences of the form (8), with Ψ = S_q, are known as JT divergences (Burbea & Rao, 1982) and were recently used in image processing (Hamza, 2006). Unlike the JS divergence, the JT divergence lacks an MI interpretation; in Sec. 4, we introduce an alternative to the JT divergence, which is interpretable as a NE MI in the sense of Furuichi (2006).
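As an illustration of the quantities defined in the last two sections, the following sketch (ours, with illustrative names, not code from the paper) implements the q-logarithm, the Tsallis entropy (5) of a possibly unnormalized measure, and the two-distribution JS divergence, and numerically verifies the denormalization formula (7) of Proposition 1.

```python
import numpy as np

def log_q(y, q):
    """q-logarithm: ln_q(y) = (y^(1-q) - 1)/(1 - q); ordinary ln in the limit q -> 1."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if q == 1 else (y ** (1 - q) - 1) / (1 - q)

def tsallis_entropy(mu, q):
    """S_q(mu) = -sum_i mu_i^q ln_q(mu_i), defined for unnormalized measures (eq. 5)."""
    mu = np.asarray(mu, dtype=float)
    nz = mu > 0                      # phi_q(0) = 0 by continuity
    return -np.sum(mu[nz] ** q * log_q(mu[nz], q))

def js_divergence(p1, p2):
    """JS(p1, p2) = H((p1+p2)/2) - (H(p1) + H(p2))/2, with H = S_1."""
    H = lambda p: tsallis_entropy(p, 1.0)
    return H((p1 + p2) / 2) - (H(p1) + H(p2)) / 2

# Check Proposition 1: S_q(c*mu) = c^q S_q(mu) + phi_q(c) * ||mu||_1,
# where phi_q(c) = -c^q ln_q(c).
mu, c, q = np.array([0.2, 0.5, 0.3]), 3.0, 1.5
phi_q = -(c ** q) * log_q(c, q)
lhs = tsallis_entropy(c * mu, q)
rhs = c ** q * tsallis_entropy(mu, q) + phi_q * np.sum(mu)
assert np.isclose(lhs, rhs)
```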
4. Jensen q-Differences

We now introduce Jensen q-differences, a generalization of Jensen differences. As described shortly, a special case of the Jensen q-difference is the Jensen-Tsallis q-difference, which is an NE generalization of the JS divergence and provides the building block for the NE entropic kernels to be introduced in Sec. 5. We begin by introducing the concept of "q-convexity", which satisfies a Jensen-type inequality.

Definition 1. Let q ∈ R and 𝒳 a convex set. A function f : 𝒳 → R is q-convex if, for any x, y ∈ 𝒳 and λ ∈ [0, 1],

  f(λx + (1 − λ)y) ≤ λ^q f(x) + (1 − λ)^q f(y).

f is q-concave if −f is q-convex.

Naturally, 1-convexity is the usual convexity. The next proposition states the q-Jensen inequality and is easily proved by induction, like the standard Jensen inequality (Cover & Thomas, 1991). It also states that the property of q-convexity gets stronger as q increases.

Proposition 2. If f : 𝒳 → R is q-convex and f ≥ 0, then, for any n ∈ N, x_1, ..., x_n ∈ 𝒳 and π ∈ P^{n−1},

  f(Σ_{i=1}^n π_i x_i) ≤ Σ_{i=1}^n π_i^q f(x_i).   (12)

Moreover, if q ≥ r ≥ 0, we have:

  f is q-convex ⇒ f is r-convex,   (13)
  f is r-concave ⇒ f is q-concave.   (14)

Based on the q-Jensen inequality, we can now consider Jensen q-differences of the form E_q[f(Z)] − f(E[Z]), which are nonnegative if f is q-convex. As in Sec. 3, we focus on the scenario where Z is a random measure and −f = Ψ is an entropy function, yielding

  T_{q,Ψ}^π(µ_1, ..., µ_m) ≡ Ψ(Σ_{t=1}^m π_t µ_t) − Σ_{t=1}^m π_t^q Ψ(µ_t) = Ψ(E[Z]) − E_q[Ψ(Z)].   (15)

The Jensen q-difference is a deformation of the Jensen 1-difference (8), in which the second expectation is replaced by a q-expectation. We are now ready to introduce the class of Jensen-Tsallis q-differences.

Jensen-Tsallis q-Differences. Consider again the classification problem used in the description of the JS divergence, but replacing the Jensen difference with the Jensen q-difference and the Shannon entropy with the Tsallis q-entropy, i.e., letting Ψ = S_q in (15). We obtain (writing T_{q,S_q}^π simply as T_q^π):

  T_q^π(p_1, ..., p_m) = S_q(X) − S_q(X|Y) = I_q(X; Y),   (16)

where S_q(X|Y) is the Tsallis conditional q-entropy and I_q(X; Y) is the Tsallis MI (cf. (4)). Note that (16) is an NE analogue of (9), i.e., the Jensen-Tsallis q-differences are NE mutual informations. We call T_q^π(p_1, ..., p_m) the Jensen-Tsallis q-difference of p_1, ..., p_m with weights π_1, ..., π_m.

When m = 2 and π = (1/2, 1/2), define T_q ≡ T_q^{(1/2,1/2)}:

  T_q(p_1, p_2) = S_q((p_1 + p_2)/2) − (S_q(p_1) + S_q(p_2))/2^q.   (17)

Three special cases are obtained for q ∈ {0, 1, 2}:³

  S_0(p) = −1 + ‖p‖_0,    T_0(p_1, p_2) = 1 − ‖p_1 ∘ p_2‖_0;
  S_1(p) = H(p),        T_1(p_1, p_2) = JS(p_1, p_2);
  S_2(p) = 1 − ⟨p, p⟩,    T_2(p_1, p_2) = 1/2 − (1/2)⟨p_1, p_2⟩.

³ Here ‖x‖_0 is the number of nonzeros in x, ∘ denotes the Hadamard-Schur (elementwise) product, and ⟨·, ·⟩ is the inner product.

The JT q-difference is an NE generalization of the JS divergence, and some of the latter's properties are lost in general. Since Tsallis entropies are 1-concave, Prop. 2 guarantees q-concavity only for q ≥ 1. Therefore, nonnegativity is only guaranteed for JT q-differences when q ≥ 1; for this reason, some authors only consider this range of values (Furuichi, 2006). Moreover, unless q = 1 (the JS divergence), it is not generally true that T_q(p, ..., p) = 0, or even that T_q(p, ..., p, p̃) ≥ T_q(p, ..., p, p). For example,

  argmin_{p_1 ∈ P^{n−1}} T_q(p_1, p_2)   (18)

can be different from p_2, unless q = 1. In general, the minimizer is closer to either the uniform distribution (if q ∈ [0, 1)) or a degenerate distribution (for q ∈ (1, 2]); indeed, since T_2(p_1, p_2) = 1/2 − (1/2)⟨p_1, p_2⟩, in that case (18) becomes a linear program whose solution is p_1 = e_j, with j = argmax_i p_{2i}. For these reasons, the term "divergence" is misleading and we use the term "difference." Other properties of JT q-differences (convexity, lower/upper bounds) are studied by Martins et al. (2008).
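As a quick sanity check of the definitions above, the sketch below (ours; it reuses the tsallis_entropy and js_divergence helpers from the earlier sketch) computes the Jensen-Tsallis q-difference (15)-(17) and confirms two of the special cases: q = 1 recovers the JS divergence, and q = 2 gives 1/2 − ⟨p_1, p_2⟩/2.

```python
import numpy as np

def jensen_tsallis_q_difference(ps, pis, q):
    """T_q^pi(p_1,...,p_m) = S_q(sum_t pi_t p_t) - sum_t pi_t^q S_q(p_t)  (eq. 15 with Psi = S_q)."""
    ps = [np.asarray(p, dtype=float) for p in ps]
    pis = np.asarray(pis, dtype=float)
    mixture = sum(pi * p for pi, p in zip(pis, ps))
    return tsallis_entropy(mixture, q) - sum(pi ** q * tsallis_entropy(p, q)
                                             for pi, p in zip(pis, ps))

p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.3, 0.6])
half = [0.5, 0.5]

# q = 1 recovers the JS divergence; q = 2 gives 1/2 - <p1, p2>/2.
assert np.isclose(jensen_tsallis_q_difference([p1, p2], half, 1.0), js_divergence(p1, p2))
assert np.isclose(jensen_tsallis_q_difference([p1, p2], half, 2.0), 0.5 - 0.5 * np.dot(p1, p2))
```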
5. Nonextensive Entropic Kernels

Using the denormalization formula (7), we now introduce kernels based on the JS divergence and the JT q-difference, which allow weighting their arguments. In this section, m = 2 (kernels involve pairs of measures).

5.1. Background on Kernels

We begin with some basic results on kernels (Schölkopf & Smola, 2002). Below, 𝒳 denotes a nonempty set, R_+ denotes the nonnegative reals, and R_{++} ≡ R_+ \ {0}.

Definition 2. Let φ : 𝒳 × 𝒳 → R be a symmetric function, i.e., φ(y, x) = φ(x, y) for all x, y ∈ 𝒳. φ is called a pd kernel if and only if

  Σ_{i=1}^n Σ_{j=1}^n c_i c_j φ(x_i, x_j) ≥ 0   (19)

for any integer n, x_1, ..., x_n ∈ 𝒳 and c_1, ..., c_n ∈ R. A symmetric function ψ : 𝒳 × 𝒳 → R is called a negative definite (nd) kernel if and only if

  Σ_{i=1}^n Σ_{j=1}^n c_i c_j ψ(x_i, x_j) ≤ 0   (20)

for any integer n, x_1, ..., x_n ∈ 𝒳 and c_1, ..., c_n ∈ R satisfying the additional constraint Σ_i c_i = 0. In this case, −ψ is called conditionally pd; obviously, positive definiteness implies conditional positive definiteness.

Both the sets of pd and nd kernels are closed under pointwise sums/integrations, the former being also closed under pointwise products; moreover, both sets are closed under pointwise convergence. While pd kernels correspond to inner products via embedding in a Hilbert space, nd kernels that vanish on the diagonal and are positive anywhere else correspond to squared Hilbertian distances. These facts, and the following ones, are shown by Berg et al. (1984).

Proposition 3. Let ψ : 𝒳 × 𝒳 → R be a symmetric function, and x_0 ∈ 𝒳. Let φ : 𝒳 × 𝒳 → R be

  φ(x, y) = ψ(x, x_0) + ψ(y, x_0) − ψ(x, y) − ψ(x_0, x_0).

Then, φ is pd if and only if ψ is nd.

Proposition 4. The function ψ : 𝒳 × 𝒳 → R is an nd kernel if and only if exp(−tψ) is pd for all t > 0.

Proposition 5. The function ψ : 𝒳 × 𝒳 → R_+ is an nd kernel if and only if (t + ψ)^{−1} is pd for all t > 0.

Proposition 6. If ψ is nd and ψ(x, x) ≥ 0 for all x ∈ 𝒳, then so are ψ^α, for α ∈ [0, 1], and ln(1 + ψ).

Proposition 7. If f : 𝒳 → R satisfies f ≥ 0, then, for α ∈ [1, 2], the function −(f(x) + f(y))^α is an nd kernel.
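A small numerical illustration of Prop. 3 (our own sketch, not part of the paper): ψ(x, y) = (x − y)² is a classic nd kernel on the reals, and the construction of Prop. 3 with x_0 = 0 turns it into φ(x, y) = x² + y² − (x − y)² = 2xy, i.e., twice the linear kernel, whose Gram matrices are positive semidefinite.

```python
import numpy as np

def psi(x, y):
    """A classic nd kernel on R: squared difference."""
    return (x - y) ** 2

def phi(x, y, x0=0.0):
    """Prop. 3 construction: phi(x, y) = psi(x, x0) + psi(y, x0) - psi(x, y) - psi(x0, x0)."""
    return psi(x, x0) + psi(y, x0) - psi(x, y) - psi(x0, x0)

xs = np.array([-1.3, 0.0, 0.7, 2.1, 3.5])
G = np.array([[phi(a, b) for b in xs] for a in xs])

# All eigenvalues of the Gram matrix are (numerically) nonnegative, so phi is pd here.
print(np.linalg.eigvalsh(G) >= -1e-9)
```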
5.2. Jensen-Shannon and Tsallis Kernels

The basic result that allows deriving pd kernels based on the JS divergence and, more generally, on the JT q-difference, is the fact that the denormalized Tsallis q-entropies are nd functions⁴ on R_+^n, for q ∈ [0, 2]. Of course, this includes the denormalized Shannon entropy as a particular case, corresponding to q = 1. Partial proofs are given by Berg et al. (1984), Topsøe (2000), and Cuturi et al. (2005); we present here a complete proof.

⁴ A function f : 𝒳 → R is called pd (resp. nd) if k : 𝒳 × 𝒳 → R, defined as k(x, y) = f(x + y), is a pd (resp. nd) kernel (Berg et al., 1984).

Proposition 8. For q ∈ [0, 2], the denormalized Tsallis q-entropy S_q is an nd function on R_+^n.

Proof: Since nd kernels are closed under pointwise summation, it suffices to prove that φ_q (see (6)) is nd on R_+. For q ≠ 1, φ_q(y) = (q − 1)^{−1}(y − y^q). If q ∈ [0, 1), φ_q equals −ι + ι^q times a positive constant, where ι is the identity (ι(y) = y) on R_+. Since the set of nd functions is closed under sums, we only need to show that −ι and ι^q are nd. That −ι is nd is easily seen from the definition; moreover, since ι is nd and nonnegative, Prop. 6 implies that ι^q is also nd. If q ∈ (1, 2], φ_q equals ι − ι^q times a positive constant, so it remains to show that −ι^q is nd for q ∈ (1, 2]; since k(x, y) = −(x + y)^q is nd (Prop. 7), so is −ι^q. For q = 1, since the set of nd functions is closed under limits and

  φ_1(x) = −x ln x = lim_{q→1} (−x^q ln_q x) = lim_{q→1} φ_q(x),

it follows that φ_1 is nd. ∎

The following proposition, proved by Berg et al. (1984), will also be used below.

Proposition 9. The function y ↦ y^{−q}, defined on R_{++}, is pd for q ∈ [0, 1].

We now present the main contribution of this section, the family of weighted JT kernels, generalizing the JS divergence kernels in two ways: (i) they apply to unnormalized measures (equivalently, they allow weighting the arguments differently); (ii) they extend the MI nature of the JS divergence kernel to the NE case.

Definition 3 (weighted Jensen-Tsallis kernels). The kernel k̃_q : (R_+^n)² → R is defined as

  k̃_q(µ_1, µ_2) = k̃_q(ω_1 p_1, ω_2 p_2) ≡ (S_q(π) − T_q^π(p_1, p_2)) (ω_1 + ω_2)^q,

where p_1 = µ_1/ω_1 and p_2 = µ_2/ω_2 are the normalized counterparts of µ_1 and µ_2, with corresponding weights ω_1, ω_2 ∈ R_+, and π = (ω_1/(ω_1 + ω_2), ω_2/(ω_1 + ω_2)). The kernel k_q : (R_+^n)² → R is defined as

  k_q(µ_1, µ_2) = k_q(ω_1 p_1, ω_2 p_2) ≡ S_q(π) − T_q^π(p_1, p_2).

Recalling (16), notice that S_q(Y) − I_q(X; Y) = S_q(Y|X) can be interpreted as the Tsallis posterior conditional entropy. Hence, k_q can be seen (in Bayesian classification terms) as a NE expected measure of uncertainty in correctly identifying the class, given the prior π = (π_1, π_2) and a random sample from the mixture distribution π_1 p_1 + π_2 p_2. The more similar the two distributions are, the greater this uncertainty.

Proposition 10. The kernel k̃_q is pd, for q ∈ [0, 2]. The kernel k_q is pd, for q ∈ [0, 1].

Proof: With µ_1 = ω_1 p_1 and µ_2 = ω_2 p_2 and using the denormalization formula (7), we obtain

  k̃_q(µ_1, µ_2) = −S_q(µ_1 + µ_2) + S_q(µ_1) + S_q(µ_2).

Now invoke Prop. 3 with ψ = S_q (which is nd by Prop. 8), x = µ_1, y = µ_2, and x_0 = 0 (the null measure). Observe now that k_q(µ_1, µ_2) = k̃_q(µ_1, µ_2)(ω_1 + ω_2)^{−q}. Since the product of two pd kernels is a pd kernel and (Prop. 9) (ω_1 + ω_2)^{−q} is a pd kernel for q ∈ [0, 1], k_q is pd. ∎

As we can see, the weighted JT kernels have two inherent properties: they are parameterized by the entropic index q, and they allow their arguments to be unbalanced, i.e., to have different weights ω_i. We now mention some instances of kernels where each of these degrees of freedom is suppressed.

Weighted JS Kernel. Setting q = 1, we obtain an extensive subfamily that contains unbalanced versions of the JS kernel (Hein & Bousquet, 2005). Namely, we get the pd kernels

  k̃_1(µ_1, µ_2) = [H(π) − J_H^π(p_1, p_2)] (ω_1 + ω_2),
  k_1(µ_1, µ_2) = H(π) − J_H^π(p_1, p_2).   (21)
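The next sketch (ours; it assumes the tsallis_entropy and jensen_tsallis_q_difference helpers defined in the earlier sketches) implements the weighted Jensen-Tsallis kernel k̃_q of Definition 3 directly from its definition and checks numerically the identity used in the proof of Prop. 10, namely k̃_q(µ_1, µ_2) = −S_q(µ_1 + µ_2) + S_q(µ_1) + S_q(µ_2).

```python
import numpy as np

def weighted_jt_kernel(mu1, mu2, q):
    """Weighted Jensen-Tsallis kernel k~_q of Definition 3 (sketch; reuses helpers above)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    w1, w2 = mu1.sum(), mu2.sum()
    p1, p2 = mu1 / w1, mu2 / w2
    pi = np.array([w1, w2]) / (w1 + w2)
    S_pi = tsallis_entropy(pi, q)
    T_q = jensen_tsallis_q_difference([p1, p2], pi, q)
    return (S_pi - T_q) * (w1 + w2) ** q

# Identity used in the proof of Prop. 10:
#   k~_q(mu1, mu2) = -S_q(mu1 + mu2) + S_q(mu1) + S_q(mu2).
mu1 = np.array([2.0, 1.0, 0.5])
mu2 = np.array([0.5, 3.0, 1.5])
for q in [0.5, 1.0, 1.7]:
    lhs = weighted_jt_kernel(mu1, mu2, q)
    rhs = -tsallis_entropy(mu1 + mu2, q) + tsallis_entropy(mu1, q) + tsallis_entropy(mu2, q)
    assert np.isclose(lhs, rhs)
```

The identity also makes the cost of evaluating k̃_q explicit: three Tsallis entropies of (unnormalized) vectors.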
Exponentiated Weighted JS Kernel. Using Prop. 4, we have that the exponentiated weighted JS kernel k_EWJS : (R_+^n)² → R,

  k_EWJS(µ_1, µ_2) ≡ exp[t k_1(µ_1, µ_2)] = exp(t H(π)) exp[−t J_H^π(p_1, p_2)],   (22)

is also pd for any t > 0. This generalizes the exponentiated JS kernel k_EJS(p_1, p_2) ≡ exp[−t JS(p_1, p_2)] (Cuturi et al., 2005).

We now keep q ∈ [0, 2] but consider the weighted JT kernel family restricted to normalized measures, k̃_q|_{(P^{n−1})²}. This corresponds to setting uniform weights (ω_1 = ω_2 = 1/2); note that in this case k̃_q and k_q collapse into the same kernel,

  k_q(p_1, p_2) = k̃_q(p_1, p_2) = ln_q(2) − T_q(p_1, p_2).   (23)

Prop. 10 tells us that these kernels are pd for q ∈ [0, 2]. Remarkably, we recover three well-known particular cases for q ∈ {0, 1, 2}.

Jensen-Shannon kernel (JSK). For q = 1, we obtain the JS kernel, k_JS : (P^{n−1})² → R,

  k_JS(p_1, p_2) = ln(2) − JS(p_1, p_2),   (24)

introduced and shown pd by Hein and Bousquet (2005).

Boolean kernel. For q = 0, we obtain the kernel k_0 = k_Bool : (P^{n−1})² → R,

  k_Bool(p_1, p_2) = ‖p_1 ∘ p_2‖_0.   (25)

Linear kernel. For q = 2, we obtain the kernel k_2 = k_lin : (P^{n−1})² → R,

  k_lin(p_1, p_2) = (1/2) ⟨p_1, p_2⟩.   (26)

Summarizing, the Boolean, JS, and linear kernels are members of the much wider family of Tsallis kernels, continuously parameterized by q ∈ [0, 2]. Furthermore, the Tsallis kernels are a particular subfamily of the even wider set of weighted Tsallis kernels.

A key feature of our generalization is that the kernels are defined on unnormalized measures. This is relevant for empirical measures (e.g., term counts, image histograms); instead of the usual normalization (Hein & Bousquet, 2005), these empirical measures may be left unnormalized, allowing objects of different sizes to have different weights. Another possibility is the explicit inclusion of weights (ω_i): given an input set of normalized measures, each can be multiplied by an arbitrary (positive) weight before computing the kernel.

5.3. Other Kernels Based on Jensen Differences

Other pd kernels may be devised, inspired by the Jensen-Rényi and Jensen-Tsallis divergences (Sec. 3). For example, it is a direct consequence of Prop. 6 that, for q ∈ [0, 1], (p_1, p_2) ↦ R_q((p_1 + p_2)/2), and therefore JR_q, are nd kernels on (P^{n−1})². We can then make use of Prop. 4 to derive pd kernels via exponentiation; for example, the exponentiated Jensen-Rényi kernel (pd for q ∈ [0, 1] and t ≥ 0),

  k_EJR(p_1, p_2) = exp(−t JR_q(p_1, p_2)) = [ Σ_i ((p_{1i} + p_{2i})/2)^q / ( √(Σ_i p_{1i}^q) √(Σ_i p_{2i}^q) ) ]^{−t/(1−q)}.   (27)

However, these kernels are no longer interpretable as MIs, and arbitrary weights are not allowed. Martins et al. (2008) also show that a related family of pd kernels for probability measures, introduced by Hein and Bousquet (2005), can be written as differences between JT-type divergences.

5.4. The Heat Kernel Approximation

The diffusion kernel for statistical manifolds, recently proposed by Lafferty and Lebanon (2005), is grounded in information geometry. It models the diffusion of "information" over the manifold through the heat equation. Since in the case of the multinomial manifold the diffusion kernel has no closed form, the authors adopt the so-called "first-order parametrix expansion," which resembles the Gaussian kernel, with the Euclidean distance replaced by the geodesic distance induced by the Fisher information metric. The resulting heat kernel approximation is

  k^heat(p_1, p_2) = (4πt)^{−n/2} exp( −(1/(4t)) d_g²(p_1, p_2) ),   (28)

where t > 0 and d_g(p_1, p_2) = 2 arccos( Σ_i √(p_{1i} p_{2i}) ).
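For completeness, here is a sketch of the approximation (28) (ours, with illustrative names); as shown next, this kernel is not guaranteed to be pd.

```python
import numpy as np

def geodesic_distance(p1, p2):
    """Fisher-metric geodesic distance on the simplex: d_g = 2 * arccos(sum_i sqrt(p1_i * p2_i))."""
    bc = np.sum(np.sqrt(p1 * p2))              # Bhattacharyya coefficient
    return 2.0 * np.arccos(np.clip(bc, -1.0, 1.0))

def heat_kernel_approx(p1, p2, t):
    """First-order parametrix approximation (28); not pd in general (see Prop. 11)."""
    n = len(p1)
    d = geodesic_distance(p1, p2)
    return (4.0 * np.pi * t) ** (-n / 2.0) * np.exp(-d ** 2 / (4.0 * t))

p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.3, 0.6])
print(heat_kernel_approx(p1, p2, t=0.5))
```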
Whether k^heat is pd has been an open problem (Hein et al., 2004; Zhang et al., 2005).

Proposition 11. Let n ≥ 2. For sufficiently large t, the kernel k^heat is not pd.

Proof: From Prop. 4, k^heat is pd for all t > 0 if and only if d_g² is nd. We provide a counterexample, using the following four points in P²: p_1 = (1, 0, 0), p_2 = (0, 1, 0), p_3 = (0, 0, 1), and p_4 = (1/2, 1/2, 0). The squared distance matrix [D_{ij}] = [d_g²(p_i, p_j)] is

  D = (π²/4) · [ 0 4 4 1
                 4 0 4 1
                 4 4 0 4
                 1 1 4 0 ].   (29)

Taking c = (−4, −4, 1, 7), which satisfies Σ_i c_i = 0, we have c^T D c = 2π² > 0, showing that D is not nd. Although p_1, p_2, p_3, p_4 lie on the boundary of P², continuity of d_g² implies that it is not nd. The case n > 2 follows easily, by appending zeros to the four vectors above. ∎
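The counterexample in the proof can be checked numerically in a few lines (our own sketch):

```python
import numpy as np

# Numerical check of the counterexample in the proof of Prop. 11.
points = [np.array(p) for p in [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
                                (0.0, 0.0, 1.0), (0.5, 0.5, 0.0)]]

def d_g_squared(p, r):
    """Squared geodesic distance d_g^2(p, r) = (2 arccos sum_i sqrt(p_i r_i))^2."""
    return (2.0 * np.arccos(np.clip(np.sum(np.sqrt(p * r)), 0.0, 1.0))) ** 2

D = np.array([[d_g_squared(p, r) for r in points] for p in points])
c = np.array([-4.0, -4.0, 1.0, 7.0])        # note: the entries of c sum to zero

print(c @ D @ c, 2 * np.pi ** 2)            # both print 2*pi^2 > 0, so d_g^2 is not nd
```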
6. Experiments

We illustrate the performance of the proposed NE kernels, in comparison with common kernels, for SVM text classification. We performed experiments on two standard datasets: Reuters-21578 and WebKB.⁵ Since our objective was to evaluate the kernels, we considered a simple binary classification task that tries to discriminate between the two largest categories of each dataset; this led to the earn-vs-acq classification task for the first dataset, and stud-vs-fac (student vs. faculty webpages) for the second.

⁵ Available at http://www.daviddlewis.com/resources/testcollections and http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data, respectively.

After the usual preprocessing steps of stemming and stop-word removal, we mapped text documents into probability distributions over words using the bag-of-words model and maximum likelihood estimation (which corresponds to normalizing term frequencies using the 1-norm), which we denote by tf. We also used the tf-idf measure, which penalizes terms that occur in many documents. To weight the documents for the Tsallis kernels, we tried four strategies: uniform weighting, word counts, the square root of the word counts, and one plus the logarithm of the word counts; however, for both tasks, uniform weighting proved to be the best strategy, which may be due to the fact that documents in both collections are usually short and do not differ much in size.

As baselines, we used the linear kernel with ℓ2 normalization, commonly used for this task, and the heat kernel approximation (28) (Lafferty & Lebanon, 2005), which is known to outperform the former, albeit not being guaranteed to be pd for an arbitrary choice of t (see (28)), as shown above. This parameter and the SVM C parameter were tuned with cross-validation over the training set. The SVM-Light package (http://svmlight.joachims.org/) was used to solve the SVM quadratic optimization problem.

Figs. 1-2 summarize the results. We report the performance of the Tsallis kernels as a function of the entropic index. For comparison, we also plot the performance of an instance of a Tsallis kernel with q tuned through cross-validation. For the first task, this kernel and the two baselines exhibit similar performance for both the tf and the tf-idf representations; the differences are not statistically significant. In the second task, the Tsallis kernel outperformed the ℓ2-normalized linear kernel for both representations, and the heat kernel for tf-idf; the differences are statistically significant (using the unpaired t test at the 0.05 level). Regarding the influence of the entropic index, we observe that in both tasks the optimal value of q is usually higher for tf-idf than for tf.

The results on these two problems are representative of the typical relative performance of the kernels considered: in almost all tested cases, both the heat kernel and the Tsallis kernels (for a suitable value of q) outperform the ℓ2-normalized linear kernel; the Tsallis kernels are competitive with the heat kernel.

Figure 1. Results for earn-vs-acq using the tf and tf-idf representations (average error rate (%) vs. entropic index q, for WJTK-q, LinL2, Heat (CV), and WJTK (qCV)). The error bars represent ±1 standard deviation over 30 runs. Training (resp. testing) with 200 (resp. 250) samples per class.

Figure 2. Results for stud-vs-fac (same representations, methods, and axes as Figure 1).

7. Conclusion

We have introduced a new family of positive definite kernels between measures, which contains some well-known kernels as particular cases. These kernels are defined on unnormalized measures, which makes them suitable for use on empirical measures (e.g., word counts or pixel intensity histograms), allowing objects of different sizes to be weighted differently. The family is parameterized by the entropic index, a key concept in Tsallis statistics, and includes as extreme cases the Boolean and the linear kernels. The new kernels, and the proofs of positive definiteness, are supported by the other contributions of this paper: the new concept of q-convexity, the underlying Jensen q-inequality, and the concept of the Jensen-Tsallis q-difference, a nonextensive generalization of the Jensen-Shannon divergence.

Experimentally, kernels in this family outperformed the linear kernel in the task of text classification and achieved results similar to the first-order approximation of the multinomial diffusion kernel. They have the advantage, however, of being pd, which fails to hold for the latter kernel, as also shown in this paper. Future research will concern applying Jensen-Tsallis q-differences to other learning problems, such as clustering, possibly exploiting the fact that they accept more than two arguments.

Acknowledgments

The authors thank the reviewers for helpful comments and Guy Lebanon for fruitful discussions on the heat kernel. This work was partially supported by Fundação para a Ciência e Tecnologia (FCT), Portugal, grant PTDC/EEA-TEL/72572/2006. A.M. was supported by a grant from FCT through the CMU-Portugal Program and the Information and Communications Technologies Institute (ICTI) at CMU. N.S. was supported by NSF IIS-0713265 and DARPA HR00110110013. E.X. was supported by NSF DBI-0546594, DBI-0640543, and IIS-0713379.

References

Abe, S. (2006). Foundations of nonextensive statistical mechanics. Chaos, Nonlinearity, Complexity. Springer.

Banerjee, A., Merugu, S., Dhillon, I. S., & Ghosh, J. (2005). Clustering with Bregman divergences. JMLR, 6, 1705–1749.

Berg, C., Christensen, J., & Ressel, P. (1984). Harmonic analysis on semigroups. Springer.

Burbea, J., & Rao, C. (1982). On the convexity of some divergence measures based on entropy functions. IEEE Trans. Inf. Theory, 28, 489–495.

Cover, T., & Thomas, J. (1991). Elements of information theory. Wiley.
Cuturi, M., Fukumizu, K., & Vert, J.-P. (2005). Semigroup kernels on measures. JMLR, 6, 1169–1198.

Endres, D., & Schindelin, J. (2003). A new metric for probability distributions. IEEE Trans. Inf. Theory, 49, 1858–1860.

Fuglede, B. (2005). Spirals in Hilbert space. Expositiones Mathematicae, 25, 23–46.

Furuichi, S. (2006). Information theoretical properties of Tsallis entropies. J. Math. Physics, 47, no. 2.

Gell-Mann, M., & Tsallis, C. (2004). Nonextensive entropy: Interdisciplinary applications. Oxford University Press.

Hamza, A. (2006). A nonextensive information-theoretic measure for image edge detection. Journal of Electronic Imaging, 15(1), 13011.1–13011.8.

Havrda, J., & Charvát, F. (1967). Quantification method of classification processes: Concept of structural α-entropy. Kybernetika, 3, 30–35.

Hein, M., & Bousquet, O. (2005). Hilbertian metrics and positive definite kernels on probability measures. Proc. 10th AISTATS.

Hein, M., Lal, T., & Bousquet, O. (2004). Hilbertian metrics on probability measures and their application in SVMs. DAGM-Symposium, 270–277.

Jebara, T., Kondor, R., & Howard, A. (2004). Probability product kernels. JMLR, 5, 819–844.

Joachims, T. (1997). Text categorization with support vector machines: Learning with many relevant features (Technical Report). Universität Dortmund.

Karakos, D., Khudanpur, S., Eisner, J., & Priebe, C. (2007). Iterative denoising using Jensen-Rényi divergences with an application to unsupervised document categorization. Proc. IEEE ICASSP.

Khinchin, A. (1957). Mathematical foundations of information theory. Dover.

Lafferty, J., & Lebanon, G. (2005). Diffusion kernels on statistical manifolds. JMLR, 6, 129–163.

Li, Y., Fan, X., & Li, G. (2006). Image segmentation based on Tsallis-entropy and Rényi-entropy and their comparison. IEEE Intern. Conf. on Industrial Informatics (pp. 943–948).

Lin, J. (1991). Divergence measures based on Shannon entropy. IEEE Trans. Inf. Theory, 37, 145–151.

Martins, A. F. T., Figueiredo, M. A. T., Aguiar, P. M. Q., Smith, N. A., & Xing, E. P. (2008). Nonextensive entropic kernels (Technical Report CMU-ML-08-106).

Moreno, P., Ho, P., & Vasconcelos, N. (2003). A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. NIPS.

Rényi, A. (1961). On measures of entropy and information. Proc. 4th Berkeley Symp. Math. Statist. and Prob. (pp. 547–561). Univ. Calif. Press.

Schölkopf, B., & Smola, A. (2002). Learning with kernels. MIT Press.

Topsøe, F. (2000). Some inequalities for information divergence and related measures of discrimination. IEEE Trans. Inf. Theory, 46, 1602–1609.

Tsallis, C. (1988). Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Physics, 52, 479–487.

Zhang, D., Chen, X., & Lee, W. (2005). Text classification with kernels on the multinomial manifold. Proc. ACM SIGIR.