Learning with Transformation Invariant Kernels

Christian Walder
Max Planck Institute for Biological Cybernetics
72076 Tübingen, Germany
christian.walder@tuebingen.mpg.de

Olivier Chapelle
Yahoo! Research
Santa Clara, CA
chap@yahoo-inc.com

Abstract

This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thin-plate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale).

1 Introduction

Recent years have seen widespread application of reproducing kernel Hilbert space (r.k.h.s.) based methods to machine learning problems (Schölkopf & Smola, 2002). As a result, kernel methods have been analysed to considerable depth. In spite of this, the aspects which we presently investigate seem to have received insufficient attention, at least within the machine learning community.

The first is transformation invariance of the kernel, a topic touched on in (Fleuret & Sahbi, 2003). Note that we do not mean by this the local invariance (or insensitivity) of an algorithm to application specific transformations which should not affect the class label, such as one pixel image translations (see e.g. (Chapelle & Schölkopf, 2001)). Rather, we are referring to global invariance to transformations, in the way that radial kernels (i.e. those of the form $k(x, y) = \psi(\|x - y\|)$) are invariant to translations. In Sections 2 and 3 we introduce the more general concept of transformation scaledness, focusing on translation, dilation and orthonormal transformations. An interesting result is that there exist no non-trivial p.d. kernel functions which are radial and dilation scaled. There do, however, exist non-trivial c.p.d. kernels with the stated invariances. Motivated by this, we analyse the c.p.d. case in Section 4, giving novel elementary derivations of some key results, most notably a c.p.d. representer theorem. We then give in Section 6.1 an algorithm for applying the s.v.m. with arbitrary c.p.d. kernel functions.

It turns out that this is rather useful in practice, for the following reason. Due to its invariances, the c.p.d. thin-plate kernel which we discuss in Section 5 is not only richly non-linear, but enjoys a duality between the length-scale parameter and the regularisation parameter of Tikhonov regularised solutions such as the s.v.m. In Section 7 we compare the resulting classifier (which has only a regularisation parameter) to the s.v.m. with Gaussian kernel (which has an additional length-scale parameter). The results show that the two algorithms perform roughly as well as one another on a wide range of standard machine learning problems, notwithstanding the new method's advantage in having only one free parameter. In Section 8 we make some concluding remarks.

2 Transformation Scaled Spaces and Tikhonov Regularisation

Definition 2.1. Let $T$ be a bijection on $\mathcal{X}$, and $\mathcal{F}$ a Hilbert space of functions on some non-empty set $\mathcal{X}$, such that $f \mapsto f \circ T$ is a bijection on $\mathcal{F}$.
$\mathcal{F}$ is $T$-scaled if
\[ \langle f, g \rangle_{\mathcal{F}} = g_T(\mathcal{F}) \, \langle f \circ T, g \circ T \rangle_{\mathcal{F}} \quad (1) \]
for all $f, g \in \mathcal{F}$, where $g_T(\mathcal{F}) \in \mathbb{R}^+$ is the norm scaling function associated with the operation of $T$ on $\mathcal{F}$. If $g_T(\mathcal{F}) = 1$ we say that $\mathcal{F}$ is $T$-invariant.

The following clarifies the behaviour of Tikhonov regularised solutions in such spaces.

Lemma 2.2. For any $\Omega: \mathcal{F} \to \mathbb{R}$ and $T$ such that $f \mapsto f \circ T$ is a bijection of $\mathcal{F}$, if the left hand side is unique then
\[ \arg\min_{f \in \mathcal{F}} \Omega(f) = \Big( \arg\min_{f_T \in \mathcal{F}} \Omega(f_T \circ T) \Big) \circ T. \]

Proof. Let $f^* = \arg\min_{f \in \mathcal{F}} \Omega(f)$ and $f_T^* = \arg\min_{f_T \in \mathcal{F}} \Omega(f_T \circ T)$. By definition we have that $\forall g \in \mathcal{F}$, $\Omega(f_T^* \circ T) \le \Omega(g \circ T)$. But since $f \mapsto f \circ T$ is a bijection on $\mathcal{F}$, we also have $\forall g \in \mathcal{F}$, $\Omega(f_T^* \circ T) \le \Omega(g)$. Hence, given the uniqueness, this implies $f^* = f_T^* \circ T$.

The following Corollary follows immediately from Lemma 2.2 and Definition 2.1.

Corollary 2.3. Let the $L_i$ be arbitrary loss functions. If $\mathcal{F}$ is $T$-scaled and the left hand side is unique then
\[ \arg\min_{f \in \mathcal{F}} \|f\|_{\mathcal{F}}^2 + \sum_i L_i(f(x_i)) = \Big( \arg\min_{f \in \mathcal{F}} \|f\|_{\mathcal{F}}^2 / g_T(\mathcal{F}) + \sum_i L_i(f(T x_i)) \Big) \circ T. \]

Corollary 2.3 includes various learning algorithms for various choices of $L_i$ -- for example the s.v.m. with linear hinge loss for $L_i(t) = \max(0, 1 - y_i t)$, and kernel ridge regression for $L_i(t) = (y_i - t)^2$. Let us now introduce the specific transformations we will be considering.

Definition 2.4. Let $W_s$, $T_a$ and $O_A$ be the dilation, translation and orthonormal transformations $\mathbb{R}^d \to \mathbb{R}^d$ defined, for $s \in \mathbb{R} \setminus \{0\}$, $a \in \mathbb{R}^d$ and orthonormal $A: \mathbb{R}^d \to \mathbb{R}^d$, by $W_s x = sx$, $T_a x = x + a$ and $O_A x = Ax$ respectively.

Hence, for an r.k.h.s. which is $W_s$-scaled for arbitrary $s \ne 0$, training an s.v.m. and dilating the resultant decision function by some amount is equivalent to training the s.v.m. on similarly dilated input patterns, but with a regularisation parameter adjusted according to Corollary 2.3. While (Fleuret & Sahbi, 2003) demonstrated this phenomenon for the s.v.m. with a particular kernel, as we have just seen it is easy to demonstrate for the more general Tikhonov regularisation setting with any function norm satisfying our definition of transformation scaledness.

3 Transformation Scaled Reproducing Kernel Hilbert Spaces

We now derive the necessary and sufficient conditions for a reproducing kernel (r.k.) to correspond to an r.k.h.s. which is $T$-scaled. The relationship between $T$-scaled r.k.h.s.'s and their r.k.'s is easy to derive given the uniqueness of the r.k. (Wendland, 2004). It is given by the following novel lemma.

Lemma 3.1 (Transformation scaled r.k.h.s.). The r.k.h.s. $\mathcal{H}$ with r.k. $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, i.e. with $k$ satisfying
\[ \langle k(\cdot, x), f(\cdot) \rangle_{\mathcal{H}} = f(x), \quad (2) \]
is $T$-scaled iff
\[ k(x, y) = g_T(\mathcal{H}) \, k(Tx, Ty). \quad (3) \]

We prove this in the accompanying technical report (Walder & Chapelle, 2007). It is now easy to see that, for example, the homogeneous polynomial kernel $k(x, y) = \langle x, y \rangle^p$ corresponds to a $W_s$-scaled r.k.h.s. $\mathcal{H}$ with $g_{W_s}(\mathcal{H}) = \langle x, y \rangle^p / \langle sx, sy \rangle^p = s^{-2p}$. Hence, when the homogeneous polynomial kernel is used with the hard-margin s.v.m. algorithm, the result is invariant to multiplicative scaling of the training and test data. If the soft-margin s.v.m. is used, however, then the invariance holds only under appropriate scaling (as per Corollary 2.3) of the margin softness parameter (i.e. of the later equation (14)).

We can now show that there exist no non-trivial r.k.h.s.'s with radial kernels that are also $W_s$-scaled for all $s \ne 0$. First, however, we need the following standard result on homogeneous functions.

Lemma 3.2. If $\psi: [0, \infty) \to \mathbb{R}$ and $g: (0, \infty) \to \mathbb{R}$ satisfy $\psi(r) = g(s)\psi(rs)$ for all $r \ge 0$ and $s > 0$, then $\psi(r) = a\delta(r) + b r^p$ and $g(s) = s^{-p}$, where $a, b, p \in \mathbb{R}$ and $\delta$ is Dirac's delta function.

We prove this in the accompanying technical report (Walder & Chapelle, 2007).
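Before applying these results, here is a quick numerical illustration of Lemma 3.1. For the homogeneous polynomial kernel above we found $g_{W_s}(\mathcal{H}) = s^{-2p}$; the following minimal Python sketch (our own illustrative addition, not part of the paper, with arbitrary function names and random data) checks relation (3) for that case.

```python
import numpy as np

def poly_kernel(x, y, p=3):
    """Homogeneous polynomial kernel k(x, y) = <x, y>^p."""
    return np.dot(x, y) ** p

rng = np.random.default_rng(0)
d, p, s = 5, 3, 2.7          # dimension, degree and dilation factor (arbitrary choices)
x, y = rng.normal(size=d), rng.normal(size=d)

g_ws = s ** (-2 * p)          # claimed norm scaling function g_{W_s}(H) = s^{-2p}
lhs = poly_kernel(x, y, p)
rhs = g_ws * poly_kernel(s * x, s * y, p)
assert np.allclose(lhs, rhs)  # k(x, y) = g_{W_s}(H) k(W_s x, W_s y), as in (3)
```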
Now, suppose that $\mathcal{H}$ is an r.k.h.s. with r.k. $k$ on $\mathbb{R}^d \times \mathbb{R}^d$. If $\mathcal{H}$ is $T_a$-invariant for all $a \in \mathbb{R}^d$ then
\[ k(x, y) = k(T_{-y} x, T_{-y} y) = k(x - y, 0) \equiv \psi_T(x - y). \]
If in addition to this $\mathcal{H}$ is $O_A$-invariant for all orthogonal $A$, then by choosing $A$ such that $A(x - y) = \|x - y\| \, \hat{e}$, where $\hat{e}$ is an arbitrary unit vector in $\mathbb{R}^d$, we have
\[ k(x, y) = k(O_A x, O_A y) = \psi_T(O_A(x - y)) = \psi_T(\|x - y\| \, \hat{e}) \equiv \psi_{OT}(\|x - y\|), \]
i.e. $k$ is radial. All of this is straightforward, and a similar analysis can be found in (Wendland, 2004). Indeed the widely used Gaussian kernel satisfies both of the above invariances. But if we now also assume that $\mathcal{H}$ is $W_s$-scaled for all $s \ne 0$ -- this time with arbitrary $g_{W_s}(\mathcal{H})$ -- then
\[ k(x, y) = g_{W_s}(\mathcal{H}) \, k(W_s x, W_s y) = g_{W_{|s|}}(\mathcal{H}) \, \psi_{OT}(|s| \, \|x - y\|), \]
so that letting $r = \|x - y\|$ we have that $\psi_{OT}(r) = g_{W_{|s|}}(\mathcal{H}) \, \psi_{OT}(|s| r)$, and hence by Lemma 3.2 that $\psi_{OT}(r) = a\delta(r) + b r^p$ where $a, b, p \in \mathbb{R}$. This is positive semi-definite for the trivial case $p = 0$, but there are various ways of showing that it cannot be non-trivially positive semi-definite for $p \ne 0$. One simple way is to consider two arbitrary vectors $x_1$ and $x_2$ such that $\|x_1 - x_2\| = d > 0$. For the corresponding Gram matrix
\[ K = \begin{pmatrix} a & b d^p \\ b d^p & a \end{pmatrix} \]
to be positive semi-definite we require $0 \le \det(K) = a^2 - b^2 d^{2p}$, but for arbitrary $d > 0$ and $a < \infty$ this implies $b = 0$. This may seem disappointing, but fortunately there do exist c.p.d. kernel functions with the stated properties, such as the thin-plate kernel. We discuss this case in detail in Section 5, after the following particularly elementary and in part novel introduction to c.p.d. kernels.

4 Conditionally Positive Definite Kernels

In the last Section we alluded to c.p.d. kernel functions -- these are given by the following definition.

Definition 4.1. A continuous function $\phi: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is conditionally positive definite with respect to (w.r.t.) the linear space of functions $P$ if, for all $m \in \mathbb{N}$, all $\{x_i\}_{i=1 \ldots m} \subset \mathcal{X}$, and all $\alpha \in \mathbb{R}^m \setminus \{0\}$ satisfying $\sum_{j=1}^m \alpha_j p(x_j) = 0$ for all $p \in P$, the following holds:
\[ \sum_{j,k=1}^m \alpha_j \alpha_k \phi(x_j, x_k) > 0. \quad (4) \]

Due to the positivity condition (4) -- as opposed to one of non-negativity -- we are referring to c.p.d. rather than conditionally positive semi-definite kernels. The c.p.d. case is more technical than the p.d. case. We provide a minimalistic discussion here -- for more details we recommend e.g. (Wendland, 2004). To avoid confusion, let us note in passing that while the above definition is quite standard (see e.g. (Wendland, 2004; Wahba, 1990)), many authors in the machine learning community use a definition of c.p.d. kernels which corresponds to our definition when $P = \{1\}$ (e.g. (Schölkopf & Smola, 2002)) or when $P$ is taken to be the space of polynomials of some fixed maximum degree (e.g. (Smola et al., 1998)). Let us now adopt the notation $P^{\perp}(x_1, \ldots, x_m)$ for the set $\{ \alpha \in \mathbb{R}^m : \sum_{i=1}^m \alpha_i p(x_i) = 0 \text{ for all } p \in P \}$. The c.p.d. kernels of Definition 4.1 naturally define a Hilbert space of functions, as per the following definition.

Definition 4.2. Let $\phi: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a c.p.d. kernel w.r.t. $P$. We define $\mathcal{F}_{\phi}(\mathcal{X})$ to be the Hilbert space of functions which is the completion of the set
\[ \Big\{ \sum_{j=1}^m \alpha_j \phi(\cdot, x_j) : m \in \mathbb{N},\ x_1, \ldots, x_m \in \mathcal{X},\ \alpha \in P^{\perp}(x_1, \ldots, x_m) \Big\}, \]
which, due to the definition of $\phi$, we may endow with the inner product
\[ \Big\langle \sum_{j=1}^m \alpha_j \phi(\cdot, x_j), \sum_{k=1}^n \beta_k \phi(\cdot, y_k) \Big\rangle_{\mathcal{F}_{\phi}(\mathcal{X})} = \sum_{j=1}^m \sum_{k=1}^n \alpha_j \beta_k \phi(x_j, y_k). \quad (5) \]

Note that $\phi$ is not the r.k. of $\mathcal{F}_{\phi}(\mathcal{X})$ -- in general $\phi(x, \cdot)$ does not even lie in $\mathcal{F}_{\phi}(\mathcal{X})$. For the remainder of this Section we develop a c.p.d. analog of the representer theorem.
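Before doing so, note that condition (4) is easy to check numerically for a given kernel and point set. The following sketch (our own sanity check, not from the paper; it uses the well-known example $\phi(x, y) = -\|x - y\|$, which is c.p.d. w.r.t. $P = \mathrm{span}\{1\}$) verifies that the Gram matrix is positive definite on the subspace $P^{\perp}(x_1, \ldots, x_m)$.

```python
import numpy as np

def phi(x, y):
    """-||x - y||: a classical example of a kernel that is c.p.d. w.r.t. P = span{1}."""
    return -np.linalg.norm(x - y)

rng = np.random.default_rng(0)
m, d = 30, 4
X = rng.normal(size=(m, d))
G = np.array([[phi(xi, xj) for xj in X] for xi in X])

# Basis for P^perp(x_1, ..., x_m): vectors alpha with sum_i alpha_i * 1 = 0,
# obtained as the null space of the 1 x m matrix of p(x_i) values (here all ones).
P = np.ones((m, 1))
_, _, vt = np.linalg.svd(P.T)
N = vt[1:].T                      # columns span {alpha : P^T alpha = 0}

# Definition 4.1 requires alpha^T G alpha > 0 for all nonzero alpha in P^perp,
# i.e. the projected Gram matrix N^T G N must be positive definite.
eigs = np.linalg.eigvalsh(N.T @ G @ N)
print(eigs.min() > 0)             # expected: True (up to numerical precision)
```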
We begin with the following lemma.

Lemma 4.3. Let $\phi: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a c.p.d. kernel w.r.t. $P$ and $p_1, \ldots, p_r$ a basis for $P$. For any $\{(x_1, y_1), \ldots, (x_m, y_m)\} \subset \mathcal{X} \times \mathbb{R}$, there exists an $s = s_{\mathcal{F}_{\phi}(\mathcal{X})} + s_P$, where $s_{\mathcal{F}_{\phi}(\mathcal{X})} = \sum_{j=1}^m \alpha_j \phi(\cdot, x_j) \in \mathcal{F}_{\phi}(\mathcal{X})$ and $s_P = \sum_{k=1}^r \beta_k p_k \in P$, such that $s(x_i) = y_i,\ i = 1 \ldots m$.

A simple and elementary proof (which shows that (17) is solvable when $\lambda = 0$) is given in (Wendland, 2004) and reproduced in the accompanying technical report (Walder & Chapelle, 2007). Note that although such an interpolating function $s$ always exists, it need not be unique. The distinguishing property of an interpolant of this form, as we show next, is that the norm of the part which lies in $\mathcal{F}_{\phi}(\mathcal{X})$ is minimal.

Definition 4.4. Let $\phi: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a c.p.d. kernel w.r.t. $P$. We use the notation $\mathcal{P}(P)$ to denote the projection $\mathcal{F}_{\phi}(\mathcal{X}) \oplus P \to \mathcal{F}_{\phi}(\mathcal{X})$.

Note that $\mathcal{F}_{\phi}(\mathcal{X}) \oplus P$ is indeed a direct sum, since $p = \sum_{j=1}^m \gamma_j \phi(\cdot, z_j) \in P \cap \mathcal{F}_{\phi}(\mathcal{X})$ implies
\[ \|p\|^2_{\mathcal{F}_{\phi}(\mathcal{X})} = \langle p, p \rangle_{\mathcal{F}_{\phi}(\mathcal{X})} = \sum_{i=1}^m \sum_{j=1}^m \gamma_i \gamma_j \phi(z_i, z_j) = \sum_{j=1}^m \gamma_j p(z_j) = 0, \]
i.e. $p = 0$. Hence, returning to the main thread, we have the following lemma -- our proof of which seems to be novel and particularly elementary.

Lemma 4.5. Denote by $\phi: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a c.p.d. kernel w.r.t. $P$ and by $p_1, \ldots, p_r$ a basis for $P$. Consider an arbitrary function $s = s_{\mathcal{F}_{\phi}(\mathcal{X})} + s_P$ with $s_{\mathcal{F}_{\phi}(\mathcal{X})} = \sum_{j=1}^m \alpha_j \phi(\cdot, x_j) \in \mathcal{F}_{\phi}(\mathcal{X})$ and $s_P = \sum_{k=1}^r \beta_k p_k \in P$. Then $\|\mathcal{P}(P)s\|_{\mathcal{F}_{\phi}(\mathcal{X})} \le \|\mathcal{P}(P)f\|_{\mathcal{F}_{\phi}(\mathcal{X})}$ holds for all $f \in \mathcal{F}_{\phi}(\mathcal{X}) \oplus P$ satisfying
\[ f(x_i) = s(x_i), \quad i = 1 \ldots m. \quad (6) \]

Proof. Let $f$ be an arbitrary element of $\mathcal{F}_{\phi}(\mathcal{X}) \oplus P$. We can always write $f$ as
\[ f = \sum_{j=1}^m (\alpha_j + a_j) \phi(\cdot, x_j) + \sum_{l=1}^n b_l \phi(\cdot, z_l) + \sum_{k=1}^r c_k p_k. \]
If we define (square brackets with subscripts denote matrix elements, and colons denote entire rows or columns) $[P_x]_{i,j} = p_j(x_i)$, $[P_z]_{i,j} = p_j(z_i)$, $[\Phi_{xx}]_{i,j} = \phi(x_i, x_j)$, $[\Phi_{xz}]_{i,j} = \phi(x_i, z_j) = [\Phi_{zx}]_{j,i}$ and $[\Phi_{zz}]_{i,j} = \phi(z_i, z_j)$, then the condition (6) can hence be written
\[ P_x \beta = \Phi_{xx} a + \Phi_{xz} b + P_x c, \quad (7) \]
and the definition of $\mathcal{F}_{\phi}(\mathcal{X})$ requires that e.g. $\alpha \in P^{\perp}(x_1, \ldots, x_m)$, hence implying the constraints
\[ P_x^{\top} \alpha = 0 \quad \text{and} \quad P_x^{\top}(\alpha + a) + P_z^{\top} b = 0. \quad (8) \]
The inequality to be demonstrated is then $L \le R$, where
\[ L = \alpha^{\top} \Phi_{xx} \alpha \quad \text{and} \quad R = \begin{pmatrix} \alpha + a \\ b \end{pmatrix}^{\top} \begin{pmatrix} \Phi_{xx} & \Phi_{xz} \\ \Phi_{zx} & \Phi_{zz} \end{pmatrix} \begin{pmatrix} \alpha + a \\ b \end{pmatrix}. \quad (9) \]
By expanding
\[ R = L + \underbrace{\begin{pmatrix} a \\ b \end{pmatrix}^{\top} \begin{pmatrix} \Phi_{xx} & \Phi_{xz} \\ \Phi_{zx} & \Phi_{zz} \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix}}_{Q_1} + 2 \underbrace{\begin{pmatrix} \alpha \\ 0 \end{pmatrix}^{\top} \begin{pmatrix} \Phi_{xx} & \Phi_{xz} \\ \Phi_{zx} & \Phi_{zz} \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix}}_{Q_2}, \]
it follows from (8) that $P_x^{\top} a + P_z^{\top} b = 0$, and since $\phi$ is c.p.d. w.r.t. $P$, that $Q_1 \ge 0$. But (7) and (8) imply that $Q_2 = 0$, since
\[ Q_2 = \alpha^{\top} \Phi_{xx} a + \alpha^{\top} \Phi_{xz} b = \alpha^{\top} \big( P_x (\beta - c) - \Phi_{xz} b \big) + \alpha^{\top} \Phi_{xz} b = (P_x^{\top} \alpha)^{\top} (\beta - c) = 0. \]
Hence $L \le R$.

Using these results it is now easy to prove an analog of the representer theorem for the p.d. case.

Theorem 4.6 (Representer theorem for the c.p.d. case). Denote by $\phi: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a c.p.d. kernel w.r.t. $P$, by $\Theta$ a strictly monotonically increasing real-valued function on $[0, \infty)$, and by $c: \mathbb{R}^m \to \mathbb{R} \cup \{\infty\}$ an arbitrary cost function. There exists a minimiser over $\mathcal{F}_{\phi}(\mathcal{X}) \oplus P$ of
\[ W(f) = c\big(f(x_1), \ldots, f(x_m)\big) + \Theta\big( \|\mathcal{P}(P)f\|^2_{\mathcal{F}_{\phi}(\mathcal{X})} \big) \quad (10) \]
which admits the form $\sum_{i=1}^m \alpha_i \phi(\cdot, x_i) + p$, where $p \in P$.

Proof. Let $f^*$ be a minimiser of $W$. Let $s = \sum_{i=1}^m \alpha_i \phi(\cdot, x_i) + p$ satisfy $s(x_i) = f^*(x_i),\ i = 1 \ldots m$. By Lemma 4.3 we know that such an $s$ exists. But by Lemma 4.5, $\|\mathcal{P}(P)s\|^2_{\mathcal{F}_{\phi}(\mathcal{X})} \le \|\mathcal{P}(P)f^*\|^2_{\mathcal{F}_{\phi}(\mathcal{X})}$. As a result, $W(s) \le W(f^*)$ and $s$ is a minimiser of $W$ with the stated form.

5 Thin-Plate Regulariser

Definition 5.1. The $m$-th order thin-plate kernel $\phi_m: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is given by
\[ \phi_m(x, y) = \begin{cases} (-1)^{m-(d-2)/2} \, \|x - y\|^{2m-d} \log(\|x - y\|) & \text{if } d \in 2\mathbb{N}, \\ (-1)^{m-(d-1)/2} \, \|x - y\|^{2m-d} & \text{if } d \in (2\mathbb{N} - 1), \end{cases} \quad (11) \]
for $x \ne y$, and zero otherwise. $\phi_m$ is c.p.d. with respect to $\pi_{m-1}(\mathbb{R}^d)$, the set of $d$-variate polynomials of degree at most $m - 1$.
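For concreteness, (11) can be written down directly. The following is a minimal sketch of our own (the function name and defaults are arbitrary, and no claim is made about efficient evaluation of full Gram matrices).

```python
import numpy as np

def thin_plate_kernel(x, y, m=2):
    """m-th order thin-plate kernel phi_m(x, y) of equation (11)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    d = x.shape[-1]
    r = np.linalg.norm(x - y)
    if r == 0.0:
        return 0.0                                   # phi_m(x, x) = 0 by definition
    if d % 2 == 0:                                   # d even: radial part r^(2m-d) log r
        return (-1.0) ** (m - (d - 2) // 2) * r ** (2 * m - d) * np.log(r)
    else:                                            # d odd: radial part r^(2m-d)
        return (-1.0) ** (m - (d - 1) // 2) * r ** (2 * m - d)

# For m = 2, d = 2 this reduces to the familiar ||x - y||^2 log ||x - y||,
# which is c.p.d. w.r.t. pi_1(R^2) (constant and linear polynomials).
print(thin_plate_kernel([0.0, 0.0], [1.0, 2.0]))
```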
The kernel induces the following norm on the space $\mathcal{F}_{\phi_m}(\mathbb{R}^d)$ of Definition 4.2 (this is not obvious -- see e.g. (Wendland, 2004; Wahba, 1990)):
\[ \langle f, g \rangle_{\mathcal{F}_{\phi_m}(\mathbb{R}^d)} = \langle \Upsilon f, \Upsilon g \rangle_{L_2(\mathbb{R}^d)} = \sum_{i_1=1}^d \cdots \sum_{i_m=1}^d \int_{x_1=-\infty}^{\infty} \cdots \int_{x_d=-\infty}^{\infty} \frac{\partial^m f}{\partial x_{i_1} \cdots \partial x_{i_m}} \, \frac{\partial^m g}{\partial x_{i_1} \cdots \partial x_{i_m}} \, \mathrm{d}x_1 \ldots \mathrm{d}x_d, \]
where $\Upsilon: \mathcal{F}_{\phi_m}(\mathbb{R}^d) \to L_2(\mathbb{R}^d)$ is a regularisation operator, implicitly defined above. Clearly $g_{O_A}(\mathcal{F}_{\phi_m}(\mathbb{R}^d)) = g_{T_a}(\mathcal{F}_{\phi_m}(\mathbb{R}^d)) = 1$. Moreover, from the chain rule we have
\[ \frac{\partial^m (f \circ W_s)}{\partial x_{i_1} \cdots \partial x_{i_m}} = s^m \Big( \frac{\partial^m f}{\partial x_{i_1} \cdots \partial x_{i_m}} \Big) \circ W_s, \quad (12) \]
and therefore, since $\langle f, g \rangle_{L_2(\mathbb{R}^d)} = s^d \langle f \circ W_s, g \circ W_s \rangle_{L_2(\mathbb{R}^d)}$, we can immediately write
\[ \langle \Upsilon (f \circ W_s), \Upsilon (g \circ W_s) \rangle_{L_2(\mathbb{R}^d)} = s^{2m} \langle (\Upsilon f) \circ W_s, (\Upsilon g) \circ W_s \rangle_{L_2(\mathbb{R}^d)} = s^{2m-d} \langle \Upsilon f, \Upsilon g \rangle_{L_2(\mathbb{R}^d)}, \quad (13) \]
so that $g_{W_s}(\mathcal{F}_{\phi_m}(\mathbb{R}^d)) = s^{-(2m-d)}$. Note that although it may appear that this can be shown more easily using (11) and an argument similar to Lemma 3.1, the process is actually more involved due to the log factor in the first case of (11), and it is necessary to use the fact that the kernel is c.p.d. w.r.t. $\pi_{m-1}(\mathbb{R}^d)$. Since this is redundant and not central to the paper, we omit the details.
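To see the practical upshot of $g_{W_s}(\mathcal{F}_{\phi_m}(\mathbb{R}^d)) = s^{-(2m-d)}$, the following minimal sketch (our own illustration, not from the paper; the helper names and toy data are assumptions) combines it with Corollary 2.3 for a regularised squared-loss fit -- i.e. kernel ridge regression, solved via the linear system that appears later as (17) -- using the second order thin-plate in $d = 2$, where $2m - d = 2$. Fitting the dilated inputs $s x_i$ with regularisation $\lambda s^{2}$ and composing with $W_s$ should reproduce the fit to the original inputs with regularisation $\lambda$.

```python
import numpy as np

def thin_plate(A, B):
    """Second order thin-plate kernel phi_2(x, y) = ||x - y||^2 log ||x - y|| for d = 2."""
    r = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(r > 0, r**2 * np.log(r), 0.0)

def fit(X, y, lam):
    """Regularised squared-loss fit over F_phi (+) P, via the linear system (17)."""
    n = len(X)
    Phi = thin_plate(X, X)
    P = np.hstack([np.ones((n, 1)), X])           # basis of pi_1(R^2): 1, x_1, x_2
    A = np.block([[Phi + lam * np.eye(n), P],
                  [P.T, np.zeros((3, 3))]])
    sol = np.linalg.solve(A, np.concatenate([y, np.zeros(3)]))
    alpha, beta = sol[:n], sol[n:]
    return lambda T: thin_plate(T, X) @ alpha + np.hstack([np.ones((len(T), 1)), T]) @ beta

rng = np.random.default_rng(0)
X = rng.uniform(size=(40, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=40)
Z = rng.uniform(size=(10, 2))                     # test inputs

s, lam = 3.0, 0.1
f_a = fit(X, y, lam)                              # original inputs, regularisation lam
f_b = fit(s * X, y, lam * s**2)                   # dilated inputs, lam / g_{W_s} = lam * s^(2m-d)
print(np.allclose(f_a(Z), f_b(s * Z)))            # expected: True, as per Corollary 2.3
```

This is exactly the sense in which, for the thin-plate, the multiplicative scale of the data and the regularisation parameter are interchangeable.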
6 Conditionally Positive Definite s.v.m.

In Section 3 we showed that non-trivial kernels which are both radial and dilation scaled cannot be p.d., but rather only c.p.d. It is therefore somewhat surprising that the s.v.m. -- one of the most widely used kernel algorithms -- has been applied only with p.d. kernels, or kernels which are c.p.d. with respect only to $P = \{1\}$ (see e.g. (Boughorbel et al., 2005)). After all, it seems interesting to construct a classifier independent not only of the absolute positions of the input data, but also of their absolute multiplicative scale. Hence we propose using the thin-plate kernel with the s.v.m., by minimising the s.v.m. objective over the space $\mathcal{F}_{\phi}(\mathcal{X}) \oplus P$ (or in some cases just over $\mathcal{F}_{\phi}(\mathcal{X})$, as we shall see in Section 6.2). For this we require somewhat non-standard s.v.m. optimisation software. The method we propose seems simpler and more robust than previously mentioned solutions. For example, (Smola et al., 1998) mentions the numerical instabilities which may arise with the direct application of standard solvers.

Dataset   Gaussian         Thin-Plate       dim/n
banana    10.567 (0.547)   10.667 (0.586)    2/3000
breast    26.574 (2.259)   28.026 (2.900)    9/263
diabetes  23.578 (0.989)   23.452 (1.215)    8/768
flare     36.143 (0.969)   38.190 (2.317)    9/144
german    24.700 (1.453)   24.800 (1.373)   20/1000
heart     17.407 (2.142)   17.037 (2.290)   13/270
image      3.210 (0.504)    1.867 (0.338)   18/2086
ringnm     1.533 (0.229)    1.833 (0.200)   20/3000
splice     8.931 (0.640)    8.651 (0.433)   60/2844
thyroid    4.199 (1.087)    3.247 (1.211)    5/215
twonm      1.833 (0.194)    1.867 (0.254)   20/3000
wavefm     8.333 (0.378)    8.233 (0.484)   21/3000

Table 1: Comparison of the Gaussian and thin-plate kernels with the s.v.m. on the UCI data sets. Results are reported as "mean % classification error (standard error)". dim is the input dimension and n the total number of data points. A star in the n column means that more examples were available but we kept only a maximum of 2000 per class, in order to reduce the computational burden of the extensive number of cross validation and model selection training runs (see Section 7). None of the data sets were linearly separable, so we always used the normal ($\beta$ unconstrained) version of the optimisation described in Section 6.1.

6.1 Optimising an s.v.m. with c.p.d. Kernel

It is simple to implement an s.v.m. with a kernel which is c.p.d. w.r.t. an arbitrary finite dimensional space of functions $P$, by extending the primal optimisation approach of (Chapelle, 2007) to the c.p.d. case. The quadratic loss s.v.m. solution can be formulated as
\[ f_{\mathrm{svm}} = \arg\min_{f \in \mathcal{F}_{\phi}(\mathcal{X}) \oplus P} \ \lambda \|\mathcal{P}(P) f\|^2_{\mathcal{F}_{\phi}(\mathcal{X})} + \sum_{i=1}^n \max(0, 1 - y_i f(x_i))^2. \quad (14) \]
Note that for the second order thin-plate case we have $\mathcal{X} = \mathbb{R}^d$ and $P = \pi_1(\mathbb{R}^d)$ (the space of constant and first order polynomials). Hence $\dim(P) = d + 1$ and we can take the basis to be $p_j(x) = [x]_j$ for $j = 1 \ldots d$, along with $p_{d+1} = 1$. It follows immediately from Theorem 4.6 that, letting $p_1, p_2, \ldots, p_{\dim(P)}$ span $P$, the solution to (14) is given by $f_{\mathrm{svm}}(x) = \sum_{i=1}^n \alpha_i \phi(x_i, x) + \sum_{j=1}^{\dim(P)} \beta_j p_j(x)$. Now, if we consider only the margin violators -- those vectors which (at a given step of the optimisation process) satisfy $y_i f(x_i) < 1$ -- we can replace the $\max(0, \cdot)$ in (14) with $(\cdot)$. This is equivalent to making a local second order approximation. Hence, by repeatedly solving in this way while updating the set of margin violators, we will have implemented a so-called Newton optimisation. Now, since
\[ \|\mathcal{P}(P) f_{\mathrm{svm}}\|^2_{\mathcal{F}_{\phi}(\mathcal{X})} = \sum_{i,j=1}^n \alpha_i \alpha_j \phi(x_i, x_j), \quad (15) \]
the local approximation of the problem is to minimise, in $\alpha$ and $\beta$,
\[ \lambda \alpha^{\top} \Phi \alpha + \|\Phi \alpha + P \beta - y\|^2, \quad \text{subject to} \quad P^{\top} \alpha = 0, \quad (16) \]
where $[\Phi]_{i,j} = \phi(x_i, x_j)$, $[P]_{j,k} = p_k(x_j)$, and we have assumed for simplicity that all vectors violate the margin. The solution in this case is given by (Wahba, 1990)
\[ \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} \Phi + \lambda I & P \\ P^{\top} & 0 \end{pmatrix}^{-1} \begin{pmatrix} y \\ 0 \end{pmatrix}. \quad (17) \]
In practice it is essential that one makes a change of variable for $\beta$, in order to avoid the numerical problems which arise when $P$ is rank deficient or numerically close to it. In particular we make the QR factorisation (Golub & Van Loan, 1996) $P = QR$, where $Q^{\top} Q = I$ and $R$ is square. We then solve for $\alpha$ and $\gamma = R\beta$. As a final step at the end of the optimisation process, we take the minimum norm solution of the system $\gamma = R\beta$, namely $\beta = R^{\#}\gamma$, where $R^{\#}$ is the pseudo-inverse of $R$. Note that although (17) is standard for squared loss regression models with c.p.d. kernels, our use of it in optimising the s.v.m. is new. The precise algorithm is given in (Walder & Chapelle, 2007), where we also detail two efficient factorisation techniques specific to the new s.v.m. setting. Moreover, the method we present in Section 6.2 deviates considerably further from the existing literature.
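Before moving on, the core linear algebra of a single step can be sketched in a few lines (a sketch of our own, under the "all vectors violate the margin" simplification of (17); the function name and toy data are assumptions, and the violator-set updates and efficient factorisations of (Walder & Chapelle, 2007) are omitted).

```python
import numpy as np

def cpd_svm_step(Phi, P, y, lam):
    """One Newton-type step under the 'all points violate the margin' simplification,
    i.e. the linear system (17), solved via the QR change of variable gamma = R beta."""
    n, r = P.shape
    Q, R = np.linalg.qr(P)                       # P = Q R with Q^T Q = I, R square
    # Since P beta = Q gamma and P^T alpha = R^T Q^T alpha, (17) may be rewritten in
    # the better conditioned variables (alpha, gamma):
    A = np.block([[Phi + lam * np.eye(n), Q],
                  [Q.T, np.zeros((r, r))]])
    sol = np.linalg.solve(A, np.concatenate([y, np.zeros(r)]))
    alpha, gamma = sol[:n], sol[n:]
    beta = np.linalg.pinv(R) @ gamma             # minimum norm solution of gamma = R beta
    return alpha, beta

# Toy usage with the second order thin-plate kernel in d = 2, where P spans {1, x_1, x_2}.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = np.sign(X[:, 0] * X[:, 1] + 0.3 * rng.normal(size=50))        # labels in {-1, +1}
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
with np.errstate(divide="ignore", invalid="ignore"):
    Phi = np.where(dists > 0, dists**2 * np.log(dists), 0.0)
P = np.hstack([np.ones((50, 1)), X])
alpha, beta = cpd_svm_step(Phi, P, y, lam=1.0)
print(np.abs(P.T @ alpha).max() < 1e-6)                           # P^T alpha = 0 holds
```

In the full algorithm this solve is repeated, restricted at each step to the current margin violators, until the violator set stabilises.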
6.2 Constraining $\beta = 0$

Previously, if the data can be separated with only the $P$ part of the function space -- i.e. with $\alpha = 0$ -- then the algorithm will always do so, regardless of $\lambda$. This is correct in the sense that, since $P$ lies in the null space of the regulariser $\|\mathcal{P}(P) \cdot \|^2_{\mathcal{F}_{\phi}(\mathcal{X})}$, such solutions minimise (14); but it may be undesirable for various reasons. Firstly, the regularisation cannot be controlled via $\lambda$. Secondly, for the thin-plate, $P = \pi_1(\mathbb{R}^d)$ and the solutions are simple linear separating hyperplanes. Finally, there may exist infinitely many solutions to (14). It is unclear how to deal with this problem -- after all, it implies that the regulariser is simply inappropriate for the problem at hand. Nonetheless we still wish to apply a (non-linear) algorithm with the previously discussed invariances of the thin-plate. To achieve this, we minimise (14) as before, but over the space $\mathcal{F}_{\phi}(\mathcal{X})$ rather than $\mathcal{F}_{\phi}(\mathcal{X}) \oplus P$. It is important to note that by doing so we can no longer invoke Theorem 4.6, the representer theorem for the c.p.d. case. This is because the solvability argument of Lemma 4.3 no longer holds. Hence we do not know the optimal basis for the function, which may involve infinitely many $\phi(\cdot, x)$ terms. The way we deal with this is simple -- instead of minimising over $\mathcal{F}_{\phi}(\mathcal{X})$, we consider only the finite dimensional subspace given by
\[ \Big\{ \sum_{j=1}^n \alpha_j \phi(\cdot, x_j) : \alpha \in P^{\perp}(x_1, \ldots, x_n) \Big\}, \]
where $x_1, \ldots, x_n$ are those of the original problem (14). The required update equation can be acquired in a similar manner as before. The closed form solution to the constrained quadratic programme is in this case given by (see (Walder & Chapelle, 2007))
\[ \alpha = P_{\perp} \Big( P_{\perp}^{\top} \big( \lambda \Phi + \Phi_{s:}^{\top} \Phi_{s:} \big) P_{\perp} \Big)^{-1} P_{\perp}^{\top} \Phi_{s:}^{\top} y_s, \quad (18) \]
where $\Phi_{s:} = [\Phi]_{s,:}$, $s$ is the current set of margin violators, and $P_{\perp}$ spans the null space of $P^{\top}$, i.e. satisfies $P^{\top} P_{\perp} = 0$. The precise algorithm we use to optimise in this manner is given in the accompanying technical report (Walder & Chapelle, 2007), where we also detail efficient factorisation techniques.

7 Experiments and Discussion

We now investigate the behaviour of the algorithms which we have just discussed, namely the thin-plate based s.v.m. with 1) the optimisation over $\mathcal{F}_{\phi}(\mathcal{X}) \oplus P$ as per Section 6.1, and 2) the optimisation over a subspace of $\mathcal{F}_{\phi}(\mathcal{X})$ as per Section 6.2. In particular, we use the second method if the data is linearly separable, and otherwise the first. For a baseline we take the Gaussian kernel, $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$, and compare on real world classification problems.

Binary classification (UCI data sets). Table 1 provides numerical evidence supporting our claim that the thin-plate method is competitive with the Gaussian, in spite of its having one less hyper-parameter. The data sets are standard ones from the UCI machine learning repository. The experiments are extensive -- those on binary problems alone include all of the data sets used in (Mika et al., 2003) plus two additional ones (twonorm and splice). To compute each error measure, we used five splits of the data and tested on each split after training on the remainder. For parameter selection, we performed five fold cross validation on the four fifths of the data available for training each split, over an exhaustive search of the algorithm parameter(s) ($\lambda$ and $\sigma$ for the Gaussian, and just $\lambda$ for the thin-plate). We then took the parameter(s) with lowest mean error and retrained on the entire four fifths. We ensured that the chosen parameters were well within the searched range by visually inspecting the cross validation error as a function of the parameters. Happily, for the thin-plate we needed to cross validate over only the regularisation parameter $\lambda$, whereas for the Gaussian we had to choose both $\lambda$ and the scale parameter $\sigma$. The discovery of an equally effective algorithm which has only one parameter is important, since the Gaussian is probably the most popular and effective kernel used with the s.v.m. (Hsu et al., 2003).

Multi class classification (USPS data set). We also experimented with the 256 dimensional, ten class USPS digit recognition problem. For each of the ten one vs. the rest models we used five fold cross validation on the 7291 training examples to find the parameters, retrained on the full training set, and labelled the 2007 test examples according to the binary classifier with maximum output. The Gaussian misclassified 88 digits (4.38%), and the thin-plate 85 (4.25%). Hence the Gaussian did not perform significantly better, in spite of the extra parameter.

Computational complexity. The normal computational complexity of the c.p.d. s.v.m. algorithm is the usual $O(n_{\mathrm{sv}}^3)$ -- cubic in the number of margin violators. For the $\beta = 0$ variant (necessary only on linearly separable problems -- presently only the USPS set), however, the cost is $O(n_b^2 n_{\mathrm{sv}} + n_b^3)$, where $n_b$ is the number of basis functions in the expansion.
For our USPS experiments we expanded on all $m$ training points, but if $n_{\mathrm{sv}} \ll m$ this is inefficient and probably unnecessary. For example, the final ten models (those with optimal parameters) of the USPS problem had around 5% margin violators, and so training each Gaussian s.v.m. took only 40 s, in comparison to 17 minutes for the thin-plate (with the use of various efficient factorisation techniques, as detailed in the accompanying (Walder & Chapelle, 2007)). By expanding on only 1500 randomly chosen points, however, the training time was reduced to 4 minutes while incurring only 88 errors -- the same as the Gaussian. Given that for the thin-plate cross validation needs to be performed over one less parameter, even in this most unfavourable scenario of $n_{\mathrm{sv}} \ll m$ the overall times of the algorithms are comparable. Moreover, during cross validation one typically encounters larger numbers of violators for some suboptimal parameter configurations, in which case the Gaussian and thin-plate training times are comparable.

8 Conclusion

We have proven that there exist no non-trivial radial p.d. kernels which are dilation invariant (or, more accurately, dilation scaled), but rather only c.p.d. ones. Such kernels have the advantage that, to take the s.v.m. as an example, varying the absolute multiplicative scale (or length scale) of the data has the same effect as changing the regularisation parameter -- hence one needs model selection to choose only one of these, in contrast to the widely used Gaussian kernel, for example. Motivated by this advantage, we provide a new, efficient and stable algorithm for the s.v.m. with arbitrary c.p.d. kernels. Importantly, our experiments show that the performance of the algorithm nonetheless matches that of the Gaussian on real world problems.

The c.p.d. case has received relatively little attention in machine learning. Our results indicate that it is time to redress the balance. Accordingly, we have provided a compact introduction to the topic, including some novel analysis -- notably a new, elementary and self-contained derivation of one particularly important result for the machine learning community, the representer theorem.

References

Boughorbel, S., Tarel, J.-P., & Boujemaa, N. (2005). Conditionally positive definite kernels for SVM based image recognition. Proc. of IEEE ICME'05. Amsterdam.

Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation, 19, 1155-1178.

Chapelle, O., & Schölkopf, B. (2001). Incorporating invariances in nonlinear support vector machines. In T. Dietterich, S. Becker and Z. Ghahramani (Eds.), Advances in neural information processing systems 14, 609-616. Cambridge, MA: MIT Press.

Fleuret, F., & Sahbi, H. (2003). Scale-invariance of support vector machines based on the triangular kernel. Proc. of ICCV SCTV Workshop.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Baltimore, MD: The Johns Hopkins University Press. 2nd edition.

Hsu, C.-W., Chang, C.-C., & Lin, C.-J. (2003). A practical guide to support vector classification (Technical Report). National Taiwan University.

Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A., & Müller, K.-R. (2003). Constructing descriptive and discriminative non-linear features: Rayleigh coefficients in feature spaces. IEEE PAMI, 25, 623-628.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.
Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637-649.

Wahba, G. (1990). Spline models for observational data. Philadelphia: Series in Applied Math., Vol. 59, SIAM.

Walder, C., & Chapelle, O. (2007). Learning with transformation invariant kernels (Technical Report 165). Max Planck Institute for Biological Cybernetics, Department of Empirical Inference, Tübingen, Germany.

Wendland, H. (2004). Scattered data approximation. Monographs on Applied and Computational Mathematics. Cambridge University Press.