Dimension and Margin Bounds for Reflection-invariant Kernels

Thorsten Doliwa, Michael Kallweit, and Hans Ulrich Simon
Fakultät für Mathematik, Ruhr-Universität Bochum, D-44780 Bochum, Germany
{thorsten.doliwa,michael.kallweit,hans.simon}@rub.de

(This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views. This work was furthermore supported by the Deutsche Forschungsgemeinschaft Grant SI 498/8-1.)

Abstract

A kernel over the Boolean domain is said to be reflection-invariant if its value does not change when we flip the same bit in both arguments. (Many popular kernels have this property.) We study the geometric margins that can be achieved when we represent a specific Boolean function $f$ by a classifier that employs a reflection-invariant kernel. It turns out that $\|\hat f\|_\infty$ is an upper bound on the average margin. Furthermore, $\|\hat f\|_\infty^{-1}$ is a lower bound on the smallest dimension of a feature space associated with a reflection-invariant kernel that allows for a correct representation of $f$. This is, to the best of our knowledge, the first paper that exhibits margin and dimension bounds for specific functions (as opposed to function families). Several generalizations are considered as well. The main mathematical results are presented in a setting with arbitrary finite domains and a quite general notion of invariance.

1 Introduction

There has been much interest in margin and dimension bounds during the last decade. The simplest way to cast (most of) the existing results in this direction is offered by the notion of margin and dimension complexity associated with a given sign matrix $A \in \{-1,1\}^{m\times n}$. A linear arrangement, given by unit vectors $u_1,\dots,u_m;\,v_1,\dots,v_n$ (taken from an inner product space), is said to represent $A$ if, for all $i = 1,\dots,m$ and $j = 1,\dots,n$, $A_{i,j} = \mathrm{sign}(\langle u_i, v_j\rangle)$. The dimension complexity of $A$ is the smallest dimension of an inner product space that allows for such a representation. The margin complexity is obtained similarly by looking for the linear arrangement that leads to the maximum average margin (or, alternatively, to the maximum margin that can be guaranteed for all choices of $i$ and $j$). Applying counting arguments, Ben-David, Eiron, and Simon [1] have shown that, loosely speaking, an overwhelming majority of sign matrices of small VC-dimension do not allow for a linear arrangement whose margin or dimension is significantly better than what can be guaranteed in a trivial fashion. Starting with Forster's celebrated exponential lower bound on the dimension complexity of the Walsh-Hadamard matrix [4], there has been a series of papers [5, 6, 10, 7, 13, 15] presenting (increasingly powerful) techniques for deriving upper margin bounds or lower dimension bounds on the complexity of sign matrices.

Note that a sign matrix represents a family of Boolean functions, one Boolean function per column say. The lack of non-trivial margin or dimension bounds for a specific Boolean function has a simple explanation: a specific function $f(x)$ can always trivially be represented in a 1-dimensional space with geometric margin 1 by mapping an instance $x \in \{-1,1\}^n$ to $f(x) \in \{-1,1\}$. The corresponding kernel would map a pair $(x, x')$ of instances to $1$ if $f(x) = f(x')$, and to $-1$ otherwise.
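The trivial representation just described is easy to make concrete. The following small sketch (our own illustration, not part of the paper) builds the corresponding kernel $K(x,x') = f(x)\,f(x')$ for an arbitrary target $f$, confirms that it is positive semidefinite, and that the induced 1-dimensional arrangement separates with geometric margin 1.

```python
import numpy as np

# Illustration of the trivial 1-dimensional representation from the
# introduction: Phi(x) = f(x) and K(x, x') = f(x) * f(x'), which equals +1
# iff f(x) = f(x') and -1 otherwise.
rng = np.random.default_rng(0)
n = 4
f = rng.choice([-1, 1], size=2 ** n)              # an arbitrary target function on {-1,1}^n

K = np.outer(f, f)                                # the corresponding kernel matrix
assert np.all(np.linalg.eigvalsh(K) >= -1e-9)     # symmetric positive semidefinite (rank 1)

# With weight vector w = 1 in the feature space Phi(x) = f(x), the geometric
# margin f(x) * w * Phi(x) / (|w| * |Phi(x)|) equals 1 for every instance x.
margins = f * f / np.abs(f)
print(margins.min(), margins.max())               # -> 1 1
```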
Clearly, the 1-dimensional "linear arrangement" for $f$ does not say much about the ability of kernel-based large margin classifier systems to "learn" $f$ because we would need to know $f$ perfectly prior to the choice of the kernel. (If we had this knowledge, there would be nothing to learn anymore.) Nevertheless, this discussion shows that one cannot expect non-trivial margin or dimension bounds for specific functions that hold uniformly for all kernels.

In this paper, we introduce the concept of distributed functions that are invariant under a group $\mathcal{G}$ of transformations. We present the mathematical results about invariant distributed functions in a quite general setting (because it does not make sense to impose unnecessary restrictions). In particular, we derive non-trivial margin and dimension bounds for specific Boolean functions that are valid for all linear arrangements resulting from $\mathcal{G}$-invariant kernels. If the domain of the distributed function can be cast as a finite Abelian group, the margin and dimension bounds for a function $f$ can be nicely expressed in terms of $f$'s Fourier-spectrum. As always, $\|\hat f\|_\infty$ denotes the largest absolute value found in the spectrum of $f$'s Fourier-coefficients. We show that $\|\hat f\|_\infty$ is an upper bound on the largest possible average margin, and $\|\hat f\|_\infty^{-1}$ is a lower bound on the smallest possible dimension. Our general results easily apply to a special case of high learning-theoretic relevance, namely the reflection-invariant kernels. Their relevance comes from the fact that, as demonstrated in the paper, many popular kernels actually happen to be reflection-invariant.

The remainder of the paper is structured as follows. In Section 2, we fix some notation and recall some facts about Fourier-expansions over finite Abelian groups and kernel-based classification. In Section 3, we present our results for arbitrary finite domains and a quite general notion of invariance. In Section 4, we introduce the concept of rotation-invariance and mention some connections between the Fourier-expansion over an arbitrary finite Abelian group and the spectral decomposition of such functions. In Section 5, we consider distributed functions over the Boolean domain and the concept of reflection-invariance, which is simply rotation-invariance over a Boolean domain. Section 6 presents the margin and dimension bounds that are valid for reflection-invariant kernels. Section 7 offers a possible interpretation of our results, and mentions a connection to a recent paper by Haasdonk and Burkhardt [8] along with some open problems.

2 Definitions and Notations

We assume familiarity with basics in matrix and learning theory. For example, notions like
- singular values, eigenvalues, spectral norm,
- kernels, feature map, Reproducing Kernel Hilbert Space
are assumed as known (although we shall occasionally refresh the reader's memory). Some central definitions and facts concerning
- linear arrangements representing a given sign matrix,
- margin and dimension associated with such a linear arrangement
will be given later in the paper at the place where they are required. In the following we fix some notation and recall the Fourier-expansion over finite Abelian groups as well as the notion of margin in kernel-based classification.

2.1 Preliminaries

Throughout the paper, $\delta$ denotes the Kronecker symbol, i.e., $\delta(a,b) = 1$ if $a = b$ and $\delta(a,b) = 0$ otherwise. For two $n$-dimensional vectors $x, y$, we define $x \circ y$ to be the vector obtained by multiplying $x$ and $y$ componentwise, i.e., $(x \circ y)_i := x_i y_i$ for $i = 1,\dots,n$. The $n$-dimensional "all-ones vector" is given by $e = (1,\dots,1)$. The vector with $1$ in component $k$ and zeros elsewhere is denoted $e_k$. We consider functions over a finite domain $D$ with values in $\mathbb{R}$ (or in $\mathbb{C}$, resp.). These functions form a $|D|$-dimensional vector space. A distributed function over $D$ is a function over the domain $D \times D$. We will occasionally identify a distributed function $f$ over $D$ with the $(D \times D)$-matrix $F$ given by $F_{x,y} = f(x,y)$.

2.2 Fourier-expansions over Finite Abelian Groups

Let $(D,+)$ be a finite Abelian group of size $d = |D|$. A function $\chi: D \to \mathbb{C}$ is called a character over $D$ if, for every $x, y \in D$, $\chi(x+y) = \chi(x)\cdot\chi(y)$. It is well-known that there are exactly $d$ characters, and they form an orthonormal basis of the vector space $\mathbb{C}^D$ with respect to the inner product

$\langle f, g\rangle := \frac{1}{d}\sum_{x\in D} f(x)\cdot\overline{g(x)}$ .  (1)

We may fix a bijection between $D$ and the set of characters and write $\chi_z$ for the character that corresponds to $z \in D$. Every function $f: D \to \mathbb{C}$ can be written in the form

$f(x) = \sum_{z\in D} \hat f(z)\cdot\chi_z(x)$  (2)

where

$\hat f(z) := \langle f, \chi_z\rangle = \frac{1}{d}\sum_{y\in D} f(y)\cdot\overline{\chi_z(y)}$ .

Equation (2) is referred to as the Fourier expansion of $f$, and $\hat f(z)$ is called the Fourier-coefficient of $f$ at $z$. According to the "Fundamental Theorem for Finitely Generated Abelian Groups", every finite Abelian group is, up to isomorphism, of the form

$D = \mathbb{Z}_{q_1} \times \cdots \times \mathbb{Z}_{q_n}$  (3)

for some sequence $q_1,\dots,q_n$ of prime powers. Equation (3) is assumed henceforth so that $d = |D| = \prod_{k=1}^n q_k$. It is well-known that the characters over $\mathbb{Z}_m$ are given by $\chi_j^{(m)}(k) = \omega_m^{jk}$, where $\omega_m = \exp\!\left(\frac{2\pi i}{m}\right)$ is a primitive root of unity of order $m$. The characters over $D$ are then given by

$\chi_z(x) = \prod_{k=1}^n \chi_{z_k}^{(q_k)}(x_k)$ .

Consider now the matrix $H = (H_{x,z})_{x,z\in D}$ given by

$H_{x,z} = \chi_z(x)$ .  (4)

It is obvious that $H$ is symmetric. By the orthonormality of the characters with respect to the inner product in (1), it follows that

$H\cdot\bar H = \bar H\cdot H = d\cdot I$ ,

where $I$ denotes the identity matrix.
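The following short sketch (our own, with the group sizes chosen arbitrarily) makes the objects of Section 2.2 concrete: it builds the characters of $D = \mathbb{Z}_3 \times \mathbb{Z}_4$, checks the identity $H\cdot\bar H = d\cdot I$, and recovers an arbitrary function from its Fourier-coefficients as in (2).

```python
import numpy as np
from itertools import product

# A minimal sketch of the Fourier expansion (2) over D = Z_q1 x Z_q2,
# using the characters chi_z(x) = prod_k omega_{q_k}^{z_k * x_k}.
qs = (3, 4)
D = list(product(*[range(q) for q in qs]))
d = len(D)

def chi(z, x):
    return np.prod([np.exp(2j * np.pi * zk * xk / qk)
                    for zk, xk, qk in zip(z, x, qs)])

H = np.array([[chi(z, x) for z in D] for x in D])   # H_{x,z} = chi_z(x), cf. (4)
assert np.allclose(H @ H.conj(), d * np.eye(d))     # H * conj(H) = d * I

f = np.random.randn(d)                              # an arbitrary function on D
f_hat = (H.conj().T @ f) / d                        # hat f(z) = <f, chi_z>, cf. (1)
assert np.allclose(H @ f_hat, f)                    # the Fourier expansion (2) recovers f
```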
2.3 Kernel-based Classification

Let $K: D\times D \to \mathbb{R}$ be a valid kernel over a finite domain $D$. In other words, $K(x,y)$ is a real-valued distributed function over $D$ which, considered as a matrix, is symmetric and positive semidefinite. Let $\Phi = \Phi_K$ be the feature map and $\langle\cdot,\cdot\rangle_K$ the inner product that represent $K$ in the Reproducing Kernel Hilbert Space, and let $\|\cdot\|_K$ be the norm induced by $\langle\cdot,\cdot\rangle_K$. (In the sequel, we drop the index $K$ unless we would like to stress the dependence on $K$.) Then $\Phi$ satisfies

$\forall x, y \in D:\ K(x,y) = \langle\Phi(x), \Phi(y)\rangle$ .

With every "dual vector" $\alpha: D \to \mathbb{R}$, we associate the "weight vector"

$w(\alpha) := \sum_{x\in D} \alpha(x)\,\Phi(x)$ .  (5)

In the context of "large margin classification", $\alpha$ is considered as a classifier that assigns the label $\mathrm{sign}(\langle w(\alpha), \Phi(x)\rangle)$ to input $x$. Consider a target function $f: D \to \{-1,1\}$ for a binary classification task. Then, a negative sign of $f(x)\cdot\langle w(\alpha), \Phi(x)\rangle$ indicates a "classification error" on $x$. So this expression should be positive, and it is intuitively even better when it leads to a large positive value. Thus, the following number, called the (geometric) margin achieved by $\alpha$ on $x$ w.r.t. target function $f$ and kernel $K$, is of interest:

$\mu_K(f|\alpha, x) := \dfrac{f(x)\cdot\langle w(\alpha), \Phi(x)\rangle}{\|w(\alpha)\|\cdot\|\Phi(x)\|}$  (6)

By averaging over all $x \in D$, we obtain the average margin

$\bar\mu_K(f|\alpha) := \frac{1}{|D|}\sum_{x\in D} \mu_K(f|\alpha, x)$ .

Focusing on the margin that is guaranteed for every $x \in D$, we should instead consider the function

$\mu_K(f|\alpha) := \min_{x\in D}\, \mu_K(f|\alpha, x)$ .

By taking the supremum over all $\alpha: D \to \mathbb{R}$, we get the respective parameters of a large margin classifier employing kernel function $K$:

$\bar\mu_K(f) := \sup_{\alpha: D\to\mathbb{R}} \bar\mu_K(f|\alpha)$ and $\mu_K(f) := \sup_{\alpha: D\to\mathbb{R}} \mu_K(f|\alpha)$ .

Finally, taking the supremum ranging over all $K$ from a given kernel class $\mathcal{C}$, we get the respective parameters of a best possible large margin classifier among those that employ a kernel from $\mathcal{C}$:

$\bar\mu_{\mathcal{C}}(f) := \sup_{K\in\mathcal{C}} \bar\mu_K(f)$ and $\mu_{\mathcal{C}}(f) := \sup_{K\in\mathcal{C}} \mu_K(f)$ .

We briefly note that, obviously, the guaranteed margin is upper bounded by the average margin:

$\mu_K(f|\alpha) \le \bar\mu_K(f|\alpha)$ , $\mu_K(f) \le \bar\mu_K(f)$ , $\mu_{\mathcal{C}}(f) \le \bar\mu_{\mathcal{C}}(f)$ .
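Because $\langle w(\alpha),\Phi(x)\rangle = (K\alpha)(x)$, $\|w(\alpha)\|_K^2 = \alpha^\top K\alpha$, and $\|\Phi(x)\|_K^2 = K(x,x)$, all margin quantities of this subsection can be computed from the kernel matrix alone. The following sketch (helper names and random data are our own) does exactly that.

```python
import numpy as np

# A minimal sketch of the margin quantities in Section 2.3, expressed purely
# through the kernel matrix K:
#   <w(alpha), Phi(x)>  = (K alpha)(x)
#   ||w(alpha)||_K      = sqrt(alpha^T K alpha)
#   ||Phi(x)||_K        = sqrt(K(x, x))
def margins(K, f, alpha):
    Kalpha = K @ alpha
    w_norm = np.sqrt(alpha @ Kalpha)
    phi_norms = np.sqrt(np.diag(K))
    return f * Kalpha / (w_norm * phi_norms)      # mu_K(f | alpha, x), cf. (6)

rng = np.random.default_rng(0)
d = 8
B = rng.standard_normal((d, d))
K = B @ B.T                                       # a random positive semidefinite kernel matrix
f = rng.choice([-1, 1], size=d)                   # target labels
alpha = rng.standard_normal(d)                    # a dual vector

mu = margins(K, f, alpha)
print("average margin:   ", mu.mean())            # corresponds to the average margin
print("guaranteed margin:", mu.min())             # corresponds to min_x mu_K(f | alpha, x)
```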
3 A General Notion of Invariance

Throughout this section, $D$ denotes an arbitrary finite domain, $S(D)$ is the group of permutations over $D$, and $\mathcal{G} \subseteq S(D)$ is an arbitrary but fixed subgroup. A distributed function over $D$ with values in $V$ is said to be $\mathcal{G}$-invariant if, for all $x, y \in D$ and every $\sigma \in \mathcal{G}$, the following holds:

$f(\sigma(x), \sigma(y)) = f(x,y)$ .

We clearly have the Pointwise Closure Property: the pointwise limit of $\mathcal{G}$-invariant functions is a $\mathcal{G}$-invariant function. Furthermore, if $f_1,\dots,f_d$ are $\mathcal{G}$-invariant functions and $g: V^d \to W$ is an arbitrary function with values in $W$, then $g(f_1(x,y),\dots,f_d(x,y))$ is $\mathcal{G}$-invariant too. More interesting is the following result:

Lemma 1. $\mathcal{G}$-invariant distributed functions over a finite domain $D$ are closed under the usual matrix product and under the tensor-product of matrices. More precisely, let $F(x,y)$ and $G(x,y)$ be two $\mathcal{G}$-invariant distributed functions (here viewed as matrices). Then the function $(F\cdot G)(x,y)$ is $\mathcal{G}$-invariant, and the function $(F\otimes G)[(u,x),(v,y)]$ is invariant over $\mathcal{G}\times\mathcal{G}$ (as a subgroup of $S(D)\times S(D)$).

Proof. Consider first the function $(F\cdot G)(x,y)$. Let $x, y \in D$ and $\sigma \in \mathcal{G}$ be arbitrary but fixed. The following calculation shows that it is $\mathcal{G}$-invariant:

$(F\cdot G)_{\sigma(x),\sigma(y)} = \sum_{z\in D} F_{\sigma(x),z}\cdot G_{z,\sigma(y)} = \sum_{z\in D} F_{x,\sigma^{-1}(z)}\cdot G_{\sigma^{-1}(z),y} = \sum_{z\in D} F_{x,z}\cdot G_{z,y} = (F\cdot G)_{x,y}$ .

Now consider the tensor-product $(F\otimes G)[(u,x),(v,y)]$, which is a distributed function over $D\times D$, i.e., a function over the domain $(D\times D)\times(D\times D)$. The following calculation shows that it is $(\mathcal{G}\times\mathcal{G})$-invariant:

$(F\otimes G)[(\sigma(u),\tau(x)),(\sigma(v),\tau(y))] = F(\sigma(u),\sigma(v))\cdot G(\tau(x),\tau(y)) = F(u,v)\cdot G(x,y) = (F\otimes G)[(u,x),(v,y)]$ .
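A quick numerical sanity check of Lemma 1 (our own, using the group of cyclic shifts on $\mathbb{Z}_d$ as $\mathcal{G}$): two $\mathcal{G}$-invariant matrices are generated, their invariance is verified entrywise, and so is the invariance of their matrix product.

```python
import numpy as np

# Numerical check of Lemma 1 (not from the paper): for the cyclic group G of
# shifts sigma_a(x) = x + a mod d, a G-invariant distributed function satisfies
# F[(x+a) % d, (y+a) % d] == F[x, y]; the matrix product of two such functions
# is again G-invariant.
rng = np.random.default_rng(1)
d = 6

def random_invariant():
    g = rng.standard_normal(d)                    # F(x, y) = g(x - y mod d)
    return np.array([[g[(x - y) % d] for y in range(d)] for x in range(d)])

def is_invariant(F):
    return all(np.allclose(F, F[np.ix_([(x + a) % d for x in range(d)],
                                       [(y + a) % d for y in range(d)])])
               for a in range(d))

F, G = random_invariant(), random_invariant()
assert is_invariant(F) and is_invariant(G)
assert is_invariant(F @ G)                        # closure under the matrix product
print("Lemma 1 confirmed on this example")
```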
In this section, we shall show the following. If $f: D \to \{-1,1\}$ is a function on domain $D$ and $\mathcal{G}$ is a subgroup of $S(D)$, then the largest average (or largest guaranteed, resp.) margin that can be obtained when $f$ is represented by a $\mathcal{G}$-invariant kernel is upper-bounded by the largest average (or largest guaranteed, resp.) margin that can be obtained for the family

$\mathcal{G}f := \{f_\sigma : \sigma\in\mathcal{G}\}$ where $f_\sigma(x) := f(\sigma(x))$ .

Since there are classical margin bounds that apply to the family $\mathcal{G}f$, we obtain corresponding bounds that apply to the single function $f$. An analogous remark holds for dimension bounds. Details follow.

Assume that $K(x,y)$ is a $\mathcal{G}$-invariant kernel and consider the feature map $\Phi = \Phi_K$ that represents $K$ in the Reproducing Kernel Hilbert Space. Then, for all $x, y \in D$ and every $\sigma\in\mathcal{G}$, $\Phi$ satisfies

$\langle\Phi(\sigma(x)), \Phi(\sigma(y))\rangle = \langle\Phi(x), \Phi(y)\rangle$ .  (7)

Lemma 2. If kernel $K$ is $\mathcal{G}$-invariant, then the following holds for every $x\in D$ and every $\sigma\in\mathcal{G}$ (where, analogously to $f_\sigma$, we set $\alpha_\sigma(x) := \alpha(\sigma(x))$):

$\|\Phi_K(\sigma(x))\|_K = \|\Phi_K(x)\|_K$ and $\|w(\alpha_\sigma)\|_K = \|w(\alpha)\|_K$ .

In other words, the norm $\|\cdot\|_K$ is constant on feature vectors of instances taken from the same orbit $x^{\mathcal{G}} := \{\sigma(x): \sigma\in\mathcal{G}\}$, and it assigns the same value to all dual vectors from the set $\{w(\alpha_\sigma): \sigma\in\mathcal{G}\}$.

Proof. Let $\Phi = \Phi_K$, $\|\cdot\| = \|\cdot\|_K$, and $\langle\cdot,\cdot\rangle = \langle\cdot,\cdot\rangle_K$. Clearly, $\|\Phi(\sigma(x))\| = \|\Phi(x)\|$ because of

$\|\Phi(\sigma(x))\|^2 = \langle\Phi(\sigma(x)), \Phi(\sigma(x))\rangle \stackrel{(7)}{=} \langle\Phi(x), \Phi(x)\rangle = \|\Phi(x)\|^2$ .

As for the second statement, see the following calculation:

$\|w(\alpha_\sigma)\|^2 = \langle w(\alpha_\sigma), w(\alpha_\sigma)\rangle \stackrel{(5)}{=} \sum_{x,y\in D} \alpha(\sigma(x))\,\alpha(\sigma(y))\,\langle\Phi(x), \Phi(y)\rangle = \sum_{x,y\in D} \alpha(x)\,\alpha(y)\,\langle\Phi(\sigma^{-1}(x)), \Phi(\sigma^{-1}(y))\rangle \stackrel{(7)}{=} \sum_{x,y\in D} \alpha(x)\,\alpha(y)\,\langle\Phi(x), \Phi(y)\rangle = \|w(\alpha)\|^2$ .

Lemma 3. For every $\mathcal{G}$-invariant kernel $K$, and every choice of $f: D\to\{-1,1\}$, $x\in D$, $\sigma\in\mathcal{G}$, and $\alpha: D\to\mathbb{R}$, the following holds:

$\mu_K(f_\sigma|\alpha_\sigma, x) = \mu_K(f|\alpha, \sigma(x))$ .

Proof. The proof starts as follows:

$f_\sigma(x)\cdot\langle w(\alpha_\sigma), \Phi(x)\rangle \stackrel{(5)}{=} f(\sigma(x))\sum_{y\in D}\alpha(\sigma(y))\,\langle\Phi(y), \Phi(x)\rangle \stackrel{(7)}{=} f(\sigma(x))\sum_{y\in D}\alpha(\sigma(y))\,\langle\Phi(\sigma(y)), \Phi(\sigma(x))\rangle = f(\sigma(x))\sum_{y\in D}\alpha(y)\,\langle\Phi(y), \Phi(\sigma(x))\rangle = f(\sigma(x))\cdot\langle w(\alpha), \Phi(\sigma(x))\rangle$ .

Using this calculation in combination with Lemma 2, the proof is easy to accomplish:

$\mu_K(f_\sigma|\alpha_\sigma, x) \stackrel{(6)}{=} \dfrac{f_\sigma(x)\cdot\langle w(\alpha_\sigma), \Phi(x)\rangle}{\|w(\alpha_\sigma)\|\cdot\|\Phi(x)\|} = \dfrac{f(\sigma(x))\cdot\langle w(\alpha), \Phi(\sigma(x))\rangle}{\|w(\alpha)\|\cdot\|\Phi(\sigma(x))\|} \stackrel{(6)}{=} \mu_K(f|\alpha, \sigma(x))$ .

Corollary 4. For every $\mathcal{G}$-invariant kernel $K$, and every choice of $f: D\to\{-1,1\}$, $\sigma\in\mathcal{G}$, and $\alpha: D\to\mathbb{R}$, the following holds (where $\mu_{\mathcal{G}}$ and $\bar\mu_{\mathcal{G}}$ refer to the class of all $\mathcal{G}$-invariant kernels):

$\mu_K(f_\sigma|\alpha_\sigma) = \mu_K(f|\alpha)$ , $\bar\mu_K(f_\sigma|\alpha_\sigma) = \bar\mu_K(f|\alpha)$ ,
$\mu_K(f_\sigma) = \mu_K(f)$ , $\bar\mu_K(f_\sigma) = \bar\mu_K(f)$ ,
$\mu_{\mathcal{G}}(f_\sigma) = \mu_{\mathcal{G}}(f)$ , $\bar\mu_{\mathcal{G}}(f_\sigma) = \bar\mu_{\mathcal{G}}(f)$ .

Note that the last two equations in Corollary 4 basically say that the largest (average or guaranteed) margin that can be achieved for a function $f$ by a large margin classifier is invariant under $\mathcal{G}$ (provided that the underlying kernel is $\mathcal{G}$-invariant).

Let $M \in \{-1,1\}^{r\times s}$ be a sign matrix. Consider a linear arrangement $A$ given by unit vectors $u_1,\dots,u_r;\,v_1,\dots,v_s \in \mathbb{R}^d$. The average margin achieved by this arrangement for sign matrix $M$ is defined as follows:

$\mu(M|A) := \frac{1}{rs}\sum_{i=1}^r\sum_{j=1}^s M_{i,j}\,\langle u_i, v_j\rangle$ .

The largest average margin that can be achieved for sign matrix $M$ by any linear arrangement is then given by

$\mu(M) := \sup_A \mu(M|A)$ ,

where the supremum ranges over all linear arrangements $A$ for $M$. Forster and Simon [7] have shown that, for every $M\in\{-1,1\}^{r\times s}$, every $d\ge 1$, and every choice of unit vectors $u_1,\dots,u_r;\,v_1,\dots,v_s$ in a real inner-product space, the following holds:

$\sum_{i=1}^r\sum_{j=1}^s M_{i,j}\,\langle u_i, v_j\rangle \le \sqrt{rs}\,\|M\|$ .

From that, we conclude that $\mu(M) \le \|M\|/\sqrt{rs}$.

Consider the sign matrix $M^{f,\mathcal{G}}$ given by

$M^{f,\mathcal{G}}_{x,\sigma} := f_\sigma(x)$ .  (8)

In combination with Corollary 4, we arrive at the following result.

Theorem 5. Let $D$ be a finite domain, and let $\mathcal{G}$ be a subgroup of $S(D)$. Then, every function $f: D\to\{-1,1\}$ satisfies

$\bar\mu_{\mathcal{G}}(f) \le \dfrac{\|M^{f,\mathcal{G}}\|}{\sqrt{|D|\cdot|\mathcal{G}|}}$ .

In other words, no large margin classifier that employs a $\mathcal{G}$-invariant kernel can achieve an average margin for $f$ which exceeds $\|M^{f,\mathcal{G}}\|/\sqrt{|D|\cdot|\mathcal{G}|}$.

As our input space $D$ is finite, we can assume without loss of generality that the Reproducing Kernel Hilbert Space for a kernel $K$ on $D$ coincides with $\mathbb{R}^{d(K)}$ for some suitable $1 \le d(K) \le |D|$. We say that $\alpha: D\to\mathbb{R}$ represents target function $f$ correctly w.r.t. kernel $K$ if

$\forall x\in D:\ \mu_K(f|\alpha, x) > 0$ .

Corollary 6. Let $d_{\mathcal{G}}(f)$ denote the smallest dimension of a feature space associated with a $\mathcal{G}$-invariant kernel $K$ that allows for a correct representation of $f$. Then,

$d_{\mathcal{G}}(f) \ge \dfrac{\sqrt{|D|\cdot|\mathcal{G}|}}{\|M^{f,\mathcal{G}}\|}$ .

Proof. According to Lemma 3, a kernel that allows for a correct representation of $f$ also allows for a correct representation of all $f_\sigma$. According to a result by Forster [4], the corresponding feature space must have dimension at least $\sqrt{|D|\cdot|\mathcal{G}|}\,/\,\|M^{f,\mathcal{G}}\|$.

Corollary 6 can be strengthened slightly:

Corollary 7. Let $\sigma_i$ denote the $i$-th singular value of $(|D|\cdot|\mathcal{G}|)^{-1/2}\,M^{f,\mathcal{G}}$, where $\sigma_1, \sigma_2, \dots$ are in decreasing order. Then, $d_{\mathcal{G}}(f)$ satisfies the following lower bound:

$d_{\mathcal{G}}(f)\cdot\sum_{i=1}^{d_{\mathcal{G}}(f)} \sigma_i^2 \ \ge\ 1$ .  (9)

Proof. Let $A \in \{-1,1\}^{r\times s}$ be a matrix whose columns are viewed as binary functions $f_1,\dots,f_s$. It has been shown by Forster and Simon [7] that the dimension $d$ of a feature space which allows for a correct representation of $f_1,\dots,f_s$ satisfies

$d\cdot\sum_{i=1}^{d} \sigma_i^2(A) \ \ge\ rs$ .

This trivially implies (9).
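To see Theorem 5 and Corollary 6 in action, the following sketch (the choice of domain, group and target function is ours) builds the sign matrix $M^{f,\mathcal{G}}$ from (8) for $D = \mathbb{Z}_d$ with the group of cyclic shifts, and evaluates both bounds from its spectral norm.

```python
import numpy as np

# A hypothetical illustration of the sign matrix M^{f,G} from (8) and of the
# bounds of Theorem 5 and Corollary 6, for D = Z_d and the group G of cyclic
# shifts sigma_a(x) = x + a mod d.
rng = np.random.default_rng(2)
d = 16
f = rng.choice([-1, 1], size=d)                   # an arbitrary target function on Z_d

M = np.array([[f[(x + a) % d] for a in range(d)] for x in range(d)])  # M_{x,a} = f(x+a)
spectral_norm = np.linalg.norm(M, 2)

margin_bound = spectral_norm / np.sqrt(d * d)     # Theorem 5: average margin <= this
dimension_bound = np.sqrt(d * d) / spectral_norm  # Corollary 6: dimension >= this
print(f"margin bound {margin_bound:.3f}, dimension bound {dimension_bound:.3f}")
```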
4 Rotation-invariant Functions

In Section 4.1 we will derive some facts about distributed functions over a finite Abelian group via the Fourier-expansion. Section 4.2 ties everything together and presents the resulting margin and dimension bounds obtained in this restricted setting.

4.1 Distributed Functions over Finite Abelian Groups

We apply the results of the preceding section to the case where $D$ is an Abelian group of finite size $d$, and $\mathcal{G}_{rot}$ is the subgroup of $S(D)$ consisting of all permutations of the form $x \mapsto x + a$. Note that $d = |D| = |\mathcal{G}_{rot}|$. We are interested in distributed functions $f: D\times D \to \mathbb{C}$ and arrange the $d^2$ Fourier-coefficients of such a function as a matrix as follows:

$\hat F_{a,b} = \hat f(a,-b)$  (10)
$\quad\quad\ = d^{-2}\sum_{(x,y)\in D\times D} f(x,y)\,\overline{\chi_{(a,-b)}(x,y)}$  (11)
$\quad\quad\ = d^{-2}\sum_{x,y\in D} f(x,y)\,\overline{\chi_a(x)}\,\chi_b(y)$  (12)

In matrix notation, this reads as

$\hat F = d^{-2}\cdot\bar H\cdot F\cdot H$ ,  (13)

where $H$ is the matrix from (4).

A distributed function $f(x,y)$ over $D$ is said to be rotation-invariant if, for all $x, y, a\in D$, the following holds:

$f(x+a, y+a) = f(x,y)$ .

In the sense of the previous section, $f$ is meant to be $\mathcal{G}_{rot}$-invariant. Here are some examples of rotation-invariant functions:
- A distributed function of the form $f(x,y) = g(x-y)$ is obviously rotation-invariant. Conversely, any rotation-invariant function $f(x,y)$ can be written in this form by setting $g(x) := f(x,0)$, because rotation-invariance implies that $f(x,y) = f(x-y, 0) = g(x-y)$.
- Because of the obvious identity $\chi_z(x-y) = \chi_z(x)\cdot\overline{\chi_z(y)}$, the distributed function $\chi_z(x)\cdot\overline{\chi_z(y)}$ is rotation-invariant too.

The fact that $f(x,y) = g(x-y)$ is a rotation-invariant function can be restated as follows: any function $f(x,y)$ that can be cast as a function in $x_1-y_1 \bmod q_1, \dots, x_n-y_n \bmod q_n$ is rotation-invariant.

In terms of the matrix of Fourier-coefficients $\hat F$, rotation-invariant functions over $D$ can be characterized as follows:

Lemma 8. A distributed function $f(x,y)$ over $D$ is rotation-invariant iff $\hat F$ is a diagonal matrix.

Proof. Assume first that $f(x,y)$ is rotation-invariant. Consider a Fourier-coefficient in $\hat F$ outside the main diagonal, say $\hat F_{a,b}$ so that $a_k \ne b_k$ for some $k$. Every pair $(x,y)$ can be put into the equivalence class

$\{(x + j e_k,\ y + j e_k) : j = 0,\dots,q_k-1\}$ .

We show that every equivalence class contributes $0$ to (12):

$\sum_{j=0}^{q_k-1} f(x+j e_k,\, y+j e_k)\,\overline{\chi_a(x+j e_k)}\,\chi_b(y+j e_k) = f(x,y)\,\overline{\chi_a(x)}\,\chi_b(y)\cdot\sum_{j=0}^{q_k-1} \overline{\chi_{a_k}^{(q_k)}(j)}\,\chi_{b_k}^{(q_k)}(j)$ .

The latter sum vanishes because it equals $\sum_{j=0}^{q_k-1} \big(\omega_{q_k}^{b_k-a_k}\big)^j$. Recall that $\delta$ denotes the Kronecker symbol; it is well-known that

$\sum_{j=0}^{m-1} \big(\omega_m^{\,l-l'}\big)^j = m\cdot\delta(l,l')$ .

This shows that $\hat F_{a,b} = 0$.

Now assume that $\hat F$ is a diagonal matrix. We conclude from (13) that $F = H\cdot\hat F\cdot\bar H$, which implies that

$F_{x,y} = \sum_{z\in D}\hat F_{z,z}\cdot\chi_x(z)\cdot\overline{\chi_y(z)}$ .  (14)

Rotation-invariance is now easily obtained:

$f(x+a, y+a) = \sum_{z\in D}\hat F_{z,z}\cdot\chi_{x+a}(z)\cdot\overline{\chi_{y+a}(z)} = \sum_{z\in D}\hat F_{z,z}\cdot\chi_z(x+a)\cdot\overline{\chi_z(y+a)} = \sum_{z\in D}\hat F_{z,z}\cdot\chi_z(x)\cdot\overline{\chi_z(y)} = f(x,y)$ .

In the second-last equation, we used the rotation-invariance of $\chi_z(x)\cdot\overline{\chi_z(y)}$.
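Lemma 8 is easy to verify numerically. The following sketch (our own check, with $D = \mathbb{Z}_d$ and a randomly generated $g$) builds a rotation-invariant $F(x,y) = g(x-y)$, computes $\hat F$ via (13), and confirms that it is diagonal with the Fourier coefficients of $g$ on the diagonal (Lemma 10 below).

```python
import numpy as np

# A small check of Lemma 8 (and Lemma 10): for a rotation-invariant distributed
# function F(x, y) = g(x - y) over D = Z_d, the matrix of Fourier coefficients
# hat F = d^-2 * conj(H) F H is diagonal, and its diagonal carries the Fourier
# coefficients of g.
d = 8
x = np.arange(d)
H = np.exp(2j * np.pi * np.outer(x, x) / d)       # H_{x,z} = chi_z(x), cf. (4)

g = np.random.default_rng(5).standard_normal(d)
F = np.array([[g[(i - j) % d] for j in range(d)] for i in range(d)])

F_hat = H.conj() @ F @ H / d**2                   # equation (13)
off_diag = F_hat - np.diag(np.diag(F_hat))
assert np.allclose(off_diag, 0)                   # Lemma 8: hat F is diagonal

g_hat = H.conj().T @ g / d                        # Fourier coefficients of g
assert np.allclose(np.diag(F_hat), g_hat)         # diagonal entries = hat g(z)
```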
Corollary 9. Assume that $f(x,y)$ is a rotation-invariant distributed function over $D$ and let $F_{x,y} = f(x,y)$ denote the corresponding matrix. Then the (complex) eigenvalues of $d^{-1}\cdot F$ are found on the main diagonal of $\hat F$.

Proof. Rewrite (14) as

$d^{-1} F = (d^{-1/2} H)\cdot\hat F\cdot(d^{-1/2}\bar H)$

and observe that this is nothing but the spectral decomposition of $d^{-1} F$ (since $\hat F$ is a diagonal matrix and $d^{-1/2} H$ is unitary).

We briefly note the following result:

Lemma 10. Let $\hat F$ be the (diagonal) matrix that contains the Fourier-coefficients of the (rotation-invariant) distributed function $f(x-y)$. Then, for every $z\in D$, $\hat f(z) = \hat F_{z,z}$.

Proof. Consider the function $f_y(x) := f(x-y)$. We shall show below that the Fourier coefficients of $f$ and $f_y$ are related as follows:

$\hat f_y(z) = \hat f(z)\cdot\overline{\chi_y(z)}$ .  (15)

The proof is now obtained by the following calculation:

$\hat F_{z,z} = d^{-2}\cdot\sum_{x,y\in D} f(x-y)\cdot\overline{\chi_z(x)}\cdot\chi_z(y) = d^{-1}\cdot\sum_{y\in D}\Big(d^{-1}\sum_{x\in D} f_y(x)\,\overline{\chi_z(x)}\Big)\cdot\chi_z(y) = d^{-1}\cdot\sum_{y\in D}\hat f_y(z)\cdot\chi_z(y) \stackrel{(15)}{=} \hat f(z)\cdot d^{-1}\cdot\sum_{y\in D}\overline{\chi_y(z)}\,\chi_z(y) = \hat f(z)$ ,

where the last equation holds because $\overline{\chi_y(z)}\,\chi_z(y) = \overline{\chi_z(y)}\,\chi_z(y) = 1$. The following calculation verifies (15):

$\hat f_y(z) = d^{-1}\cdot\sum_{x\in D} f(x-y)\cdot\overline{\chi_z(x)} = d^{-1}\cdot\sum_{x\in D}\sum_{w\in D}\hat f(w)\cdot\chi_w(x-y)\cdot\overline{\chi_z(x)} = \sum_{w\in D}\hat f(w)\cdot\overline{\chi_w(y)}\cdot d^{-1}\sum_{x\in D}\chi_w(x)\,\overline{\chi_z(x)} = \sum_{w\in D}\hat f(w)\cdot\overline{\chi_w(y)}\cdot\delta(w,z) = \hat f(z)\cdot\overline{\chi_y(z)}$ .

Corollary 9 and Lemma 10 yield the following result (which might be known, but we are not aware of an appropriate pointer to the literature):

Corollary 11. Let $F$ denote the matrix with entries $F_{x,y} = f(x-y)$. Then the spectrum of (complex) eigenvalues of $d^{-1}\cdot F$ coincides with the spectrum of (complex) Fourier-coefficients of $f$.

Consider the sign matrix $M^{f,\mathcal{G}_{rot}}$. From (8) and the definition of $\mathcal{G}_{rot}$, we conclude that

$M^{f,\mathcal{G}_{rot}}_{x,y} = f(x+y)$ .

It follows that $M^{f,\mathcal{G}_{rot}}$ is a symmetric matrix. If $f$ is real-valued, then $M^{f,\mathcal{G}_{rot}}$ has real eigenvalues. Note that $M^{f,\mathcal{G}_{rot}}$ coincides with the matrix $F_{x,y} = f(x-y)$ up to a permutation of columns (where the column indexed $y$ is exchanged with the column indexed $-y$). Since the spectrum of eigenvalues (or singular values, resp.) of a matrix is left invariant under a permutation of columns, we obtain the following:

Corollary 12. Let $f(x-y)$ be real-valued, and let $F$ be the matrix with entries $F_{x,y} = f(x-y)$. Then, the following holds:
1. $F$ coincides with the symmetric matrix $M^{f,\mathcal{G}_{rot}}$ up to a permutation of columns.
2. The spectrum of eigenvalues of $d^{-1}\cdot F$ coincides with the spectrum of (real) eigenvalues of $d^{-1}\cdot M^{f,\mathcal{G}_{rot}}$ and with the spectrum of Fourier-coefficients of $f$.

4.2 Margin and Dimension Bounds for Rotation-invariant Kernels

For every function $f: D\to\{-1,1\}$, $\bar\mu_{rot}(f) := \bar\mu_{\mathcal{G}_{rot}}(f)$ denotes the largest possible average margin that can be achieved by a linear arrangement for $f$ resulting from a rotation-invariant kernel. As for the smallest possible dimension, the parameter $d_{rot}(f)$ is understood analogously.

Corollary 13. Let $D$ be a finite Abelian group of size $d$. Every function $f: D\to\{-1,1\}$ satisfies

$\bar\mu_{rot}(f) \le \|\hat f\|_\infty$ .  (16)

In other words, no large margin classifier that employs a rotation-invariant kernel can achieve an average margin for $f$ which exceeds $\|\hat f\|_\infty$.

Proof. According to Theorem 5,

$\bar\mu_{rot}(f) \le \dfrac{\|M^{f,\mathcal{G}_{rot}}\|}{\sqrt{|D|\cdot|\mathcal{G}_{rot}|}} = \dfrac{\|M^{f,\mathcal{G}_{rot}}\|}{d}$ .

We conclude from Corollary 12 that $\|M^{f,\mathcal{G}_{rot}}\| = \|F\| = d\cdot\|\hat f\|_\infty$, which leads us to inequality (16).

Corollary 14. Let $d_{rot}(f)$ denote the smallest dimension of a feature space associated with a rotation-invariant kernel $K$ that allows for a correct representation of $f$. Then, $d_{rot}(f) \ge \|\hat f\|_\infty^{-1}$.

Proof. According to Corollary 6, the corresponding feature space for the kernel must have dimension at least $\sqrt{|D|\cdot|\mathcal{G}_{rot}|}\,/\,\|M^{f,\mathcal{G}_{rot}}\| = d\,/\,\|M^{f,\mathcal{G}_{rot}}\|$. According to Corollary 12, the latter expression evaluates to $\|\hat f\|_\infty^{-1}$.
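The following numerical companion (our own check, not part of the paper) illustrates Corollaries 11, 13 and 14 for a Boolean-valued $f$ on $\mathbb{Z}_d$: the eigenvalues of $d^{-1}F$ with $F_{x,y} = f(x-y)$ are exactly the Fourier coefficients of $f$, so the resulting margin and dimension bounds can be read off a fast Fourier transform.

```python
import numpy as np

# For f: Z_d -> {-1,1}, the eigenvalues of d^-1 * F with F_{x,y} = f(x - y)
# are the Fourier coefficients of f; hence the best average margin of a
# rotation-invariant kernel is at most ||f^||_inf and the feature-space
# dimension is at least 1 / ||f^||_inf.
rng = np.random.default_rng(3)
d = 16
f = rng.choice([-1, 1], size=d)

F = np.array([[f[(x - y) % d] for y in range(d)] for x in range(d)])
eigs = np.linalg.eigvals(F / d)
f_hat = np.fft.fft(f) / d                         # hat f(z) = d^-1 sum_x f(x) omega^{-xz}

# every Fourier coefficient occurs among the eigenvalues of d^-1 * F
assert all(np.abs(eigs - w).min() < 1e-9 for w in f_hat)

print("margin bound (Cor. 13):   ", np.abs(f_hat).max())
print("dimension bound (Cor. 14):", 1 / np.abs(f_hat).max())
```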
Corollary 14 can be strengthened slightly, in analogy to Corollary 7:

Corollary 15. Let $\hat f_i$ denote the $i$-th Fourier-coefficient of $f$, where $|\hat f_1|,\dots,|\hat f_d|$ are in decreasing order. Then, $d_{rot}(f)$ satisfies the following lower bound:

$d_{rot}(f)\cdot\sum_{i=1}^{d_{rot}(f)} |\hat f_i|^2 \ \ge\ 1$ .

Proof. From (9), we obtain $d_{rot}(f)\cdot\sum_{i=1}^{d_{rot}(f)}\sigma_i^2 \ge 1$, where $\sigma_i$ denotes the $i$-th largest singular value of $d^{-1}\cdot M^{f,\mathcal{G}_{rot}}$. We conclude from Corollary 12 that $\sigma_i$ coincides with $|\hat f_i|$.

5 Reflection-invariant Functions

In this section, we consider real-valued functions only. A distributed function $f(x,y)$ over $\{-1,1\}^n$ is said to be reflection-invariant if, for all $x, y, a \in \{-1,1\}^n$, the following holds:

$f(x\circ a,\ y\circ a) = f(x,y)$ .  (17)

Note that reflection-invariance corresponds to rotation-invariance with $(\mathbb{Z}_2^n, +)$ as the underlying (additive) Abelian group or, equivalently, with $(\{-1,1\}^n, \circ)$ as the underlying (multiplicative) Abelian group. This is because the subgroup $\mathcal{G}_{rot}$ of $S(D)$ that we have used for rotation-invariant distributed functions collapses for $D = \{-1,1\}^n$ (with a multiplicative group structure) to the following subgroup of $S(\{-1,1\}^n)$:

$\mathcal{G}_{ref} = \{x\mapsto x\circ a : a\in\{-1,1\}^n\}$ .

Thus, reflection-invariant functions inherit all closure properties that hold, in general, for $\mathcal{G}$-invariant distributed functions (see the Pointwise Closure Property and Lemma 1 in Section 3):

Corollary 16.
1. The pointwise limit of reflection-invariant functions is a reflection-invariant function. Furthermore, if $f_1,\dots,f_d$ are reflection-invariant functions and $g:\mathbb{R}^d\to\mathbb{R}$ is an arbitrary function, then $g(f_1(x,y),\dots,f_d(x,y))$ is reflection-invariant too.
2. Reflection-invariant distributed functions over $\{-1,1\}^n$ are closed under the usual matrix product and under the tensor-product of matrices.

Furthermore, reflection-invariant functions inherit all properties that hold, in general, for distributed functions over a finite Abelian group:
- A reflection-invariant function $f(x,y)$ can be decomposed according to (2). Since $D = \{-1,1\}^n$, the character $\chi_z$ coincides with the parity function induced by $z$, i.e., $\chi_z(x) = \prod_{i:\,z_i=-1} x_i$.
- The matrix $\hat F$ whose entries are the Fourier coefficients of $f$ satisfies (13), where $H$ is the matrix from (4). Since $D = \{-1,1\}^n$, $H$ equals the well-known $(2^n\times 2^n)$ Walsh-Hadamard matrix.

Distributed functions $f(x,y)$ over $\mathbb{R}^n$ that satisfy (17) for all $x, y\in\mathbb{R}^n$ and every $a\in\{-1,1\}^n$ are said to be reflection-invariant in the Euclidean space. Here are some examples (with some overlap to our exemplification of rotation-invariant functions in Section 4):
- A distributed function of the form $f(x,y) = g(x\circ y)$ is reflection-invariant (in the Euclidean space provided that the domain is $\mathbb{R}^n$):

$g((x\circ a)\circ(y\circ a)) = g(x\circ y\circ(a\circ a)) = g(x\circ y)$ .

Conversely, any reflection-invariant function $f(x,y)$ (over domain $\{-1,1\}^n$) can be written in this form by setting $g(x) := f(x,e)$, because reflection-invariance implies that $f(x,y) = f(x\circ y,\ y\circ y) = f(x\circ y,\ e) = g(x\circ y)$.
- Because of the obvious identity $\chi_z(x\circ y) = \chi_z(x)\cdot\chi_z(y)$, the distributed function $\chi_z(x)\cdot\chi_z(y)$ is reflection-invariant too.
- The metric $L_p(x-y) = \left(\sum_{i=1}^n |x_i-y_i|^p\right)^{1/p}$ induced by the $L_p$-norm is clearly reflection-invariant in the Euclidean space.

In Section 6, we shall see that many popular kernel functions happen to be reflection-invariant. The fact that $f(x,y) = g(x\circ y)$ is a reflection-invariant function can be restated as follows: any function $f(x,y)$ that can be cast as a function in $x_1\cdot y_1,\dots,x_n\cdot y_n$ is reflection-invariant. Similarly, any function $f(x,y)$ that can be cast as a function in $L_p(x-y)$ (or, more generally, in $|x_1-y_1|,\dots,|x_n-y_n|$) is reflection-invariant.
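The Boolean specialization above is easy to spell out in code. The following sketch (our own) builds the characters of $\{-1,1\}^n$ as parity functions, checks that the resulting matrix $H$ is the symmetric Walsh-Hadamard matrix with $H\cdot H = 2^n\cdot I$, and verifies the identity $\chi_z(x\circ y) = \chi_z(x)\cdot\chi_z(y)$ on an example.

```python
import numpy as np
from itertools import product

# Characters over D = {-1,1}^n as parity functions: chi_z(x) = prod_{i: z_i=-1} x_i.
# The matrix H_{x,z} = chi_z(x) is the 2^n x 2^n Walsh-Hadamard matrix.
n = 3
D = np.array(list(product([-1, 1], repeat=n)))
d = len(D)

def chi(z, x):
    return np.prod(x[z == -1])                    # parity of the bits selected by z

H = np.array([[chi(z, x) for z in D] for x in D])
assert np.array_equal(H, H.T)                     # H is symmetric
assert np.allclose(H @ H, d * np.eye(d))          # H * H = 2^n * I, cf. Section 2.2

# chi_z(x o y) = chi_z(x) * chi_z(y): the characters are reflection-invariant
x, y, z = D[3], D[5], D[6]
assert chi(z, x * y) == chi(z, x) * chi(z, y)
```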
6 Reflection-invariant Kernels

In this section, we consider kernel functions $K(x,y)$ over the Boolean or over the Euclidean domain. In other words, $K(x,y)$ is a distributed function over $\{-1,1\}^n$ or over $\mathbb{R}^n$ with the additional property that every finite principal sub-matrix of $K$ is symmetric and positive semidefinite. In Section 6.1, we demonstrate that the family of reflection-invariant kernels is quite rich and contains many popular kernels. In Section 6.2, we derive margin and dimension bounds for reflection-invariant kernels.

6.1 Examples and Closure Properties

Let us start with some examples. The following (quite popular) kernels (over $\mathbb{R}^n$ except for the DNF-Kernel, which has a Boolean domain) can be cast as functions in $x_1\cdot y_1,\dots,x_n\cdot y_n$ or as functions in $\|x-y\|_2^2$ and are therefore reflection-invariant:

Polynomial Kernels: $K(x,y) = p(\langle x,y\rangle)$ for an arbitrary polynomial $p$ with positive coefficients.

All-subsets Kernel: $K(x,y) = \prod_{i=1}^n (1 + x_i y_i)$.

ANOVA Kernel: Let $1\le s\le n$ and define $K_s(x,y) = \sum_{1\le i_1<\cdots<i_s\le n}\ \prod_{j=1}^s x_{i_j} y_{i_j}$.

DNF-Kernel: $K(x,y) = -1 + \prod_{i=1}^n \dfrac{x_i y_i + 3}{2}$.

Exponential Kernel: $K(x,y) = \exp\!\left(\langle x,y\rangle/\sigma^2\right)$ for a parameter $\sigma > 0$.

Gaussian Kernel: $K(x,y) = \exp\!\left(-\|x-y\|_2^2/(2\sigma^2)\right)$ for a parameter $\sigma > 0$.

These kernels have the usual nice properties like being efficiently evaluable although the number of (implicitly represented) features is exponentially large (or even infinite). Polynomial, Exponential, and Gaussian Kernels (first used in [2]) are found in almost any basic text-book that is relevant to the subject (e.g. [3]). The All-subsets Kernel is found in [18], and the ANOVA Kernel is found in [19]. As for the latter two kernels, see also [17]. The DNF-Kernel has been proposed in [16]. (In [16], the kernel is defined over the Boolean domain $\{0,1\}^n$; our formula above is obtained from the formula in [16] by plugging in the affine transformation that identifies $1$ with $-1$ and $0$ with $1$. A similar remark applies to the Monotone DNF-Kernel discussed at the end of this section.) The reader interested in more information about these (and other) kernels may consult the relevant literature. Here, we simply point to the fact that all kernels mentioned above are reflection-invariant.

We move on and consider the possibility of making new reflection-invariant kernels from kernels that are already known to be reflection-invariant. To this end, we briefly call into mind some basic closure properties of kernels:

Lemma 17. Let $K, K_1, K_2$ be kernels, and let $c > 0$ be a positive constant. Then, the distributed functions

$K_1(x,y) + K_2(x,y)$ , $c\cdot K(x,y)$ , $K_1(x,y)\cdot K_2(x,y)$ , $(K_1\otimes K_2)[(u,x),(v,y)]$

are kernels too. Moreover, the pointwise limit of kernels yields a kernel.

The proof of Lemma 17 can be looked up in [3], for example.

Corollary 18. If $K_1,\dots,K_d$ are kernels and $P:\mathbb{R}^d\to\mathbb{R}$ is a polynomial (or a converging power series) with positive coefficients, then $P(K_1(x,y),\dots,K_d(x,y))$ is a kernel too.

Diffusion kernels provide further examples; they are built from an underlying symmetric "similarity matrix" $B$ (see [11, 9]):

Exponential Diffusion Kernel: For $\lambda > 0$, define $K = \exp(\lambda\cdot B) = \sum_{k\ge 0}\frac{\lambda^k}{k!}\cdot B^k$.

von Neumann Diffusion Kernel: For $0 < \lambda < \|B\|^{-1}$, define $K = (I - \lambda\cdot B)^{-1} = \sum_{k\ge 0}\lambda^k\cdot B^k$.

It follows from the closure properties of reflection-invariant functions that both diffusion kernels would inherit reflection-invariance from the underlying similarity matrix $B$. The family of reflection-invariant kernels is quite rich.
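Reflection-invariance of the kernels listed above is easy to confirm empirically. The following sketch (parameter values are our own choices) implements a few of them and checks $K(x\circ a,\ y\circ a) = K(x,y)$ for random sign vectors $a$, exploiting that each kernel depends only on $x_1 y_1,\dots,x_n y_n$ or on $\|x-y\|_2^2$.

```python
import numpy as np

# Check numerically that these kernels are reflection-invariant:
# K(x*a, y*a) = K(x, y) for every sign vector a in {-1,1}^n.
rng = np.random.default_rng(4)
n = 5

def poly(x, y, degree=3):
    return (1 + x @ y) ** degree                  # a polynomial kernel with positive coefficients

def all_subsets(x, y):
    return np.prod(1 + x * y)

def gaussian(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def dnf(x, y):
    return -1 + np.prod((x * y + 3) / 2)          # DNF-Kernel (Boolean domain)

x, y = rng.standard_normal(n), rng.standard_normal(n)
xb, yb = rng.choice([-1, 1], size=n), rng.choice([-1, 1], size=n)

for _ in range(100):
    a = rng.choice([-1, 1], size=n)
    assert np.isclose(poly(x * a, y * a), poly(x, y))
    assert np.isclose(all_subsets(x * a, y * a), all_subsets(x, y))
    assert np.isclose(gaussian(x * a, y * a), gaussian(x, y))
    assert np.isclose(dnf(xb * a, yb * a), dnf(xb, yb))
```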
But here are two kernels (the first one from [16], and the second one from [12]) which are counterexamples, i.e., popular kernels that fail to be reflection-invariant:

Monotone DNF-Kernel: $K(x,y) = -1 + 2^{-2n}\prod_{j=1}^n (x_j y_j - x_j - y_j + 5)$.

Spectrum Kernel: Here, $x, y\in\{-1,1\}^n$ are considered as binary strings. For $1\le p\le n$ and for every substring $v\in\{-1,1\}^p$,

$\Phi^p_v(x) = |\{(u,w) : x = u v w\}|$

counts how often $v$ occurs as a substring of $x$. The $p$-Spectrum Kernel is then given by

$K(x,y) = \sum_{v\in\{-1,1\}^p}\Phi^p_v(x)\cdot\Phi^p_v(y)$ .

It is easy to see that both kernels are not reflection-invariant. More generally, string kernels (measuring similarity between strings) often violate reflection-invariance.

6.2 Margin and Dimension Bounds for Reflection-invariant Kernels

For every function $f:\{-1,1\}^n\to\{-1,1\}$, $\bar\mu_{ref}(f) := \bar\mu_{\mathcal{G}_{ref}}(f)$ denotes the largest possible average margin that can be achieved by a linear arrangement for $f$ resulting from a reflection-invariant kernel. Because reflection-invariance is a special case of rotation-invariance, the following result immediately follows from Corollaries 13, 14, and 15:

Corollary 19.
1. Every Boolean function $f$ satisfies $\bar\mu_{ref}(f) \le \|\hat f\|_\infty$. In other words, no large margin classifier that employs a reflection-invariant kernel can achieve an average margin for $f$ which exceeds $\|\hat f\|_\infty$.
2. Let $d_{ref}(f)$ denote the smallest dimension of a feature space associated with a reflection-invariant kernel $K$ that allows for a correct representation of $f$. Then, $d_{ref}(f) \ge \|\hat f\|_\infty^{-1}$.
3. Let $\hat f_i$ denote the $i$-th Fourier-coefficient of $f$, where $|\hat f_1|,\dots,|\hat f_{2^n}|$ are in decreasing order. Then, $d_{ref}(f)$ satisfies the following lower bound: $d_{ref}(f)\cdot\sum_{i=1}^{d_{ref}(f)}\hat f_i^2 \ \ge\ 1$.
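For a concrete Boolean function, the bounds of Corollary 19 boil down to one Walsh-Hadamard transform. The following worked example (the function choice is ours) uses the "inner product" function on $n = 4$ bits, a bent function whose Fourier coefficients all have absolute value $2^{-n/2}$, so the bounds are as strong as they can get.

```python
import numpy as np
from itertools import product

# A worked example for Corollary 19: compute ||f^||_inf for the inner-product
# function f(x) = (-1)^(b1*b2 + b3*b4) with x_i = (-1)^(b_i), and read off the
# margin and dimension bounds.
n = 4
D = np.array(list(product([-1, 1], repeat=n)))

def chi(z, x):
    return np.prod(x[z == -1])                    # parity character, cf. Section 5

bits = (1 - D) // 2                               # back to the {0,1} encoding
f = (-1) ** (bits[:, 0] * bits[:, 1] + bits[:, 2] * bits[:, 3])

f_hat = np.array([np.mean(f * np.array([chi(z, x) for x in D])) for z in D])
max_coeff = np.abs(f_hat).max()

print("||f^||_inf       =", max_coeff)            # 0.25 for this bent function
print("margin bound    <=", max_coeff)            # Corollary 19.1
print("dimension bound >=", 1 / max_coeff)        # Corollary 19.2
```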
7 Conclusions and Open Problems

We start with some remarks which offer a possible interpretation of our results. Finally, some open problems are mentioned.

7.1 Discussion of our Results

Ideally, the invariance-properties of a kernel reflect symmetries in the data. For example, assume that there exists a set of transformations, say $T$, so that, for every instance $x\in D$ and every transformation $t\in T$, the label assigned to $x$ by target function $f$ equals the label assigned to $t(x)$ by $f$. Then, it looks desirable to apply a kernel that is invariant under the transformations from $T$. It would be surprising if our results implied that such kernels (that sort of perfectly model the symmetries in the data) would inherently lead to small margins or high-dimensional feature spaces. It is, however, easy to argue that (as expected) the contrary is true and our margin and dimension bounds trivialize whenever the invariance of the kernel perfectly matches the symmetries in the data. To see this, consider again (compare with the introduction) the "super-kernel"

$K(x,y) = \begin{cases} +1 & \text{if } f(x) = f(y) \\ -1 & \text{otherwise} \end{cases}$

that allows for a 1-dimensional halfspace representation of $f$ with margin 1, and note that $K$ actually is invariant under all transformations from $T$. Thus, no upper margin bound that holds uniformly for all $T$-invariant kernels can be smaller than 1. Similarly, no lower dimension bound can be larger than 1. Note that this is no contradiction to the main results in this paper because the family $\{f_t : t\in T\}$ of functions $f_t(x) = f(t(x))$ collapses to the singleton $\{f\}$. Thus Forster's margin and dimension bounds applied to this family do not lead to non-trivial values.

Viewed from this perspective, our results can be interpreted as follows: one should not use a kernel that is invariant under a set $T$ of transformations if $T$ does not reflect symmetries in the data. The kernel becomes very poor especially when the family $\{f_t : t\in T\}$ contains much "orthogonality" (which is sort of the opposite of collapsing to a singleton or to a family of highly correlated functions) because Forster's bounds, applied to pairwise (almost) orthogonal functions, are extremely strong. This interpretation makes clear that our results are not particularly surprising but, on the other hand, quantify (in terms of small margin and large dimension bounds) in a meaningful and rigorous fashion an existing mismatch between a kernel and the (missing or existing) symmetries in the data.

7.2 Open Problems

Haasdonk and Burkhardt [8] consider two notions of invariance: "simultaneous invariance" and "total invariance". Simultaneous invariance very much corresponds to the notion of invariance that we discussed in Section 3, so that our margin and dimension bounds apply. Total invariance is a stronger notion, so that our bounds apply more than ever. But the obvious challenge is to find stronger margin and dimension bounds for totally invariant kernels.

The basic idea behind our paper is roughly as follows. For a family of kernels (e.g., polynomial kernels), we argue that the existence of a "good representation" for a particular target function implies the existence of a "good representation" for a whole family of target functions (so that classical margin and dimension bounds can be brought into play). We think that invariance under a group operation (the notion considered in this paper) is just the first obvious thing one should consider. We would like to develop more versatile techniques that, while following the same basic idea, lead to strong margin and dimension bounds for a wider class of kernels.

References

[1] Shai Ben-David, Nadav Eiron, and Hans Ulrich Simon. Limitations of learning via embeddings in Euclidean half-spaces. Journal of Machine Learning Research, 3:441-461, 2002.
[2] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, 1992.
[3] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[4] Jürgen Forster. A linear lower bound on the unbounded error probabilistic communication complexity. Journal of Computer and System Sciences, 65(4):612-625, 2002.
[5] Jürgen Forster, Matthias Krause, Satyanarayana V. Lokam, Rustam Mubarakzjanov, Niels Schmitt, and Hans Ulrich Simon. Relations between communication complexity, linear arrangements, and computational complexity. In Proceedings of the 21st Annual Conference on the Foundations of Software Technology and Theoretical Computer Science, pages 171-182, 2001.
[6] Jürgen Forster, Niels Schmitt, Hans Ulrich Simon, and Thorsten Suttorp. Estimating the optimal margins of embeddings in Euclidean half spaces. Machine Learning, 51(3):263-281, 2003.
[7] Jürgen Forster and Hans Ulrich Simon. On the smallest possible dimension and the largest possible margin of linear arrangements representing given concept classes. Theoretical Computer Science, 350(1):40-48, 2006.
[8] Bernard Haasdonk and Hans Burkhardt. Invariant kernel functions for pattern analysis and machine learning. Machine Learning, 68(1):35-61, 2007.
[9] Jaz S. Kandola, John Shawe-Taylor, and Nello Cristianini. Learning semantic similarity. In Advances in Neural Information Processing Systems 15, pages 657-664. MIT Press, 2003.
[10] Eike Kiltz and Hans Ulrich Simon. Threshold circuit lower bounds on cryptographic functions. Journal of Computer and System Sciences, 71(2):185-212, 2005.
[11] Risi I. Kondor and John D. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Proceedings of the 19th International Conference on Machine Learning, pages 315-322, 2002.
[12] Christina Leslie, Eleazar Eskin, and William S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Pacific Symposium on Biocomputing, pages 564-575, 2002.
[13] Nathan Linial, Shahar Mendelson, Gideon Schechtman, and Adi Shraibman. Complexity measures of sign matrices. Combinatorica. To appear.
[14] Nati Linial and Adi Shraibman. Lower bounds in communication complexity based on factorization norms. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing, pages 699-708, 2007.
[15] Alexander A. Razborov and Alexander A. Sherstov. The sign-rank of AC^0. Personal communication.
[16] Ken Sadohara. Learning of Boolean functions using support vector machines. In Proceedings of the 12th International Conference on Algorithmic Learning Theory, pages 106-118, 2001.
[17] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[18] Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4:773-818, 2003.
[19] Vladimir Vapnik, Christopher J. C. Burges, Bernhard Schölkopf, and R. Lyons. A new method for constructing artificial neural networks. Interim ARPA Technical Report, AT&T Bell Laboratories, 1995.