The Skew Spectrum of Graphs

Risi Kondor (risi@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, UCL, 17 Queen Square, London, WC1N 3AR, U.K.

Karsten M. Borgwardt (kmb51@cam.ac.uk)
Department of Engineering, University of Cambridge, Trumpington Street, Cambridge, CB2 1PZ, U.K.

Abstract

The central issue in representing graph-structured data instances in learning algorithms is designing features which are invariant to permuting the numbering of the vertices. We present a new system of invariant graph features which we call the skew spectrum of graphs. The skew spectrum is based on mapping the adjacency matrix of any (weighted, directed, unlabeled) graph to a function on the symmetric group and computing bispectral invariants. The reduced form of the skew spectrum is computable in O(n³) time, and experiments show that on several benchmark datasets it can outperform state of the art graph kernels.

1. Introduction

After real valued vectors and strings, the third most fundamental type of data instance in machine learning are graphs. In addition to application domains such as bioinformatics (Sharan & Ideker, 2006), chemoinformatics (Bonchev & Rouvray, 1991), social networks (Kumar et al., 2006), etc., where information is presented as a graph from the start, graphs are also used to capture the relationships between the different parts of segmented images in computer vision (Harchaoui & Bach, 2007), and to capture grammatical structure in language (Collins & Duffy, 2002). Graphs may be directed or undirected, weighted or unweighted, and their vertices may be labeled, partially labeled or unlabeled. In each of these cases, the challenge is to represent graphs in a way that preserves their structure, but is insensitive to spurious transformations, such as changing the (arbitrary) numbering of their vertices.

Given a graph G, the two main lines of research that have emerged to address the above problem focus respectively on (a) designing an explicit feature mapping G ↦ (q₁, q₂, ..., qₖ); and (b) designing a kernel k(G₁, G₂). Proponents of the first approach exploit global invariant properties of G, such as the eigenvalues of its graph Laplacian, or local invariant properties, such as the number of occurrences in G of a library of small subgraphs. In contrast, proponents of the kernel approach use various intuitions about simultaneous random walks and diffusion on product graphs (Gärtner, 2003).

The new method that we present in this paper belongs in the first of the above two categories, but is distinguished from prior work (with the exception of (Shawe-Taylor, 1993)) by its algebraic character. In this regard, it is related to the recent line of papers (Kondor et al., 2007; Huang et al., 2008; Kondor, 2007a) introducing concepts from non-commutative harmonic analysis to machine learning.

The mathematical foundations of our work are Kakarala's seminal results on the bispectra of functions on compact groups (Kakarala, 1993; Kakarala, 1992), and the recent discovery of a unitarily equivalent, but computationally more attractive set of invariants called the skew spectrum (Kondor, 2007b). We show how these general theories can be harnessed to construct graph invariants, and examine in detail their computational properties. Experiments on standard datasets of chemical compounds show that the skew spectrum of graphs is competitive with the state of the art in graph features, and in some cases outperforms all other methods.

A major advantage of the skew spectrum is that since it is an explicit feature mapping, it can be applied as a preprocessing step, and hence scales linearly with the number of examples. The computational complexity of computing the (reduced) skew spectrum of a single graph of n nodes scales with n³. Uniquely amongst the graph invariants used in machine learning, the skew spectrum has a fixed number of scalar components (85 for the complete skew spectrum and 49 for its reduced version), resulting in a very compact representation. This does not stop the skew spectrum from remaining competitive both in speed and representational accuracy up to about n = 300. For those technical details of the skew spectrum which could not be squeezed into this conference paper we refer the reader to the accompanying report (Kondor, 2008).

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

2. Graph Invariants

In this paper G will be a directed weighted graph of n vertices. We represent G by its adjacency matrix A ∈ ℝⁿˣⁿ, where [A]_{i,j} ∈ ℝ is the weight of the edge from vertex i to vertex j. Unweighted graphs are special cases satisfying [A]_{i,j} ∈ {0, 1}, while undirected graphs are special cases satisfying A = A⊤. We assume that A has no self-loops, i.e., [A]_{i,i} = 0 for i = 1, 2, ..., n.

Recall that a permutation of n objects is a bijective map σ: {1, 2, ..., n} → {1, 2, ..., n}. Permuting the labels on the vertices of G by σ results in a new adjacency matrix A^σ with entries

  [A^σ]_{σ(i),σ(j)} = [A]_{i,j},    (1)

but A and A^σ both represent the same graph G. A function q(A) is called a graph invariant if it is invariant to relabelings of this kind, i.e., if q(A) = q(A^σ) for any permutation σ. Our objective is to construct a system (q₁, q₂, ..., qₖ) of graph invariants which capture as much information about G as possible, yet can be computed economically.

2.1. Reduction to Left-translation Invariance

Our approach is based on the fact that permutations form a group. This means that if for a pair of permutations σ₁ and σ₂ we define their product σ₃ = σ₂σ₁ by composition of maps, i.e., σ₃(i) = σ₂(σ₁(i)), then the following axioms are satisfied:

1. for any two permutations σ₁ and σ₂, the product σ₂σ₁ is also a permutation;
2. for any three permutations σ₁, σ₂ and σ₃, we have σ₁(σ₂σ₃) = (σ₁σ₂)σ₃;
3. the identity e(i) = i is a permutation;
4. for any permutation σ, there is an inverse permutation σ⁻¹ satisfying σσ⁻¹ = σ⁻¹σ = e.

The group of permutations of n objects is called the symmetric group over n letters and is denoted Sₙ.

To find graph invariants we begin by mapping A to a function f: Sₙ → ℝ, defined as

  f(σ) = A_{σ(n),σ(n−1)}.    (2)

Note that this is a very special type of function on Sₙ in that it is constant on each block of permutations

  S_{i,j} = { σ ∈ Sₙ | σ(n) = i, σ(n−1) = j }.    (3)

For k < n, identifying Sₖ with the subgroup of permutations permuting 1, 2, ..., k amongst themselves and leaving k+1, ..., n fixed, the above blocks, of which there are n(n−1) in total, each have the form σS_{n−2} = { στ | τ ∈ S_{n−2} }, and are called left S_{n−2}-cosets.

Defining f as in (2) ensures that under relabeling it transforms in a transparent fashion. Specifically, if f′ is the function corresponding to A^τ, then

  f′(σ) = [A^τ]_{σ(n),σ(n−1)} = [A]_{τ⁻¹(σ(n)),τ⁻¹(σ(n−1))} = f(τ⁻¹σ).    (4)

In general, a function g: Sₙ → ℝ related to f by g(σ) = f(τ⁻¹σ) is called the left-translate of f by τ, and is denoted f^τ. Equation (4) tells us that f′ = f^τ, reducing the problem of constructing graph invariants to finding left-translation invariant features of functions on Sₙ.

2.2. Invariant Matrices

Now consider the weighted sum of matrices

  f_ρ = Σ_{σ∈Sₙ} f(σ) ρ(σ),    (5)

where ρ(σ) is a system of complex valued matrices satisfying ρ(σ₂σ₁) = ρ(σ₂)ρ(σ₁) for all σ₁, σ₂ ∈ Sₙ, as well as the unitarity condition ρ(σ⁻¹) = (ρ(σ))⁻¹ = ρ(σ)†. Such systems of matrices are called unitary matrix representations of Sₙ. Changing variables from σ to μ = τ⁻¹σ shows that

  (f^τ)_ρ = Σ_{σ∈Sₙ} f(τ⁻¹σ) ρ(σ) = Σ_{μ∈Sₙ} f(μ) ρ(τμ) = ρ(τ) f_ρ,

which suggests that (5) is a good starting point for constructing left-translation invariants of f.
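Equations (1), (2) and (4) can be verified by brute force for a small graph. The following sketch (plain Python with 0-indexed permutations; the random test matrix is illustrative, not from the paper) checks that relabeling A by τ translates f on the left:

```python
import itertools
import random

n = 4
random.seed(0)
# A small weighted, directed adjacency matrix with no self-loops (test data).
A = [[0.0 if i == j else random.random() for j in range(n)] for i in range(n)]

def relabel(A, tau):
    """A^tau with [A^tau][tau(i)][tau(j)] = [A][i][j], as in (1)."""
    m = len(A)
    B = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            B[tau[i]][tau[j]] = A[i][j]
    return B

def f_of(A):
    """f(sigma) = A[sigma(n)][sigma(n-1)], as in (2) (0-indexed: images of the last two points)."""
    m = len(A)
    return {s: A[s[m - 1]][s[m - 2]] for s in itertools.permutations(range(m))}

def compose(a, b):
    """(a b)(i) = a(b(i))."""
    return tuple(a[b[i]] for i in range(len(a)))

def inverse(a):
    inv = [0] * len(a)
    for i, x in enumerate(a):
        inv[x] = i
    return tuple(inv)

tau = (2, 0, 3, 1)                  # an arbitrary relabeling
f = f_of(A)
f_prime = f_of(relabel(A, tau))     # the function corresponding to A^tau
# Eq. (4): f'(sigma) = f(tau^{-1} sigma), i.e. f' is the left-translate of f by tau.
assert all(abs(f_prime[s] - f[compose(inverse(tau), s)]) < 1e-12
           for s in itertools.permutations(range(n)))
```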
For example, the matrix a_ρ = f_ρ† · f_ρ is invariant, because

  a_ρ^τ = (f^τ)_ρ† (f^τ)_ρ = (ρ(τ) f_ρ)† (ρ(τ) f_ρ) = f_ρ† ρ(τ)† ρ(τ) f_ρ = f_ρ† · f_ρ = a_ρ.    (6)

The question we face is how to construct such invariants in a systematic way with minimum redundancy, yet maximum representational power.

3. Irreps and the Fourier Transform

It is easy to see that if ρ₁: Sₙ → ℂ^{d×d} is a unitary representation of Sₙ, and T is any d×d unitary matrix, then ρ₂(σ) = T† ρ₁(σ) T is also a unitary representation. Such pairs of representations are said to be equivalent. Once we have computed (5) with ρ = ρ₁, computing it again with ρ = ρ₂ will not lead to additional invariants, since f_{ρ₂} = T† f_{ρ₁} T.

Another potential source of redundancy is reducibility. A representation ρ is said to be reducible if for some unitary T it splits in the form

  ρ(σ) = T† (ρ₁(σ) ⊕ ρ₂(σ)) T,   σ ∈ Sₙ,

into a direct sum of smaller representations ρ₁ and ρ₂. Once again, f_ρ does not supply any information on top of f_{ρ₁} and f_{ρ₂}, because f_ρ = T† (f_{ρ₁} ⊕ f_{ρ₂}) T.

To avoid these redundancies we will use a complete set of inequivalent irreducible unitary representations (irreps for short). Such a set we denote by R. The corresponding set of matrices

  f_ρ = Σ_{σ∈Sₙ} f(σ) ρ(σ),   ρ ∈ R,    (7)

is called the Fourier transform of f, and it provides the basis for generalizing harmonic analysis to non-commutative groups (Diaconis, 1988; Rockmore, 1997). Just as the classical Fourier transform, F: f ↦ (f_ρ)_{ρ∈R} satisfies a generalized form of the translation and convolution theorems. What is most crucial for our present purposes, however, is that (given the appropriate inner products) F is unitary, and therefore one-to-one: hence, no information is lost in going from f to the set of matrices (f_ρ)_{ρ∈R}.

Several different systems of irreps for Sₙ are described in the literature (James & Kerber, 1981). In the interests of saving space, we only describe their general scheme, without going into the details of how to compute the actual representation matrices. In all the major representation schemes the individual irreps ρ ∈ R are indexed by Young diagrams, which are n boxes arranged in consecutive left-aligned rows satisfying the condition that no row overhangs the row above it. For example, the diagram with consecutive rows of 5, 2 and 1 boxes,

  ■ ■ ■ ■ ■
  ■ ■
  ■                                          (8)

is a valid Young diagram for n = 8. Young diagrams can also be described by listing the number of boxes in each row; for example, the above diagram is λ = (5, 2, 1). For concreteness, when we need to draw Young diagrams we will always depict them as if n = 8.

Bijectively filling the boxes of a Young diagram with the numbers 1, 2, ..., n gives a Young tableau, and if a tableau satisfies the condition that in each row the numbers increase from left to right and in each column they increase from top to bottom, it is called a standard tableau. For example,

  1 3 4 5 8
  2 6
  7

is a standard tableau of shape (5, 2, 1). The significance of standard tableaux is that they label the individual dimensions of the irrep of the same shape. Hence, we can find the dimensionality of ρ_λ by counting the number of possible standard tableaux of shape λ (Table 1).

Table 1. The dimensionalities of some representations of Sₙ. The diagrams are drawn as if n = 8, but the formulae hold for general n.

  λ = (n)          d_λ = 1
  λ = (n−1, 1)     d_λ = n−1
  λ = (n−2, 2)     d_λ = n(n−3)/2
  λ = (n−2, 1, 1)  d_λ = (n−1)(n−2)/2
  λ = (n−3, 3)     d_λ = n(n−1)(n−5)/6
  λ = (n−3, 2, 1)  d_λ = n(n−2)(n−4)/3

An interesting special property of the symmetric group is that all the irreps can be chosen to be real valued. For generality we retain the complex notation, but note that the actual system of irreps used in our experiments is real, so we could substitute "orthogonal" for "unitary" and ⊤ for † throughout.

4. The Bispectrum and the Skew Spectrum

Armed with the irreps and non-commutative Fourier transforms, we can now undertake a more systematic study of left-translation invariant features of functions on the symmetric group. For example, (6) leads to the set of invariant matrices a_λ = f_λ† · f_λ, λ ⊢ n.
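As a numerical sanity check (a sketch, not part of the paper's algorithm): the invariance argument (6) holds for any unitary representation, so even the reducible defining permutation representation P(σ), with [P(σ)]_{σ(i),i} = 1, must yield a relabeling-invariant a_ρ:

```python
import itertools
import random

n = 4
random.seed(1)
# An illustrative weighted, directed test matrix with no self-loops.
A = [[0.0 if i == j else random.random() for j in range(n)] for i in range(n)]

def relabel(A, tau):
    """A^tau with [A^tau][tau(i)][tau(j)] = [A][i][j]."""
    m = len(A)
    B = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            B[tau[i]][tau[j]] = A[i][j]
    return B

def fourier_defining(A):
    """f_rho = sum_sigma f(sigma) P(sigma) for the defining representation,
    where f(sigma) = A[sigma(n)][sigma(n-1)] as in (2)."""
    m = len(A)
    F = [[0.0] * m for _ in range(m)]
    for s in itertools.permutations(range(m)):
        w = A[s[m - 1]][s[m - 2]]
        for i in range(m):
            F[s[i]][i] += w           # [P(sigma)]_{sigma(i), i} = 1
    return F

def power_spectrum(A):
    """a_rho = f_rho^T f_rho (the representation is real, so dagger = transpose)."""
    F = fourier_defining(A)
    m = len(F)
    return [[sum(F[k][i] * F[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

a1 = power_spectrum(A)
a2 = power_spectrum(relabel(A, (1, 3, 0, 2)))
assert all(abs(a1[i][j] - a2[i][j]) < 1e-9 for i in range(n) for j in range(n))
```

Because the defining representation is reducible, this a_ρ mixes the information of several irreducible components; the systematic construction below works with irreps instead.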
We will use the letter λ to refer to Young diagrams and write λ ⊢ n to denote that λ is a Young diagram with n boxes. To simplify notation somewhat, we write f_λ for f_{ρ_λ}. By analogy with the corresponding quantity in classical signal processing, the collection (a_λ)_{λ⊢n} is called the power spectrum of f. The problem with the power spectrum is that it is very lossy. To see this, one need only consider f̃_λ = M_λ f_λ for any sequence of unitary matrices (M_λ)_{λ⊢n}. The functions f and f̃ corresponding to these two Fourier transforms may be very different, yet their power spectra will be the same.

4.1. The Bispectrum

Kakarala realized that the lossiness of the power spectrum can be addressed by forming tensor products of the various Fourier components, and proposed the alternative system of invariant matrices

  b_{λ₁,λ₂} = (f_{λ₁} ⊗ f_{λ₂})† C_{λ₁,λ₂} (⊕_λ f_λ),   λ₁, λ₂ ⊢ n,    (9)

called the bispectrum (Kakarala, 1993).¹ The bispectrum is based on the observation that f_{λ₁} ⊗ f_{λ₂} transforms according to

  (f^τ)_{λ₁} ⊗ (f^τ)_{λ₂} = (ρ_{λ₁}(τ) ⊗ ρ_{λ₂}(τ)) · (f_{λ₁} ⊗ f_{λ₂}),

and that ρ_{λ₁}(σ) ⊗ ρ_{λ₂}(σ) is also a representation, although in general not irreducible. The general formula

  ρ_{λ₁}(σ) ⊗ ρ_{λ₂}(σ) = C_{λ₁,λ₂} [ ⊕_λ ρ_λ(σ) ] C_{λ₁,λ₂}†,    (10)

in which each irrep may occur with multiplicity, telling us how to reduce the tensor product into a direct sum of irreps, is called the Clebsch-Gordan decomposition, and the unitary matrices C_{λ₁,λ₂} appearing in (10) and (9) are called Clebsch-Gordan matrices. By plugging (10) into (9) it is easy to see that the bispectrum is indeed invariant to left-translation. A much more remarkable fact, proved in (Kakarala, 1992), is that provided the technical condition that each f_λ is invertible is satisfied, the bispectrum is also complete (or lossless), in the sense that the matrices (b_{λ₁,λ₂})_{λ₁,λ₂⊢n} uniquely determine f up to translation.

4.2. The Skew Spectrum

Some of the drawbacks of using the bispectrum in practical applications are that (a) computing (9) may involve multiplying together very large matrices; (b) the Clebsch-Gordan matrices, despite being universal constants, are not generally available in tabulated form; and (c) for large n they are extremely difficult to compute. To address these concerns, Kondor (2007b) proposed an alternative set of invariants, called the skew spectrum, which are unitarily equivalent to the bispectrum, but much more straightforward to compute. The skew spectrum is defined as the collection of matrices

  q_{λ,z} = r_{z,λ} · f_λ†,   λ ⊢ n,  z ∈ Sₙ,    (11)

where (r_{z,λ})_{λ⊢n} is the Fourier transform of the function r_z(σ) = f(σz) f(σ). In (Kondor, 2007b) it is shown that if for some subgroup H, f is constant on left H-cosets (as the function defined in (2) is constant on left S_{n−2}-cosets), then it is sufficient to let z take on just one value from each double coset HzH = { h₁zh₂ | h₁, h₂ ∈ H }, since every other component of q will be linearly dependent on these.

5. The Skew Spectrum of Graphs

By the results of Sections 2 and 4, plugging (2) into (11) will give a relabeling invariant representation of any weighted graph G. As it stands, however, this seems of only academic interest, since z must extend over n! different values, for any one of which the combined size of the (q_{λ,z})_{λ⊢n} matrices is itself n!. Moreover, computing each (q_{λ,z})_{λ⊢n} requires a separate Fourier transform.

The first clue to how these problems may be remedied is provided by the comment at the end of the last section: if we are only interested in linearly independent invariants, then due to the special structure of f, we need only let z take on one value from each S_{n−2}\Sₙ/S_{n−2} double coset. It is easy to see that there are only 7 such double cosets in Sₙ, namely (writing [n−2] for {1, 2, ..., n−2}):

  { σ ∈ Sₙ | σ(n) = n,       σ(n−1) = n−1 }
  { σ ∈ Sₙ | σ(n) = n−1,     σ(n−1) = n }
  { σ ∈ Sₙ | σ(n) = n,       σ(n−1) ∈ [n−2] }
  { σ ∈ Sₙ | σ(n) = n−1,     σ(n−1) ∈ [n−2] }
  { σ ∈ Sₙ | σ(n) ∈ [n−2],   σ(n−1) = n−1 }
  { σ ∈ Sₙ | σ(n) ∈ [n−2],   σ(n−1) = n }
  { σ ∈ Sₙ | σ(n) ∈ [n−2],   σ(n−1) ∈ [n−2] }.    (12)
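For small n it is easy to verify exhaustively that the seven sets in (12) partition Sₙ, and to compute their sizes: (n−2)! for the first two, (n−2)·(n−2)! for the next four, and (n−2)(n−3)·(n−2)! for the last. A sketch (0-indexed, n = 5):

```python
import itertools

n = 5
base = set(range(n - 2))        # [n-2], here 0-indexed: {0, ..., n-3}
last, second = n - 1, n - 2     # the points "n" and "n-1" in 0-indexed form

def coset_index(s):
    """Which of the seven double cosets in (12) sigma belongs to, read off
    from the pair (sigma(n), sigma(n-1))."""
    a, b = s[last], s[second]
    if a == last and b == second:
        return 0
    if a == second and b == last:
        return 1
    if a == last and b in base:
        return 2
    if a == second and b in base:
        return 3
    if a in base and b == second:
        return 4
    if a in base and b == last:
        return 5
    return 6                     # both images lie in [n-2]

counts = [0] * 7
for s in itertools.permutations(range(n)):
    counts[coset_index(s)] += 1

assert sum(counts) == 120                      # the classes cover all of S5
assert counts == [6, 6, 18, 18, 18, 18, 36]    # (n-2)! = 6, 3*6 = 18, 3*2*6 = 36
```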
¹The exact definition of the bispectrum varies somewhat between authors. However, the various definitions are all unitarily equivalent to each other.

Definition 1. Given a graph G of n vertices and adjacency matrix A, the skew spectrum of G is defined as the collection of matrices

  q_{λ,z} = r_{z,λ} · f_λ†,   λ ⊢ n,    (13)

where r_z(σ) = f(σz) f(σ); f is defined as in (2); and z takes on one value from each of the double cosets listed in (12).

The second important consequence of the form of (2) is that using the right system of irreps, f̂ becomes very sparse. To be specific, we use Young's orthogonal representation (YOR), which has the special property that if τ is restricted to S_{n−1}, then the ρ_λ(τ) matrices block-diagonalize in the form

  ρ_λ(τ) = ⊕_{λ−} ρ_{λ−}(τ),   τ ∈ S_{n−1},

where λ− extends over all valid Young diagrams derivable from λ by the removal of a single box. If the pair of standard tableaux t and t′ feature n at the same box, then [ρ_λ(τ)]_{t,t′} = [ρ_{λ−}(τ)]_{t↓_{n−1}, t′↓_{n−1}}, where t↓_{n−1} is the standard tableau that we get from t by removing the box containing n, and λ− is the corresponding Young diagram. If t and t′ feature n at different locations, then [ρ_λ(τ)]_{t,t′} = 0. Applying this relation recursively gives that for τ ∈ Sₖ,

  [ρ_λ(τ)]_{t,t′} = [ρ_{λ↓ₖ}(τ)]_{t↓ₖ, t′↓ₖ}   or   [ρ_λ(τ)]_{t,t′} = 0,    (14)

depending on whether k+1, ..., n are each in the same boxes in t and t′ or not.

Now letting Sₙ/S_{n−2} denote a set of n(n−1) permutations, one from each left S_{n−2}-coset, and defining h_μ: S_{n−2} → ℂ as h_μ(τ) = f(μτ), the Fourier transform may be written as

  f_λ = Σ_{μ∈Sₙ/S_{n−2}} Σ_{τ∈S_{n−2}} f(μτ) ρ_λ(μτ) = Σ_{μ∈Sₙ/S_{n−2}} ρ_λ(μ) Σ_{τ∈S_{n−2}} h_μ(τ) ρ_λ(τ).

Plugging in the appropriate decomposition of ρ_λ, restricted to S_{n−2}, into a direct sum of irreps of S_{n−2} gives

  f_λ = Σ_{μ∈Sₙ/S_{n−2}} ρ_λ(μ) [ ⊕_{λ−−} (ĥ_μ)_{λ−−} ],    (15)

where λ−− extends over the diagrams derivable from λ by the removal of two boxes. Since f is constant on left S_{n−2}-cosets, each h_μ is a constant function, so the component of ĥ_μ at the trivial irrep (n−2) is non-zero, while by unitarity all other components of ĥ_μ vanish. Plugging this result into (15) and using (14) shows that only those columns of f_λ may be non-zero which are indexed by standard tableaux derivable from the diagram (n−2) by adding a box containing n−1 and another box containing n.
Here and in the following, when drawing standard tableaux, we only indicate the positions of those numbers in them that are not determined by the "numbers increase from left to right and top to bottom" rule. In addition, we use the symbol ★ to denote n and · to denote n−1. We summarize the above in the following theorem.

Theorem 1. If f is defined as in (2), then the only non-zero entries of f̂ in YOR are:

1. the single scalar component f_{(n)};
2. the column of f_{(n−2,2)} indexed by the tableau whose second row contains · and ★;
3. the column of f_{(n−1,1)} indexed by the tableau whose second row contains ·;
4. the column of f_{(n−1,1)} indexed by the tableau whose second row contains ★;
5. the column of f_{(n−2,1,1)} indexed by the tableau whose second row contains · and whose third row contains ★.

This remarkable sparsity is the key to computing the skew spectrum of graphs efficiently. At the same time it is rather disappointing, since it manifestly destroys the invertibility of the f_λ matrices required for Kakarala's completeness result.

The r_{z,λ} matrices are also column sparse, but their sparsity pattern is somewhat more complicated, so we leave describing it to (Kondor, 2008). Equation (13) only yields non-zero elements in q_{λ,z} where a non-zero row of r_{z,λ} meets a non-zero column of f_λ†. By the above, this happens at only a constant number of row/column combinations. The exact result, derived in (Kondor, 2008), is the following.

Theorem 2. Using YOR and an appropriate choice of double coset representatives {z}, the skew spectrum of G has at most 85 non-zero scalar components.

6. Computational Considerations

The computational properties of the skew spectrum are closely related to the structural results of the previous section. In particular, it is repeated applications of Clausen decompositions similar to (15), together with the sparsity of YOR, that yield an efficient algorithm to compute q. In contrast to the previous section, we now employ a two-level factorization σ = τ₁τ₂π, where π ∈ S_{n−2}, τ₂ ∈ S_{n−1}/S_{n−2}, and τ₁ ∈ Sₙ/S_{n−1}.
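The two-level coset factorization can be made concrete for small n. In the sketch below, transpositions are used as Sₙ/S_{n−1} and S_{n−1}/S_{n−2} coset representatives; this is an illustrative choice of ours (the Clausen FFT literature typically uses contiguous cycles, but any system of representatives works):

```python
import itertools

def transposition(n, a, b):
    """One-line form (0-indexed) of the permutation swapping a and b."""
    t = list(range(n))
    t[a], t[b] = t[b], t[a]
    return tuple(t)

def compose(a, b):
    """(a b)(i) = a(b(i))."""
    return tuple(a[b[i]] for i in range(len(a)))

def inverse(a):
    inv = [0] * len(a)
    for i, x in enumerate(a):
        inv[x] = i
    return tuple(inv)

def factor(sigma):
    """Factor sigma = tau1 tau2 pi with tau1 determined by sigma(n),
    tau2 by the image of n-1 after peeling off tau1, and pi in S_{n-2}."""
    n = len(sigma)
    tau1 = transposition(n, sigma[n - 1], n - 1)   # moves sigma(n) to the last point
    rest = compose(inverse(tau1), sigma)           # now fixes the last point
    tau2 = transposition(n, rest[n - 2], n - 2)
    pi = compose(inverse(tau2), rest)              # fixes the last two points
    return tau1, tau2, pi

for sigma in itertools.permutations(range(5)):
    tau1, tau2, pi = factor(sigma)
    assert compose(tau1, compose(tau2, pi)) == sigma
    assert pi[4] == 4 and pi[3] == 3               # pi lies in S_{n-2}
    assert tau2[4] == 4                            # tau2 lies in S_{n-1}
```

Since tau1 depends only on σ(n) and tau2 only on one further value, there are n and n−1 possible representatives respectively, matching the n(n−1) left S_{n−2}-cosets.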
Sums of the form (15) show that the Fourier transform over Sₙ may be broken down into n(n−1) Fourier transforms over S_{n−2}; this relationship is at the heart of the Clausen-type fast Fourier transforms for Sₙ (Clausen, 1989). As before, we have n(n−1) functions h_{τ₁τ₂}: S_{n−2} → ℂ, defined h_{τ₁τ₂}(π) = f(τ₁τ₂π), and by (2) each of these is a constant function, equal to [A]_{τ₁τ₂(n), τ₁τ₂(n−1)}; hence each ĥ_{τ₁τ₂} is confined to the one-dimensional trivial component [ĥ_{τ₁τ₂}]_{(n−2)}. However, now we will also have intermediate functions g_{τ₁}: S_{n−1} → ℂ, defined g_{τ₁}(σ) = f(τ₁σ). We then have the following results.

Lemma 1. Each ĝ_{τ₁} can be computed from A in O(n²) scalar operations.

Proof. Similarly to (15), we can relate the Fourier transform of g_{τ₁} to the Fourier transforms of (h_{τ₁τ₂})_{τ₂} by

  [ĝ_{τ₁}]_λ = Σ_{τ₂∈S_{n−1}/S_{n−2}} ρ_λ(τ₂) [ ⊕_{λ−} [ĥ_{τ₁τ₂}]_{λ−} ].

Since each ĥ_{τ₁τ₂} is confined to the one-dimensional component [ĥ_{τ₁τ₂}]_{(n−2)}, the only non-zero columns of ĝ_{τ₁} will be the ones indexed by the standard tableaux of the shapes derivable from (n−2) by addition of a single box, namely (n−1) and (n−2, 1). The first one of these is trivial to compute: since ρ_{(n−1)}(τ₂) = (1), the above sum collapses to

  [ĝ_{τ₁}]_{(n−1)} = Σ_{τ₂∈S_{n−1}/S_{n−2}} [ĥ_{τ₁τ₂}]_{(n−2)}.

This is a sum of n−1 scalars, so it can be computed in O(n) time. Computing the second component involves taking the direct sum M_{τ₁τ₂} = ⊕_{λ−} [ĥ_{τ₁τ₂}]_{λ−}, where λ− extends over the two diagrams (n−2) and (n−3, 1) derivable from (n−2, 1) by removing a box. However, [ĥ_{τ₁τ₂}]_{(n−3,1)} = 0, so M_{τ₁τ₂} has only one non-zero entry. For given τ₂, multiplying ρ_{(n−2,1)}(τ₂) with M_{τ₁τ₂} thus requires n−2 operations. We are summing over n−1 possible values of τ₂, so the total time complexity is (n−1)(n−2). ∎

Lemma 2. f̂ can be computed from the intermediate transforms (ĝ_{τ₁})_{τ₁∈Sₙ/S_{n−1}} in O(n³) operations.

The proof of Lemma 2 is similar to that of Lemma 1, but also involves considerations of the sparsity of the YOR matrices. Unfortunately, space limitations prevent us from providing a proof of this result. Putting the two lemmas together gives the following theorem.

Theorem 3. The Fourier transform of f as defined in (2) can be computed in O(n³) operations.

Proof. Each of the n different ĝ_{τ₁} transforms can be computed in O(n²) operations, followed by the single O(n³) step of computing f̂ from the ĝ's. ∎

Computing r̂_z is unfortunately more costly than computing f̂. An extended version of this paper, which is in preparation, will show that the time complexity of this is O(n⁶). While for n less than about 20 this might still be feasible, for the type of experiments on which we wish to validate the skew spectrum it is not a viable option. The following subsection shows that most of the components of q can still be computed in O(n³) operations.

6.1. The Reduced Skew Spectrum

The expensive part of computing r̂_z is computing those columns outside the five listed in Theorem 1. This leads to the idea of simply forcing these columns to be zero.

Definition 2. Given a graph G of n vertices and adjacency matrix A, the reduced skew spectrum of G is the collection of matrices

  q̃_{λ,z} = r̃_{z,λ} · f_λ†,   λ ⊢ n,    (16)

where f_λ, r_{z,λ} and z are as in Definition 1, and r̃_{z,λ} denotes the projection of r_{z,λ} onto its columns labeled by the five standard tableaux listed in Theorem 1.

Since r̃_z is identical to r_z except for zeroing out certain columns, (q̃_{λ,z}) will yield a subset of the 85 scalar invariants in (q_{λ,z}). For each value of z: for λ = (n) we have one row of r̃_{z,λ} meeting one column of f_λ†, giving one component; for λ = (n−1, 1) we have two rows meeting two columns, giving four components, etc. In total the reduced skew spectrum has 7·(1 + 4 + 1 + 1) = 49 non-zero scalar components.

The space of functions whose Fourier transform has the sparsity pattern of (16) is exactly the space of functions that are constant on left S_{n−2}-cosets. This means that for each r̃_z there must be a corresponding matrix B_z, related to r̃_z in the same way that f is related to the adjacency matrix A. These matrices are given by the following theorem, the proof of which we again relegate to a longer publication.

Theorem 4. For r̃_z as defined in Definition 2, r̃_z(σ) = [B_z]_{σ(n),σ(n−1)}, where the seven possible B matrices, corresponding to the seven double cosets listed in (12), are

  [B₁]_{i,j} = A_{i,j} A_{i,j}
  [B₂]_{i,j} = A_{i,j} A_{j,i}
  [B₃]_{i,j} = (1/n) A_{i,j} Σ_{i′=1}^n A_{i′,j}
  [B₄]_{i,j} = (1/n) A_{i,j} Σ_{j′=1}^n A_{i,j′}
  [B₅]_{i,j} = (1/n) A_{j,i} Σ_{i′=1}^n A_{i′,j}
  [B₆]_{i,j} = (1/n) A_{j,i} Σ_{j′=1}^n A_{i,j′}
  [B₇]_{i,j} = (1/(n(n−1))) A_{i,j} Σ_{i′=1}^n Σ_{j′=1}^n A_{i′,j′}

Theorem 4 tells us that the reduced skew spectrum is very simple to compute: simply form the matrices B₁, ..., B₇, compute the corresponding r̃_z's the same way as f̂ is computed from A, and form the products (16). In total this takes 8 partial Fourier transforms, each of which takes O(n³) time.

7. Experiments

In our experiments we evaluate the performance of the skew spectrum features on four benchmark datasets of chemical structures of molecules: MUTAG, ENZYMES, NCI1, and NCI109. MUTAG (Debnath et al., 1991) is a dataset of 188 mutagenic aromatic and heteroaromatic nitro compounds. The classification task is to predict for each molecule whether it exerts a mutagenic effect on the Gram-negative bacterium Salmonella typhimurium. ENZYMES is a dataset which we obtained from (Borgwardt et al., 2005), and which consists of 600 enzymes from the BRENDA enzyme database (Schomburg et al., 2004). In this case the task is to correctly assign each enzyme to one of the 6 EC top level classes. The average number of nodes of the graphs in this dataset is 32.6 and the average number of edges is 124.3.
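The per-graph preprocessing step used in these experiments is cheap: with row and column sums cached, the seven B matrices of Theorem 4 take O(n²) work, and each B transforms under relabeling exactly like an adjacency matrix, which is what makes r̃_z well defined. A plain-Python sketch (function names are ours, not from the paper):

```python
import random

def reduced_skew_B(A):
    """Form the seven B matrices of Theorem 4 from adjacency matrix A."""
    n = len(A)
    col = [sum(A[i][j] for i in range(n)) for j in range(n)]   # column sums
    row = [sum(A[i][j] for j in range(n)) for i in range(n)]   # row sums
    tot = sum(row)                                             # sum of all entries
    B = [[[0.0] * n for _ in range(n)] for _ in range(7)]
    for i in range(n):
        for j in range(n):
            B[0][i][j] = A[i][j] * A[i][j]
            B[1][i][j] = A[i][j] * A[j][i]
            B[2][i][j] = A[i][j] * col[j] / n
            B[3][i][j] = A[i][j] * row[i] / n
            B[4][i][j] = A[j][i] * col[j] / n
            B[5][i][j] = A[j][i] * row[i] / n
            B[6][i][j] = A[i][j] * tot / (n * (n - 1))
    return B

def relabel(M, tau):
    """M^tau with [M^tau][tau(i)][tau(j)] = [M][i][j]."""
    n = len(M)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[tau[i]][tau[j]] = M[i][j]
    return out

random.seed(2)
n = 5
A = [[0.0 if i == j else random.random() for j in range(n)] for i in range(n)]
tau = (3, 0, 4, 1, 2)
B_orig = reduced_skew_B(A)
B_rel = reduced_skew_B(relabel(A, tau))
# Each B_k built from A^tau equals the relabeling of B_k built from A.
assert all(abs(B_rel[k][i][j] - relabel(B_orig[k], tau)[i][j]) < 1e-9
           for k in range(7) for i in range(n) for j in range(n))
```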
Finally, we also conducted experiments on two balanced subsets of NCI1 and NCI109, which classify compounds based on whether or not they are active in an anti-cancer screen ((Wale & Karypis, 2006) and http://pubchem.ncbi.nlm.nih.gov). Since in these datasets the number of vertices varies from graph to graph, we set n to be the maximum over the entire dataset and augment each of the smaller graphs with the appropriate number of unconnected "phantom" nodes.

The experiments consisted of running SVMs on the above data using the reduced skew spectrum features (linear kernel on these features), the random walk kernel (Gärtner et al., 2003) (with its parameter set to 10⁻³ on MUTAG/ENZYMES, and 10⁻⁴ on the NCI datasets, for optimal performance), and an equal length shortest-path kernel (Borgwardt & Kriegel, 2005).

Our experimental procedure was as follows. We split each dataset into 10 folds of identical sizes. We then split 9 of these folds again into 10 parts, trained a C-SVM (implemented by LIBSVM (Chang & Lin, 2001)) on 9 parts, and predicted on the 10th part. We repeated this training and prediction procedure for C ∈ {10⁻⁷, 10⁻⁶, ..., 10⁷}, and determined the C reaching maximum prediction accuracy on the 10th part. We then trained an SVM with this best C on all 9 folds (= 10 parts), and predicted on the 10th fold, which acts as an independent evaluation set. We repeated the whole procedure 10 times, so that each fold acts as the independent evaluation set exactly once. For each dataset and each method, we repeated the whole experiment 10 times and report mean accuracy levels and standard errors in Table 2.

In three out of four experiments the skew spectrum beats the other methods, including the shortest-path kernel, which is considered state of the art for graphs of this type. Using a Gaussian RBF kernel instead of the linear kernel yields very similar results.
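The index bookkeeping for this nested cross-validation protocol can be sketched in a few lines (a sketch of ours; the actual experiments used LIBSVM for the classifier itself):

```python
import random

def nested_cv_folds(n_examples, n_outer=10, n_inner=10, seed=0):
    """10 outer folds; within the 9 training folds, an inner 10-way split
    used to select C before retraining on all training folds."""
    rng = random.Random(seed)
    idx = list(range(n_examples))
    rng.shuffle(idx)
    outer = [idx[k::n_outer] for k in range(n_outer)]
    splits = []
    for k in range(n_outer):
        test = outer[k]                                  # independent evaluation fold
        train = [i for f, fold in enumerate(outer) if f != k for i in fold]
        inner = [train[m::n_inner] for m in range(n_inner)]
        splits.append((train, test, inner))
    return splits

C_GRID = [10.0 ** e for e in range(-7, 8)]    # C in {1e-7, ..., 1e7}

splits = nested_cv_folds(600)                 # e.g. the ENZYMES dataset size
train, test, inner = splits[0]
assert len(splits) == 10
assert sorted(train + test) == list(range(600))     # folds partition the data
assert sorted(sum(inner, [])) == sorted(train)      # inner parts partition the training set
assert len(C_GRID) == 15
```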
8. Conclusions

We have presented a new system of graph invariants, called the skew spectrum of graphs, based on a purely algebraic technique. From a mathematical point of view the skew spectrum is interesting because it brings a fundamentally new technique to constructing graph invariants. From a practical machine learning point of view the skew spectrum is interesting because it provides a powerful, yet efficiently computable representation for graph structured data instances.

Acknowledgments

We would like to thank Ramakrishna Kakarala for providing us with a hard copy of his thesis. We would also like to thank Dan Rockmore, Tony Jebara, Rocco Servedio, Maria Chudnovsky and Balázs Szendrői for discussions, and the anonymous reviewers for helpful comments.

Table 2. Prediction accuracy in percent of the (reduced) skew spectrum features and state of the art graph kernels on four classification benchmarks in 10 repetitions of 10-fold cross-validation. Standard errors are indicated in parentheses.

                                MUTAG          ENZYMES        NCI1           NCI109
  Number of instances/classes   188/2          600/6          4110/2         4127/2
  Max. number of nodes          28             126            111            111
  Reduced skew spectrum         88.61 (0.21)   25.83 (0.34)   62.72 (0.05)   62.62 (0.03)
  Random walk kernel            71.89 (0.66)   14.97 (0.28)   51.30 (0.23)   53.11 (0.11)
  Shortest-path kernel          81.28 (0.45)   27.53 (0.29)   61.66 (0.10)   62.35 (0.13)

References

Bonchev, D., & Rouvray, D. H. (Eds.). (1991). Chemical graph theory: Introduction and fundamentals, vol. 1. London, UK: Gordon and Breach Science Publishers.

Borgwardt, K. M., & Kriegel, H.-P. (2005). Shortest-path kernels on graphs. Proc. Intl. Conf. Data Mining (pp. 74–81).

Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S. V. N., Smola, A. J., & Kriegel, H.-P. (2005). Protein function prediction via graph kernels. Bioinformatics, 21, i47–i56.

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm.
Best results for each dataset are in bold.

Clausen, M. (1989). Fast generalized Fourier transforms. Theor. Comput. Sci., 55–63.

Collins, M., & Duffy, N. (2002). Convolution kernels for natural language. Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press.

Debnath, A. K., Lopez de Compadre, R. L., Debnath, G., Shusterman, A. J., & Hansch, C. (1991). Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J. Med. Chem., 34, 786–797.

Diaconis, P. (1988). Group representations in probability and statistics, vol. 11 of IMS Lecture Series. Institute of Mathematical Statistics.

Gärtner, T. (2003). A survey of kernels for structured data. SIGKDD Explorations, 5, 49–58.

Gärtner, T., Flach, P., & Wrobel, S. (2003). On graph kernels: Hardness results and efficient alternatives. COLT 2003 (pp. 129–143). Springer.

Harchaoui, Z., & Bach, F. (2007). Image classification with segmentation graph kernels. Proceedings of CVPR 2007.

Huang, J., Guestrin, C., & Guibas, L. (2008). Efficient inference for distributions on permutations. Proceedings of NIPS 2007.

James, G., & Kerber, A. (1981). The representation theory of the symmetric group. Addison-Wesley.

Kakarala, R. (1992). Triple correlation on groups. Doctoral dissertation, Department of Mathematics, UC Irvine.

Kakarala, R. (1993). A group theoretic approach to the triple correlation. IEEE Workshop on Higher Order Statistics (pp. 28–32).

Kondor, R. (2007a). A novel set of rotationally and translationally invariant features for images based on the non-commutative bispectrum. http://arxiv.org/abs/cs.CV/0701127.

Kondor, R. (2007b). The skew spectrum of functions on finite groups and their homogeneous spaces. http://arxiv.org/abs/0712.4259.

Kondor, R. (2008). The skew spectrum of graphs. To appear at http://arxiv.org/.

Kondor, R., Howard, A., & Jebara, T. (2007). Multi-object tracking with representations of the symmetric group. Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics.

Kumar, R., Novak, J., & Tomkins, A. (2006). Structure and evolution of online social networks. KDD (pp. 611–617).

Rockmore, D. N. (1997). Some applications of generalized FFTs. Proceedings of the DIMACS Workshop on Groups and Computation.

Schomburg, I., Chang, A., Ebeling, C., Gremse, M., Heldt, C., Huhn, G., & Schomburg, D. (2004). BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Research, 32D, 431–433.

Sharan, R., & Ideker, T. (2006). Modeling cellular machinery through biological network comparison. Nature Biotechnology, 24, 427–433.

Shawe-Taylor, J. (1993). Symmetries and discriminability in feedforward network architectures. IEEE Transactions on Neural Networks, 4, 816–826.

Wale, N., & Karypis, G. (2006). Comparison of descriptor spaces for chemical compound retrieval and classification. Proc. of ICDM (pp. 678–689). Hong Kong.