Clique Matrices for Statistical Graph Decomposition and Parameterising Restricted Positive Definite Matrices

David Barber
Department of Computer Science, University College London
www.cs.ucl.ac.uk/staff/d.barber

Abstract

We introduce clique matrices as an alternative representation of undirected graphs, generalising the incidence matrix representation. Here we use clique matrices to decompose a graph into a set of possibly overlapping clusters, defined as well-connected subsets of vertices. The decomposition is based on a statistical description which encourages clusters to be well connected and few in number. Inference is carried out using a variational approximation. Clique matrices also play a natural role in parameterising positive definite matrices under zero constraints on elements of the matrix. We show that clique matrices can parameterise all positive definite matrices restricted according to a decomposable graph, and form a structured Factor Analysis approximation in the non-decomposable case.

[Figure 1: Two simple undirected graphs. (a) A five-node graph in which node 3 links the otherwise disjoint groups {1,2,3} and {3,4,5}. (b) A four-node graph whose adjacency matrix is given in Eq. (1).]

1 Introduction

Undirected graphs may be used to represent connectivity or adjacency structures in data. For example, in Collaborative Filtering, the nodes (vertices) in the graph may represent products, and a link (edge) between nodes $i$ and $j$ could be used to indicate that customers who buy product $i$ frequently also buy product $j$. This paper concerns decomposing the graph into well-connected clusters of nodes¹. In Fig.1a, product 3 is typically bought along with products 1 and 2, or with products 4 and 5, though these two product groups are otherwise disjoint. A formal specification of the problem of finding a minimum number of well-connected subsets is to phrase it as MIN CLIQUE COVER [9, 17]. However, in some applications, provided that only a small number of links in an 'almost clique' are missing, this may be considered a sufficiently well-connected group of nodes to form a cluster. We therefore develop a statistical technique to reveal clusters of nodes and to identify the smallest number of such clusters. Our main contribution is the introduction of the clique matrix formalism, a generalisation of the incidence matrix. We apply this to the clustering problem, in addition to demonstrating an application in constrained covariance parameterisation.

¹ Not to be confused with graph partitioning.

2 Clique Decomposition

The symmetric adjacency matrix has elements $A_{ij} \in \{0,1\}$, with a 1 indicating a link between nodes $i$ and $j$. For the graph in Fig.1b, the adjacency matrix is

$$A = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 \end{pmatrix} \qquad (1)$$

where we include self-connections on the diagonal. Given a graph $G$ with adjacency matrix $A$, our aim is to find a 'simpler' description of $A$ that reveals underlying cluster structure.

2.1 Two-Clique Decomposition

Given the undirected graph in Fig.1b, the incidence matrix $Z_{inc}$ is an alternative description of the adjacency structure [6]. Given the $V$ nodes in the graph, we construct $Z_{inc}$ as follows: for each link $ij$ in the graph, form a column of $Z_{inc}$ with zero entries except for a 1 in the $i$th and $j$th rows. The column ordering is arbitrary. For example,

$$Z_{inc} = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 \end{pmatrix}, \qquad Z_{inc} Z_{inc}^T = \begin{pmatrix} 2 & 1 & 1 & 0 \\ 1 & 3 & 1 & 1 \\ 1 & 1 & 3 & 1 \\ 0 & 1 & 1 & 2 \end{pmatrix}$$

is an incidence matrix for the graph in Fig.1b. Taking the outer-product of $Z_{inc}$ with itself², we see that

$$A = H\!\left(Z_{inc} Z_{inc}^T\right) \qquad (2)$$

where $[H(M)]_{ij} = 1$ if $M_{ij} > 0$ and 0 otherwise (i.e. $H(\cdot)$ is the element-wise Heaviside step function). For each $k$, the nodes $i$ and $j$ corresponding to 1s in the $k$th column of $Z$ give a product $Z_{ik} Z_{jk} = 1$. Since this happens for every non-zero $i,j$ pair in the $k$th column, all of the pair connections give rise to a non-zero product.

² $(\cdot)^T$ represents the matrix transpose.
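As a concrete illustration of Eqs. (1)-(2), the following minimal Python/numpy sketch (ours, not part of the original paper; function and variable names are our own) constructs an incidence matrix for Fig.1b and checks the Heaviside relation $A = H(Z_{inc} Z_{inc}^T)$:

```python
import numpy as np

def incidence_matrix(A):
    """Build the V x E incidence matrix Z_inc from a symmetric adjacency
    matrix A; self-connections on the diagonal are ignored."""
    V = A.shape[0]
    edges = [(i, j) for i in range(V) for j in range(i + 1, V) if A[i, j]]
    Z = np.zeros((V, len(edges)), dtype=int)
    for k, (i, j) in enumerate(edges):
        Z[i, k] = Z[j, k] = 1   # one column per link, 1s in rows i and j
    return Z

# adjacency matrix of Fig.1b, Eq. (1)
A = np.array([[1, 1, 1, 0],
              [1, 1, 1, 1],
              [1, 1, 1, 1],
              [0, 1, 1, 1]])

Z_inc = incidence_matrix(A)
H = lambda M: (M > 0).astype(int)          # element-wise Heaviside step
assert (H(Z_inc @ Z_inc.T) == A).all()     # Eq. (2): A = H(Z_inc Z_inc^T)
```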
In other words, all the nodes corresponding to non-zero elements of the $k$th column are connected to each other, thus forming a clique. The interpretation of the elements of $ZZ^T$ is that the diagonal elements $[ZZ^T]_{ii}$ express the number of cliques/columns in which vertex $i$ occurs, while the off-diagonal elements $[ZZ^T]_{ij}$ count the number of cliques/columns that vertices $i$ and $j$ jointly inhabit. A useful viewpoint of the incidence matrix is that it identifies the two-cliques in the graph (here we use the term 'clique' in the non-maximal sense). There are five 2-cliques in Fig.1b, and each column of $Z_{inc}$ specifies which elements are in each 2-clique.

2.2 Clique matrices

The incidence matrix can be generalised to describe larger cliques. Consider the following matrix as a decomposition for Fig.1b, together with its outer-product:

$$Z = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad ZZ^T = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 1 & 2 & 2 & 1 \\ 1 & 2 & 2 & 1 \\ 0 & 1 & 1 & 1 \end{pmatrix} \qquad (3)$$

The interpretation is that $Z$ represents a decomposition into two 3-cliques. As in the incidence matrix, each column represents a clique, and the rows containing a '1' express which elements are in the clique defined by that column. Both $Z_{inc}$ and $Z$ satisfy

$$A = H\!\left(ZZ^T\right) \qquad (4)$$

for Fig.1b. For clustering, $Z$ is to be preferred to the incidence decomposition, since $Z$ decomposes the graph into a smaller number of larger cliques. Indeed, $Z$ solves MIN CLIQUE COVER for Fig.1b.

Definition 1 (Clique Matrix). Given an adjacency matrix $[A]_{ij}$, $i,j = 1,\ldots,V$ ($A_{ii} = 1$), a clique matrix $Z$ has elements $Z_{i,c} \in \{0,1\}$, $i = 1,\ldots,V$, $c = 1,\ldots,C$, such that $A = H(ZZ^T)$. A clique matrix $Z \in \{0,1\}^{V \times C}$ is minimal for $A$ if there exists no other clique matrix $Z' \in \{0,1\}^{V \times C'}$ with a smaller number of columns $C' < C$.

That each column of $Z$ expresses a clique is clear from

$$\left[ZZ^T\right]_{ij} = \sum_k Z_{ik} Z_{jk} \qquad (5)$$

Whilst finding a clique decomposition $Z$ is easy (use the incidence matrix, for example), finding a clique decomposition with the minimal number of columns, i.e. solving MIN CLIQUE COVER, is NP-hard [9]. One approach would be to use a recursive procedure that searches for maximal cliques in the graph, or related techniques based on finding large, densely connected subgraphs [17]. The route we take here is different, motivated by the idea that a perfect clique decomposition is not necessarily desirable if the aim is only to find well-connected clusters in $G$.

3 Statistical Clique Decompositions

To find 'well-connected' clusters, we relax the constraint that the decomposition is in the form of cliques in the original graph. Our approach is to view the absence of links as statistical fluctuations away from a perfect clique. Given a $V \times C$ matrix $Z$, we desire that the higher the overlap between rows³ $z_i$ and $z_j$, the greater the probability of a link between $i$ and $j$. This may be achieved using, for example,

$$p(i \sim j \,|\, Z) = \sigma\!\left(z_i^T z_j\right) \qquad (6)$$

where we define

$$\sigma(x) = \frac{1}{1 + e^{\beta(0.5 - x)}} \qquad (7)$$

and $\beta$ controls the steepness of the function. The 0.5 shift in Eq. (7) ensures that $\sigma$ approximates the step function, since the argument of $\sigma$ is an integer. Under Eq. (6), if $z_i$ and $z_j$ have at least one '1' in the same position, $z_i^T z_j - 0.5 > 0$ and $p(i \sim j)$ is high. Absent links contribute $p(i \nsim j \,|\, Z) = 1 - p(i \sim j \,|\, Z)$.

³ We use lower indices $z_i$ to denote the $i$th row of $Z$.
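To make this statistical description concrete, here is a small sketch (ours; the value $\beta = 10$ anticipates the setting used in the experiments of Section 6) of the link probabilities of Eqs. (6)-(7) for the two-3-clique decomposition of Eq. (3):

```python
import numpy as np

def sigma(x, beta=10.0):
    # Eq. (7): a steep sigmoid whose 0.5 shift puts the step between
    # the integer overlaps 0 and 1
    return 1.0 / (1.0 + np.exp(beta * (0.5 - x)))

Z = np.array([[1, 0],
              [1, 1],
              [1, 1],
              [0, 1]])          # the two-3-clique decomposition, Eq. (3)
P = sigma(Z @ Z.T)              # Eq. (6): p(i ~ j | Z) for every pair
print(np.round(P, 3))           # overlapping rows give probability near 1;
                                # the disjoint rows 1 and 4 give about 0.007
```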
$\beta$ controls how strictly $\sigma(ZZ^T)$ matches $A$; for large $\beta$, very little flexibility is allowed and only cliques will be identified. For small $\beta$, subsets that would be cliques were it not for a small number of missing links are clustered together. The setting of $\beta$ is user and problem dependent.

Given $Z$, and assuming each element of the adjacency matrix is sampled independently from the generating process, the joint probability of observing $A$ is (neglecting its diagonal elements)

$$p(A|Z) = \prod_{i \sim j} \sigma\!\left(z_i^T z_j\right) \prod_{i \nsim j} \left(1 - \sigma\!\left(z_i^T z_j\right)\right) \qquad (8)$$

The ultimate quantity of interest is the posterior

$$p(Z|A) \propto p(A|Z)\, p(Z)$$

where $p(Z)$ is a prior over clique matrices. Later we place a prior on $Z$ that encourages the smallest number of clusters to be identified (and hence the size of the clusters to be large). However, since finding such a $Z$ is hard even for a fixed desired number of clusters $C$, we first develop an algorithm to approximately discover clique matrices, before discussing non-uniform priors $p(Z)$.

4 Finding Z for a fixed cluster number

Formally, our task is to find the Maximum A Posteriori (MAP) solution $\arg\max_Z p(A|Z)$, where $Z$ is a $V \times C$ binary matrix. A variety of deterministic and randomised methods could be brought to bear on this problem. The approach we take here is to approximate the marginal posterior $p(z_{i,c}|A)$ and then to assign each $z_{i,c}$ to the state that maximises this posterior marginal (MPM). This has the advantage of being closely related to marginal likelihood computations, which will prove useful later when addressing the number of clusters. Here we develop a straightforward variational approach based on a simple factorised approximation to the posterior.

4.1 Mean Field Approximation

Given the intractable $p(Z|A) \propto p(A|Z)$, a fully factorised mean-field approximation (see, e.g. [20])

$$q(Z) = \prod_{i=1}^{V} \prod_{c=1}^{C} q(z_{i,c}) \qquad (9)$$

can be found by minimising the KL divergence

$$KL(q,p) = \langle \log q \rangle_q - \langle \log p \rangle_q \qquad (10)$$

where $\langle \cdot \rangle_q$ denotes expectation with respect to $q$. The first, 'entropic' term simply decomposes into $\sum_{i,c} \langle \log q(z_{i,c}) \rangle$. The second, 'energy' term is, up to a constant,

$$\sum_{i \sim j} \left\langle \log \sigma\!\left(z_i^T z_j\right) \right\rangle + \sum_{i \nsim j} \left\langle \log\!\left(1 - \sigma\!\left(z_i^T z_j\right)\right) \right\rangle \qquad (11)$$

The first term of Eq. (11) encourages graph links to be preserved under the decomposition and is given by

$$\sum_{i \sim j} \left\langle f\!\left(\sum_d z_{id} z_{jd}\right) \right\rangle_{\prod_{e=1}^{C} q(z_{ie})\, q(z_{je})} \qquad (12)$$

where $f(x) \equiv \log \sigma(x)$. Minimising Eq. (10) can be achieved by differentiation. Differentiating the energy contribution from the present links, Eq. (12), with respect to $q(z_{kc})$, we identify two cases, $i = k$ and $j = k$; due to symmetry, the derivative is

$$\alpha(Q) \equiv 2 \sum_{j:\, j \sim k} \left\langle f\!\left(\sum_d z_{kd} z_{jd}\right) \right\rangle_{\prod_e q(z_{je}) \prod_{g \neq c} q(z_{kg})} \qquad (13)$$

Similarly, the derivative of the absent-links energy is

$$\bar\alpha(Q) \equiv 2 \sum_{j:\, j \nsim k} \left\langle \bar f\!\left(\sum_d z_{kd} z_{jd}\right) \right\rangle_{\prod_e q(z_{je}) \prod_{g \neq c} q(z_{kg})} \qquad (14)$$

where $\bar f(x) \equiv \log\left(1 - \sigma(x)\right)$. Equating the derivative of Eq. (10) to zero, a fixed-point condition for each $q(z_{kc})$, $k = 1,\ldots,V$, $c = 1,\ldots,C$, is

$$q(z_{kc}) \propto e^{\alpha(Q) + \bar\alpha(Q)} \qquad (15)$$

A difficulty is that neither $\alpha(Q)$ nor $\bar\alpha(Q)$ is easy to compute, due to the non-linearities. A simple Gaussian Field approximation [3] assumes that $\sum_d z_{kd} z_{jd}$ is Gaussian distributed for a fixed state of $z_{kc}$. In this case we need the mean and variance of $\sum_d z_{kd} z_{jd}$. Writing $\theta_{ab} \equiv q(z_{ab} = 1)$ and using the independence of $q$, the mean is

$$\mu_{kj} = z_{kc} \theta_{jc} + \sum_{d \neq c} \theta_{kd} \theta_{jd}$$

A similar expression is easily obtained for the variance $\sigma^2_{kj}$. The Gaussian Field approximation then becomes

$$q(z_{kc}) \propto \exp\Big( 2 \sum_{j:\, j \sim k} \langle f(x) \rangle + 2 \sum_{j:\, j \nsim k} \langle \bar f(x) \rangle \Big) \qquad (16)$$

where the one-dimensional averages are taken with respect to $N(x|\mu_{kj}, \sigma^2_{kj})$ and performed numerically. By evaluating Eq. (16) for the two states of $z_{kc}$ (noting that the mean and variance of the field depend on these states), the approximate update for $\theta_{kc}$ is obtained. A simpler alternative is to assume that the variance of the field is zero and to approximate the averages by evaluating the functions at the mean of the field. We found that this latter procedure often gives satisfactory performance, and we therefore used this simpler and faster approach in the experiments. One epoch corresponds to updating all the $\theta_{kc}$, $k = 1,\ldots,V$, $c = 1,\ldots,C$; during each epoch the order in which the parameters are updated is chosen randomly.
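The following sketch (our own illustration, not the author's released C-code; the naming, initialisation and small numerical guards are our choices) implements one epoch of the zero-variance mean-field updates just described and applies them to the graph of Fig.1b; the MPM assignment of the converged $\theta$ then gives a candidate approximate clique matrix:

```python
import numpy as np

def sigma(x, beta=10.0):
    return 1.0 / (1.0 + np.exp(beta * (0.5 - x)))      # Eq. (7)

def mean_field_epoch(A, theta, beta=10.0):
    """One epoch of the factorised updates for q(z_kc), Eq. (16), in the
    simplified form used in the experiments: the variance of the field is
    taken to be zero, so f and f-bar are evaluated at the field's mean."""
    V, C = theta.shape
    f = lambda mu: np.log(sigma(mu, beta))               # f = log sigma
    fbar = lambda mu: np.log(1.0 - sigma(mu, beta) + 1e-300)
    for idx in np.random.permutation(V * C):             # random order
        k, c = divmod(idx, C)
        logq = np.empty(2)
        for s in (0, 1):
            # mean field: mu_kj = s*theta_jc + sum_{d != c} theta_kd theta_jd
            mu = s * theta[:, c] + theta @ theta[k] - theta[:, c] * theta[k, c]
            contrib = np.where(A[k] == 1, f(mu), fbar(mu))
            contrib[k] = 0.0                             # no self-link term
            logq[s] = 2.0 * contrib.sum()
        theta[k, c] = 1.0 / (1.0 + np.exp(logq[0] - logq[1]))
    return theta

A = np.array([[1, 1, 1, 0], [1, 1, 1, 1],
              [1, 1, 1, 1], [0, 1, 1, 1]])
theta = np.random.rand(4, 2)                             # C = 2 clusters
for epoch in range(50):
    theta = mean_field_epoch(A, theta)
Z = (theta > 0.5).astype(int)                            # MPM assignment
```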
The second, `energy' term, up to a N (x|µkj ,2 ) kj (16) where the one dimensional averages are performed numerically. By evaluating Eq. (16) for the two states 8 7 20 50 50 6 5 100 100 4 60 3 150 150 2 1 200 50 100 150 200 200 200 400 600 800 1000 0 100 1 2 3 4 5 6 7 8 9 10 11 12 20 40 60 80 100 100 20 40 60 80 100 120 140 100 20 40 60 80 100 80 80 80 60 60 40 40 40 20 20 (a) (b) (c) (a) (b) (c) Figure 2: (a) Adjacency matrix for the DIMACS brock200-2 MAX-CLIQUE challenge. Black denotes the presence of a link. (b) Clique Matrix. (c) Log2 histogram of clique occurrence (+1); correctly solves MAX-CLIQUE (12) as well as identifying all remaining clusters. of zkc (and noting that the mean and variance of the field depends on these states), the approximate update for kc is obtained. A simpler alternative is to assume that the variance of the field is zero, and approximate the averages by evaluating the functions at the mean of the field. We found that this latter procedure often gives satisfactory performance and therefore used this simpler and faster approach in the experiments. One epoch corresponds to updating all the kc , k = 1, . . . , V , c = 1 . . . , C . During each epoch the order in which the parameters are updated is chosen randomly. Figure 3: (a) Adjacency matrix of 105 Political Books (black=1). (b) Clique matrix: 521 non-zero entries. (c) Adjacency reconstruction using an Approximate Clique Matrix with 10 cliques ­ see also Fig. 4. that a small number of components should be active, we set a = 1, b = 3. Through Eq. (17), the prior on thus induces a prior on Z . The resulting distribution p(Z, |A) p(Z |)p() is formally intractable. 5.1 Variational Bayes To deal with the intractable joint posterior we adopt a similar strategy to the fixed C case and employ a variational procedure to seek a factorised approximation p(, Z |A) q ()q (Z ) based on minimising K L(q ()q (Z ), p(, Z |A)) q (Z ) up dates A fixed point condition for the optimum of Eq. (20) is (20) 5 Finding the numb er of clusters To bias the contributions to A to occur from a small number of columns of Z , we first reparameterize Z as ( Z = 1 z 1 , . . . , Cmax z Cmax 17) where c {0, 1} play the role of indicators and z is the vector of column c of Z . Cmax is an assumed maximal number of clusters we desire to find. Ideally, we would like to find a likely solution Z with a low number of indicators 1 , . . . , Cmax in state 1. To achieve this we define a prior on 4 , c I [ =0] p(| ) = I [c =1] (1 - ) c (18) To encourage a small number of s to be used, we use a Beta prior p( ). This gives rise to a Beta-Bernoulli distribution B (a + N , b + Cmax - N ) p() = p(| )p( ) = B (a, b) (19) where B (a, b) is the normalisation constant of the betaCm distribution. N c=1ax I [c = 1], namely the number of indicators in state 1. To encourage strongly 4 q (Z ) e log p(A|Z,) q() elog p(A|Z, ) (21) c The average over q () in Eq. (21) in the first expression is complex to carry out and we simply approximate at the average value of the distribution. This reduces the problem to one similar to that of inferring Z for a fixed C , as in Section 4.1. We therefore make the same assumption that q (Z ) factorizes according to Eq. (9). This gives updates of the form Eq. (16) where has been set to its mean value. q () up dates A fixed point condition for the optimum of Eq. (20) is q () p()e log p(A|Z,) q(Z ) , c q (c ). 
q(α) updates. A fixed-point condition for the optimum of Eq. (20) is

$$q(\alpha) \propto p(\alpha)\, e^{\langle \log p(A|Z,\alpha) \rangle_{q(Z)}}$$

Additionally we assume that $q(\alpha) = \prod_c q(\alpha_c)$. The resulting update

$$q(\alpha_c) \propto e^{\langle \log p(A|Z,\alpha) \rangle_{q(Z) \prod_{d \neq c} q(\alpha_d)} \,+\, \langle \log p(\alpha) \rangle_{\prod_{d \neq c} q(\alpha_d)}}$$

is difficult to compute, and we take the naive approach of replacing averages by evaluation at the mean:

$$q(\alpha_c) \propto p\!\left(\alpha_c, \langle \alpha_{\setminus c} \rangle\right) p\!\left(A \,\middle|\, \langle z \rangle, \alpha_c, \langle \alpha_{\setminus c} \rangle\right) \qquad (22)$$

Since $\alpha_c$ is binary, we can easily evaluate Eq. (22) at its two states. Formally, the prior $p(\alpha)$ requires $\alpha_c \in \{0,1\}$. However, in the above approximation the mean $\langle \alpha_c \rangle$ is non-binary. To deal with this, we replace $\sum_c I[\alpha_c = 1]$ by $\sum_c \langle \alpha_c \rangle$ and $\sum_c I[\alpha_c = 0]$ by $C_{max} - \sum_c \langle \alpha_c \rangle$. Since the expressions are valid for non-integer sums, this approximate procedure remains well defined. The algorithm then updates $q(\alpha)$ and $q(Z)$ until convergence. The effect is that, beginning with $C_{max}$ clusters, the updating assigns the $\alpha$'s that are not required to state zero.

6 Demonstrations

DIMACS MAX-CLIQUE. Our method aims to find a complete characterisation of an undirected graph in terms of its constituent clusters. By setting $\beta$ suitably high ($\beta = 10$ in the experiments), we impose that only perfect cliques constitute clusters. In Fig.2a we show the adjacency matrix of a 200-vertex graph taken from the DIMACS 1996 MAX-CLIQUE challenge [4]. This graph was constructed to hide the largest clique and to make it difficult to find using the recursive algorithms of the time. Whilst more recent algorithms readily find the largest clique in this graph [13], the problem serves as an interesting baseline to see whether our algorithm, in searching for a complete decomposition, also solves MAX-CLIQUE for this graph. Running our mean-field algorithm with $C_{max} = 2000$ results in a clique decomposition, Fig.2b, containing 1102 cliques⁵. In Fig.2c we plot a log histogram of the cluster sizes, indicating that there is only a single largest clique, of size 12. The largest clique in the graph is indeed of size 12 [4].

[Figure 2: (a) Adjacency matrix for the DIMACS brock200-2 MAX-CLIQUE challenge. Black denotes the presence of a link. (b) Clique matrix. (c) Log2 histogram of clique occurrence (+1); the method correctly solves MAX-CLIQUE (size 12) as well as identifying all remaining clusters.]

⁵ This takes roughly 30s using a 1GHz machine.
Political Books Clustering. The data consists of 105 books on US politics sold by the online bookseller Amazon. Edges in the graph $G$, Fig.3a, represent frequent co-purchasing of books by the same buyers, as indicated by the 'customers who bought this book also bought these other books' feature on Amazon [10]. Additionally, books are labelled 'liberal', 'neutral', or 'conservative' according to the judgement of a politically astute reader⁶. Running our algorithm with an initial $C_{max} = 200$ cliques, the posterior contains 142 cliques⁷, Fig.3b, giving a perfect reconstruction of the adjacency $A$. For comparison, the incidence matrix has 441 2-cliques. To cluster the data more aggressively, we fix $C = 10$ and run our algorithm. As expected, this results in only an approximate clique decomposition, $A \approx H(ZZ^T)$, as plotted in Fig.3c. The resulting $105 \times 10$ approximate clique matrix is plotted in Fig.4 and demonstrates how individual books are present in more than one cluster. Interestingly, the clusters, found only on the basis of the adjacency matrix, have some correspondence with the ascribed political leanings of the books.

[Figure 3: (a) Adjacency matrix of the 105 Political Books (black = 1). (b) Clique matrix: 521 non-zero entries. (c) Adjacency reconstruction using an approximate clique matrix with 10 cliques; see also Fig. 4.]

[Figure 4: Political Books. Plotted is the 105 x 10 matrix Z, found by approximating the 105 x 105 adjacency 'co-bought' matrix, where a dot indicates q(z_{i,c}) > 0.5. By inspection, cliques 5, 6, 7, 8, 9 largely correspond to 'conservative' books. Green indicates 'conservative' books, yellow 'neutral' and red 'liberal' books. The book titles labelling the rows are omitted here.]

⁶ See www-personal.umich.edu/mejn/netdata/.
⁷ This takes roughly 10s on a 1GHz machine.

7 Latent Parameterisations for Zero-Constrained Positive Definite Matrices

We may use an undirected graph $G$ to represent zero constraints on a positive definite matrix $K$: missing edges in $G$, with adjacency $A_{ij} = 0$, correspond to zero entries $K_{ij} = 0$⁸. An example application would be to fit a Gaussian to data under the constraint that specified elements of the covariance are zero. In such cases, it is useful to have a parameterisation of the allowed space of covariances. We denote the space of positive definite matrices constrained through $G$ by $M^+(G)$. Our approach is based on the simple observation that, by replacing the non-zero entries of a clique matrix $Z$ with arbitrary real values, $Z \to Z^*$, the matrix $Z^*(Z^*)^T$ is positive (semi) definite. An immediate question is the richness of such a parameterisation: can all of $M^+(G)$ be reached in this way?

⁸ In a Gaussian context, missing edges in $G$ typically correspond to missing edges in the inverse covariance. Much of our initial discussion relates only to constraining positive definite matrices and is independent of the application.
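The observation itself is easy to verify numerically. In the sketch below (ours), the non-zero pattern of the clique matrix of Eq. (3) is filled with random real values; the resulting $Z^*(Z^*)^T$ is positive semidefinite and inherits the zero of $A$ at the missing edge (1,4):

```python
import numpy as np

Z = np.array([[1, 0],
              [1, 1],
              [1, 1],
              [0, 1]])                          # clique matrix for Fig.1b
Zs = Z * np.random.randn(*Z.shape)              # Z -> Z*: arbitrary reals
K = Zs @ Zs.T

print(np.all(np.linalg.eigvalsh(K) >= -1e-12))  # positive semidefinite
print(K[0, 3], K[3, 0])                         # exactly 0: missing edge (1,4)
```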
7.1 Decomposable Case

For $G$ decomposable, parameterising $M^+(G)$ is straightforward [14, 12, 19]. For example, one may appeal to the following:

Theorem 1 (Paulsen et al., 1989). The following are equivalent for an undirected graph $G$: (i) the graph is decomposable; (ii) there exists a permutation of the vertices such that, with respect to this renumbering, every $K \in M^+(G)$ factors as $K = T^T T$ with $T \in M(G)$ and $T$ upper triangular.

For decomposable $G$, provided the vertices are perfect-elimination ordered, the Cholesky factor has the same structure as $G$ [19]. In other words, provided the vertices are ordered correctly, the lower triangular part of the adjacency matrix is a clique matrix, and furthermore parameterises all of $M^+(G)$. All positive definite matrices under decomposable zero constraints can therefore be parameterised by some clique matrix.

Definition 2 (Expanded Clique Matrix). Given a $V \times C$ clique matrix $Z \in \{0,1\}^{V \times C}$, the expanded clique matrix consists of $Z$ appended with columns corresponding to all unique sub-columns of $Z$. A sub-column of $z^c$ is defined by replacing one or more entries containing $z_i^c = 1$ by $z_i^c = 0$.

The expanded clique matrix corresponding to the minimal clique matrix derived from Fig.1b is

$$\begin{pmatrix}
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 1
\end{pmatrix} \qquad (23)$$

In the above, the expansion is ordered such that all 3-cliques are enumerated first, then all 2-cliques and finally all 1-cliques. Starting from a minimal clique matrix for a decomposable graph, the expansion of this minimal clique matrix must contain all the columns of the Cholesky factor $T^T$. For the example of $G$ in Fig.1b, the lower triangular Cholesky factor is⁹

$$\begin{pmatrix} t_{11} & 0 & 0 & 0 \\ t_{21} & t_{22} & 0 & 0 \\ t_{31} & t_{32} & t_{33} & 0 \\ 0 & t_{42} & t_{43} & t_{44} \end{pmatrix}$$

which corresponds to columns 1, 2, 7 and 11 of the expanded clique matrix, Eq. (23). Clearly, in general, the expanded clique matrix is an over-parameterisation of $M^+(G)$ for decomposable $G$.

⁹ In general, for a matrix $D$ with elements $d_{ij} \in \{0,1\}$, we use $D^*$ to denote a matrix with $d^*_{ij} = 0$ if $d_{ij} = 0$, and arbitrary values elsewhere.
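A small sketch (ours) of the expansion operation of Definition 2, applied to the minimal clique matrix of Fig.1b; it reproduces the $4 \times 11$ matrix of Eq. (23) up to column ordering:

```python
import numpy as np
from itertools import combinations

def expand_clique_matrix(Z):
    """Expanded clique matrix (Definition 2): append all unique non-empty
    sub-columns, obtained by switching off one or more 1s in a column."""
    V = Z.shape[0]
    cols = set()
    for c in range(Z.shape[1]):
        ones = tuple(np.flatnonzero(Z[:, c]))
        for r in range(len(ones), 0, -1):       # sub-clique sizes, large first
            for sub in combinations(ones, r):
                cols.add(sub)
    cols = sorted(cols, key=lambda s: (-len(s), s))  # order as in Eq. (23)
    E = np.zeros((V, len(cols)), dtype=int)
    for k, sub in enumerate(cols):
        E[list(sub), k] = 1
    return E

Z = np.array([[1, 0], [1, 1], [1, 1], [0, 1]])  # minimal clique matrix, Fig.1b
E = expand_clique_matrix(Z)
print(E.shape)   # (4, 11): two 3-cliques, five 2-cliques, four 1-cliques
```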
7.2 Non-decomposable Case

[Figure 5: (a) A non-decomposable graph on four vertices, with edges (1,2), (1,3), (2,4), (3,4). (b) Correlations can be induced via latent variables. (c) Histogram of the rms errors in approximating covariances according to graph (a) with an expanded incidence matrix.]

For $G$ non-decomposable, no explicit parameterisation is generally possible, and techniques based on positive definite matrix completion are required [12, 18, 5, 14]. For the specific example in Fig.5a, the lower Cholesky factor has the form

$$\begin{pmatrix} c_{11} & 0 & 0 & 0 \\ c_{21} & c_{22} & 0 & 0 \\ c_{31} & c_{32} & c_{33} & 0 \\ 0 & c_{42} & c_{43} & c_{44} \end{pmatrix}, \qquad \text{with } c_{21} c_{31} + c_{22} c_{32} = 0 \qquad (24)$$

which can be found explicitly in this case. More generally, however, for non-decomposable graphs one cannot identify those elements of the Cholesky factor which may be set to zero, with the remainder determined by the positivity requirement [14, 19]. An alternative is to use latent variables to explicitly parameterise $M^+(G)$. One may use Factor Analysis [1]:

$$x = F \epsilon, \quad \epsilon \sim N(0, I) \quad \Rightarrow \quad \Sigma = F F^T$$

where the factor matrix $F$ is suitably structured in order to force zeros in specific elements of $\Sigma$¹⁰. A special case of the above is to use a latent variable to induce correlation between $x_1$ and $x_2$ via a local directed graph element $x_1 \leftarrow \lambda_{12} \rightarrow x_2$. For each edge in $G$, a corresponding latent variable can thus be introduced to form correlations between all pairs of linked variables, without introducing correlations on missing edges in $G$ [8].

¹⁰ By writing $F = [\tilde F | D]$, where $D$ is diagonal, this is explicitly Factor Analysis. Unlike standard FA, the matrix $\tilde F$ will typically be non-square and sparse.

By taking $F = [Z^*_{inc} | I]$, it is clear that this latent variable approach (see, for example, [16]) is reproduced as a special case of restricting clique matrices to incidence matrices. To show that not all of $M^+(G)$ can be reached by clique matrices, consider Fig.5a. In this particularly simple case, the minimal clique matrix is the same as the incidence matrix, and the expanded clique matrix is simply the incidence matrix with the identity matrix appended. The expanded clique matrix therefore contains only columns with at most two non-zero entries. However, the Cholesky factor, Eq. (24), contains a column with three non-zero entries, so that there is no immediate assignment of $[Z^*_{inc} | I]$ which will match the Cholesky factor.

For the non-decomposable five-vertex graph [inline figure omitted], the minimal clique matrix contains 3-cliques, so that its expansion contains columns that an expansion based on an incidence matrix would not. In this case our approximate parameterisation is therefore richer than would be obtained by simply introducing a latent auxiliary variable for each edge of the graph [8, 15].

7.3 Maximum Likelihood Solution

In fitting a Gaussian $N(0, \Sigma)$ to zero-mean data with sample covariance $S$, the ML solution minimises

$$E(\Sigma) \equiv \mathrm{Tr}\!\left(\Sigma^{-1} S\right) + \log \det \Sigma \qquad (25)$$

Our interest is to minimise Eq. (25) subject to zero constraints on $\Sigma$ specified through $G$, with $\Sigma_{ij} = 0$ if $A_{ij} = 0$. For $G$ decomposable, the problem is essentially trivial, since $M^+(G)$ is easily characterised via a structured Cholesky factor, $\Sigma = C^T(\theta) C(\theta)$ (see, for example, [14]); one can then express Eq. (25) in terms of $\theta$ and perform unconstrained minimisation over the free parameters of the Cholesky factor. In the non-decomposable case, no explicit parameterisation of $M^+(G)$ is feasible. A common approach is to recognise that solutions satisfy $\left[\Sigma^{-1} S \Sigma^{-1}\right]_{ij} = \left[\Sigma^{-1}\right]_{ij}$ for $A_{ij} = 1$ and $\Sigma_{ij} = 0$ otherwise [2], and to define iterative procedures to solve this equation [7]. Alternatively, positive definite completion methods may be used to parameterise $M^+(G)$.

Our approach instead uses the parameterisation $\Sigma = Z^*(Z^*)^T$, where $Z$ should be chosen as large as can be computationally afforded; $Z$ can be determined by running the algorithm of Section 4.1¹¹. Although for non-decomposable $G$ not all of $M^+(G)$ is guaranteed reachable through this parameterisation, one may expect that numerically it is often sufficiently close. A benefit of this approach is that one may then minimise Eq. (25) with respect to the free parameters of $Z^*$ using any standard optimisation technique, and convergence is guaranteed. Since our parameterisation has a natural latent variable representation (it is a form of structured Factor Analysis), EM and Bayesian techniques can also be used in this case.

¹¹ A heuristic is to initialise $Z$ for the variational algorithm using the lower Cholesky factor of a spanning decomposable graph, augmented with the missing two-cliques. Once the algorithm converges to an approximate minimal clique matrix, its expansion is used to form the parameterisation $Z^*$.
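A sketch (ours; the optimiser choice, jitter term and the way the sample covariance is generated are our own) of this fitting procedure for the Fig.5a example, using the $4 \times 8$ expanded clique matrix $[Z_{inc} | I]$:

```python
import numpy as np
from scipy.optimize import minimize

def fit_constrained_cov(E, S, jitter=1e-8):
    """Minimise Eq. (25), Tr(Sigma^{-1} S) + log det Sigma, over
    Sigma = Zs Zs^T, where Zs shares the zero pattern of the
    (expanded) clique matrix E."""
    mask = E.astype(bool)
    V = E.shape[0]

    def objective(w):
        Zs = np.zeros(E.shape)
        Zs[mask] = w
        Sigma = Zs @ Zs.T + jitter * np.eye(V)   # keeps Sigma invertible
        return np.trace(np.linalg.solve(Sigma, S)) + np.linalg.slogdet(Sigma)[1]

    res = minimize(objective, np.random.randn(mask.sum()), method='BFGS')
    Zs = np.zeros(E.shape)
    Zs[mask] = res.x
    return Zs @ Zs.T

# Fig.5a (edges (1,2), (1,3), (2,4), (3,4)): expanded clique matrix [Z_inc | I]
E = np.array([[1, 1, 0, 0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 1, 0, 0],
              [0, 1, 0, 1, 0, 0, 1, 0],
              [0, 0, 1, 1, 0, 0, 0, 1]])
S = np.cov(np.random.randn(500, 4), rowvar=False)   # a generic sample covariance
Sigma = fit_constrained_cov(E, S)
print(Sigma[0, 3], Sigma[1, 2])   # both exactly 0: the missing edges of G
```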
A numerical example is plotted in Fig.5c. We take the $4 \times 8$ expanded clique matrix corresponding to Fig.5a and minimise Eq. (25) with respect to the non-zero entries of the clique matrix¹². Each sample matrix $S$ is generated randomly by drawing the values of the Cholesky factor, Eq. (24), independently from a zero-mean, unit-variance Gaussian. In Fig.5c we plot the root-mean-square error between the learned $\Sigma$ and the sample covariance $S$, averaged over all non-zero components of $\Sigma$. The histogram of this error, computed from 1000 simulations, shows that whilst a few cases have appreciable error, the vast majority are numerically well approximated by the expanded clique matrix technique, even though the graph $G$ is non-decomposable.

¹² We chose this simple case since the exact parameterisation of all of $M^+(G)$ is easy to write down. Whilst here the expanded clique and incidence matrices are equivalent, the reader should bear in mind that in more complex situations the expansion based on a clique matrix provides a richer parameterisation than that of the incidence matrix.

8 Summary

We introduced a graph matrix decomposition based on an extension of the incidence matrix concept. Finding the clique decomposition with the smallest number of cliques is a hard problem, and we considered a relaxed version of the problem, finding an approximate clique decomposition using a variational algorithm. The approach can be seen as a form of binary factorisation of the adjacency matrix¹³. Clear extensions of this work would be to consider alternative approximate inference schemes, including sampling methods, for which 'infinite' extensions are also available [11]. An application of clique matrices is to parameterise positive definite matrices under specified zero constraints. We showed that constraints corresponding to decomposable graphs trivially admit a clique matrix representation, and that our structured Factor Analysis technique can be used to approximate the non-decomposable case. This is a richer parameterisation than latent models which consider only pairwise correlations; indeed, the so-called ancillary variable technique is a special case of using incidence, as opposed to clique, matrices. The latent variable formulation additionally offers an alternative to recent work on conjugate priors for constrained covariances in Bayesian learning. C-code for clique matrices is available from the author.

¹³ A parallel development to our own work is [11], which considers binary factorisation of more general matrices. Thanks to Zoubin Ghahramani for pointing me to this.

Acknowledgements

I thank R. Silva, T. Cemgil and M. Herbster for discussions. Supported in part by the European PASCAL Network of Excellence, IST-2002-506778.
References

[1] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley-Interscience, 3rd edition, 2003.

[2] T. W. Anderson and I. Olkin. Maximum-likelihood estimation of the parameters of a multivariate normal distribution. Linear Algebra and its Applications, 70:147-171, 1985.

[3] D. Barber and P. Sollich. Gaussian Fields for Approximate Inference in Layered Sigmoid Belief Networks. In Advances in Neural Information Processing Systems (NIPS). MIT Press, 2000.

[4] M. Brockington and J. Culberson. Camouflaging Independent Sets in Quasi-Random Graphs. In D. S. Johnson, editor, Cliques, Coloring, and Satisfiability: Second DIMACS Implementation Challenge, volume 26 of DIMACS. American Mathematical Society, 1996.

[5] S. Chaudhuri, M. Drton, and T. S. Richardson. Estimation of a covariance matrix with zeros. Biometrika, 94:199-216, 2007.

[6] R. Diestel. Graph Theory. Springer, 2005.

[7] M. Drton and T. Richardson. A New Algorithm for Maximum Likelihood Estimation in Gaussian Graphical Models for Marginal Independence. Uncertainty in Artificial Intelligence, 2003.

[8] D. B. Dunson, J. Palomo, and K. Bollen. Bayesian Structural Equation Modeling. SAMSI 2005-5, Statistical and Applied Mathematical Sciences Institute, 2005.

[9] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York, 1979.

[10] V. Krebs. Political books data. www.orgnet.com.

[11] E. Meeds, Z. Ghahramani, R. Neal, and S. Roweis. Modeling dyadic data with binary latent factors. Neural Information Processing Systems, 19:977-984, 2008.

[12] V. I. Paulsen, S. C. Power, and R. R. Smith. Schur products and matrix completions. Journal of Functional Analysis, 85:151-178, 1989.

[13] W. J. Pullan and H. H. Hoos. Dynamic local search for the maximum clique problem. Journal of Artificial Intelligence Research (JAIR), 25:159-185, 2006.

[14] A. Roverato. Hyper Inverse Wishart Distribution for Non-decomposable Graphs and its Application to Bayesian Inference for Gaussian Graphical Models. Scandinavian Journal of Statistics, 29:391-411, 2002.

[15] R. Silva, W. Chu, and Z. Ghahramani. Hidden Common Cause Relations in Relational Learning. Neural Information Processing Systems, 2007.

[16] R. Silva and Z. Ghahramani. Bayesian Inference for Gaussian Mixed Graph Models. Uncertainty in Artificial Intelligence, 2006.

[17] S. S. Skiena. The Algorithm Design Manual. Springer-Verlag, New York, USA, 1998.

[18] T. P. Speed and H. Kiiveri. Gaussian Markov distributions over finite graphs. Annals of Statistics, 14:138-150, 1986.

[19] N. Wermuth. Linear Recursive Equations, Covariance Selection, and Path Analysis. Journal of the American Statistical Association, 75(372):963-972, 1980.

[20] W. Wiegerinck. Variational approximations between mean field theory and the junction tree algorithm. Uncertainty in Artificial Intelligence, 16:626-633, 2000.