Sparse Gaussian Graphical Models with Unknown Block Structure

Benjamin M. Marlin  bmarlin@cs.ubc.ca
Kevin P. Murphy  murphyk@cs.ubc.ca
Department of Computer Science, University of British Columbia, 2366 Main Mall, Vancouver, Canada

Abstract

Recent work has shown that one can learn the structure of Gaussian Graphical Models by imposing an L1 penalty on the precision matrix, and then using efficient convex optimization methods to find the penalized maximum likelihood estimate. This is similar to performing MAP estimation with a prior that prefers sparse graphs. In this paper, we use the stochastic block model as a prior. This prefers graphs that are blockwise sparse, but unlike previous work, it does not require that the blocks or groups be specified a priori. The resulting problem is no longer convex, but we devise an efficient variational Bayes algorithm to solve it. We show that our method has better test set likelihood on two different datasets (motion capture and gene expression) compared to independent L1, and can match the performance of group L1 using manually created groups.

1. Introduction

Estimating a covariance matrix from high dimensional data using a small number of samples is known to be statistically challenging, and yet it is a problem that arises frequently in practice. In the case where the number of samples N is less than the number of dimensions D, the sample covariance matrix S, which is equal to the MLE, is not positive definite. But even when N > D, the eigenstructure of the MLE tends to be distorted unless D/N is very small (see e.g., (Dempster, 1972)).
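A minimal numpy sketch (not from the paper) illustrating the rank deficiency that motivates regularization; the dimensions mirror the smallest motion capture training size used later (N = 25, D = 60), but the data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 25, 60                      # fewer samples than dimensions
X = rng.standard_normal((N, D))
X -= X.mean(axis=0)                # center the data

S = (X.T @ X) / N                  # MLE of the covariance matrix
eigvals = np.linalg.eigvalsh(S)

# At most N of the D eigenvalues can be non-zero, so S is singular and
# log det(S) is undefined; some form of regularization is needed.
print("rank of S:", np.linalg.matrix_rank(S), "out of", D)
print("smallest eigenvalue:", eigvals.min())
```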
There have been many different attempts to devise regularized estimates of $\Sigma$. A very simple approach, which we shall call Tikhonov regularization, is to use $\hat{\Sigma} = S + \epsilon I$, where $\epsilon \ge 0$ can be chosen by cross validation or the Ledoit-Wolf formula (Ledoit & Wolf, 2004).

An alternative is to form a regularized estimate of the precision matrix $\Lambda = \Sigma^{-1}$ (also called the concentration matrix). A particularly useful approach is based on penalizing the L1 norm of the elements of $\Lambda$, to encourage sparsity in the precision matrix; the resulting objective function is convex and can be optimized by a variety of methods (Banerjee et al., 2006; Banerjee et al., 2008; Friedman et al., 2007; Yuan & Lin, 2007; Duchi et al., 2008; Schmidt et al., 2009). Zeros in the precision matrix correspond to absent edges in the corresponding Gaussian graphical model (GGM), so this penalty can be interpreted as preferring graphs that are sparse, that is, which have few edges. However, this approach is different from standard model selection methods for GGMs, such as (Drton & Perlman, 2004), which estimate the graph structure but not the parameters.

For some kinds of data, it is reasonable to assume that the variables can be clustered or grouped into types, which share similar connectivity or correlation patterns. For example, genes can be grouped into pathways, and connections within a pathway might be more likely than connections between pathways. If the group structure is known, one can extend the above L1 penalized likelihood framework in a straightforward way, by penalizing the infinity norm (Duchi et al., 2008) or the two-norm (Schmidt et al., 2009) of each block separately; the resulting objective function is still convex, and encourages blockwise sparse graphs.

In this paper, we present a method that estimates sparse block-structured precision matrices when the block structure is unknown. We first describe related work in Section 2, and then describe our method in Section 3. In Section 4, we apply our method to two different datasets: the gene expression data used in (Duchi et al., 2008), and a motion capture dataset. Our method outperforms Tikhonov regularization and independent L1 regularization. More interestingly, it can achieve performance that approaches that of the known grouping in both data sets.

2. Related work

One approach to learning sparse GGMs is to form a modified Cholesky decomposition of the precision matrix, $\Sigma^{-1} = B^T D B$, where $B = I - W$, $W$ is a lower triangular matrix of regression weights defined by $x_j = \mu_j + \sum_{i<j} w_{ji}(x_i - \mu_i) + \epsilon_j$, and $D$ is a diagonal matrix of residual precisions (Shachter & Kenley, 1989). Sparsity can then be encouraged by placing a selection prior or an L1 penalty on the regression weights (Smith & Kohn, 2002; Huang et al., 2006), although the result depends on the chosen ordering of the variables.

A closely related approach, which does not require an ordering, is to regress each node on all of the others, using an L1 penalty of the form $\lambda \sum_{j=1}^{D} \|w_{j,:}\|_1$, where $\|w_{j,:}\|_1 = \sum_{i \in \{1,\ldots,D\} \setminus \{j\}} |w_{ji}|$ is the one-norm of the weight vector into node $j$. One can then infer the undirected graph structure from the non-zero elements of $W$. This technique is a consistent estimator of the graph topology under certain conditions, analogous to those that ensure the lasso is model selection consistent for variable selection (Meinshausen & Buhlmann, 2006). We shall refer to this as the "MB" technique.

The MB technique is very similar to the dependency net approach proposed in (Heckerman et al., 2000), except we use L1 penalized linear regression rather than decision trees to infer the graph. Given the graph, regression weights can be estimated for each node and assembled into a precision-like matrix. However, this matrix is not necessarily positive definite for small sample sizes, and thus not necessarily a valid precision matrix $\Lambda$. One way around this is to compute an MLE of $\Lambda$ subject to the estimated set of structural zeros, using the IPF algorithm (Speed & Kiiveri, 1986) or other convex optimization methods (Dahl et al., 2008).

A technique that can estimate $\Lambda$ and achieve sparsity at the same time was independently proposed by (Yuan & Lin, 2007) and (Banerjee et al., 2006). In this approach, we impose an L1 penalty on the elements of the precision matrix, and maximize the penalized log likelihood, subject to the constraint that $\Lambda$ be positive definite. The objective function is given by

$$\log\det(\Lambda) - \mathrm{tr}(S\Lambda) - \lambda \sum_{d=1}^{D}\sum_{d'=1}^{D} |\Lambda_{dd'}| \qquad (1)$$

Note that penalizing the diagonal entries of $\Lambda$ effectively adds a ridge term to $S$, since

$$\mathrm{tr}\big((S + \lambda I)\Lambda\big) = \mathrm{tr}(S\Lambda) + \lambda \sum_{i=1}^{D} |\Lambda_{ii}| \qquad (2)$$

for positive definite $\Lambda$.

In the case that we have known groups, denoted by $S_g$, we can extend this objective as follows:

$$\log\det(\Lambda) - \mathrm{tr}(S\Lambda) - \sum_{g} \lambda_g \,\big\|\{\Lambda_{ij} : (i,j) \in S_g\}\big\|_p \qquad (3)$$

where $p$ specifies which norm to apply to the elements in each group. Note that this objective is still convex. If we use $p = \infty$, as in (Duchi et al., 2008), the penalty has the form $\max_{i,j \in S_g} |\Lambda_{ij}|$. If we use $p = 2$, as in (Schmidt et al., 2009), the penalty has the form $\sqrt{\sum_{i,j \in S_g} \Lambda_{ij}^2}$. This tends to work better, since it only forces a block to be sparse when all the elements within it "want" to be small, rather than having the behavior of the group be dominated by the largest element. In the limit where each edge is its own group, the group L1 objective reduces to the original L1 objective.

Learning sparse GGMs with unknown block structure is much harder, for reasons we explain in Section 3, and we are not aware of any prior work on this problem. However, there has been work on learning sparse DAG models with unknown block structure. Specifically, (Mansinghka et al., 2006) showed how one can use the stochastic block model (Nowicki & Snijders, 2001) as a prior over graphs, combined with a uniform prior over node orderings, to learn block structured DAGs from discrete data.

Our work is related to (Mansinghka et al., 2006), but has the following key differences: we learn undirected graphs rather than DAGs, we learn Gaussian models rather than multinomial models, we use a finite mixture model rather than an infinite one, we use a pseudo likelihood rather than a likelihood when inferring the clustering (but not when estimating $\Lambda$), we use variational Bayes instead of MCMC, and we apply our method to real data rather than synthetic data.
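To make the blockwise penalty concrete, here is a small numpy sketch (function name is mine) that evaluates the group-penalized objective of Equation 3 with the two-norm (p = 2) for a candidate precision matrix. It only illustrates the objective; it is not the projected quasi-Newton solver of (Schmidt et al., 2009) that is actually used later, and the group labels and penalty scale are made-up inputs.

```python
import numpy as np

def group_l1_objective(Lambda, S, groups, lam):
    """Evaluate log det(Lambda) - tr(S Lambda) - sum_g lam_g * ||Lambda_g||_2 (Eq. 3, p = 2).

    Lambda : (D, D) candidate precision matrix (must be positive definite)
    S      : (D, D) sample covariance matrix
    groups : length-D integer vector of cluster labels; block (k, k') collects
             all entries Lambda[i, j] with groups[i] == k and groups[j] == k'
    lam    : overall penalty scale; each block is scaled by its size, as in Sec. 3.6
    """
    sign, logdet = np.linalg.slogdet(Lambda)
    if sign <= 0:
        return -np.inf                           # outside the positive definite cone
    obj = logdet - np.trace(S @ Lambda)
    for k in np.unique(groups):
        for kp in np.unique(groups):
            block = Lambda[np.ix_(groups == k, groups == kp)]
            lam_g = lam * block.size             # lambda_g proportional to block size
            obj -= lam_g * np.linalg.norm(block)  # Frobenius norm = two-norm of the block
    return obj
```

In the limit where every entry is its own block, the double loop reduces to an elementwise L1 penalty, matching Equation 1.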
3. Method

3.1. Overview

We propose a two stage method for learning sparse GGMs with unknown blocks. In the first stage, we optimize a pseudolikelihood criterion, as in the MB method, combined with a sparsity promoting prior on the weights. The sparsity level of each edge weight, $W_{ij}$, is controlled by the clusters to which nodes $i$ and $j$ belong, as well as the probability of an edge between these cluster types. Having identified the clusters, we then estimate $\Lambda$ using the block L1 method. We give the details of our method below, but first we motivate why we adopted this two stage approach.

It is well known that the lasso (L1 regularized linear regression) is equivalent to MAP estimation with a Laplace prior. By analogy, we can see that L1 penalization of each element of the precision matrix corresponds to an independent Laplace prior on each element of $\Lambda$, normalized over the space of positive definite matrices $\mathcal{P}$ in $D$ dimensions. More precisely, optimizing Equation 1 is equivalent to MAP estimation with the following prior, where the indicator function $\mathbb{I}[\Lambda \in \mathcal{P}]$ ensures that the prior is zero for precision matrices that are not positive definite:

$$p(\Lambda \mid \lambda) = \frac{\mathbb{I}[\Lambda \in \mathcal{P}] \prod_{d=1}^{D}\prod_{d'=d}^{D} \lambda_{dd'} \exp(-\lambda_{dd'}|\Lambda_{dd'}|)}{\int_{\mathcal{P}} \prod_{d=1}^{D}\prod_{d'=d}^{D} \lambda_{dd'} \exp(-\lambda_{dd'}|\Lambda_{dd'}|)\, d\Lambda}$$

If the prior parameters $\lambda_{dd'}$ are fixed, the normalization term is constant with respect to the precision matrix and can be ignored, leading to an efficiently solvable convex optimization problem. If the prior parameters $\lambda_{dd'}$ are themselves endowed with a structured hierarchical prior, and we wish to adapt these parameters at the same time as we fit the covariance matrix $\Sigma$, the normalization term in the equation above varies and is intractable to compute exactly, unless we restrict attention to the class of decomposable graphical models (see e.g., (Rajarratnam et al., 2008)). One could use a Monte Carlo estimate, as in (Lenkoski & Dobra, 2008), but this would be very expensive, since it would have to be recomputed every time $\lambda$ changes. By working with the pseudolikelihood, we can solve a series of related sparse regression problems without worrying about the positive definite constraint. Once we have identified suitable clusters in the data, we then optimize Equation 3. In the following sections, we describe our method in more detail.
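As a concrete illustration of the first-stage idea of fitting node-wise sparse regressions (the plain MB-style pseudolikelihood step, without the block-structured prior that this paper layers on top), here is a hedged sketch using scikit-learn's Lasso; the penalty value and the "OR" symmetrization rule are illustrative choices of mine, not the paper's.

```python
import numpy as np
from sklearn.linear_model import Lasso

def mb_neighborhood_selection(X, alpha=0.1):
    """Regress each variable on all the others with an L1 penalty and
    read off a sparse graph from the non-zero regression weights."""
    N, D = X.shape
    W = np.zeros((D, D))                 # W[d, i] = weight of node i in the regression for node d
    for d in range(D):
        others = [i for i in range(D) if i != d]
        fit = Lasso(alpha=alpha).fit(X[:, others], X[:, d])
        W[d, others] = fit.coef_
    # Symmetrize: declare an edge if either directed weight is non-zero ("OR" rule).
    G = (np.abs(W) > 1e-8) | (np.abs(W.T) > 1e-8)
    np.fill_diagonal(G, False)
    return W, G

# Usage: W, G = mb_neighborhood_selection(X_standardized, alpha=0.05)
```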
3.2. Model

We define our model by explaining the generative process in a top-down fashion. Steps 1-4 define the stochastic block model on undirected graphs (Nowicki & Snijders, 2001). We assume the number of groups K is fixed, and discuss how to estimate this below. The remaining steps explain how we use the graph as a prior for the weights, which are then used to generate the data. The overall model is illustrated in Figure 1 (left).

1. We sample the fraction of nodes in each group from a symmetric Dirichlet distribution, $\pi \sim \mathrm{Dir}(\alpha \mathbf{1}_K)$.

2. For each variable $d$, we sample a group membership using a multinomial distribution, $z_d \sim \mathrm{Multi}(\pi, 1)$, i.e., $p(z_d = k \mid \pi) = \pi_k$.

3. For each pair of groups $k, k'$, we sample the probability of an edge between them, $\eta_{k,k'} \sim \mathrm{Beta}(a_{k,k'}, b_{k,k'})$.

4. For each pair of distinct variables $d, d'$, we sample an edge between them according to a Bernoulli, $G_{d,d'} \sim \mathrm{Ber}(\eta_{z_d, z_{d'}})$. This results in a symmetric undirected graph.

5. We sample $1/\sigma_0^2$ from a Gamma prior, and set $\sigma_1^2 = v\,\sigma_0^2$ for a fixed multiplier $v$, as described below.

6. For each pair of nodes $1 \le d, d' \le D$, $d \ne d'$, we sample an edge weight according to
$$w_{d,d'} \sim \mathcal{N}(0, \sigma_1^2)^{G_{d,d'}}\,\mathcal{N}(0, \sigma_0^2)^{1-G_{d,d'}}$$
so that present edges draw their weight from the broad "slab" $\mathcal{N}(0, \sigma_1^2)$ and absent edges from the narrow "spike" $\mathcal{N}(0, \sigma_0^2)$. This is similar to the standard "spike and slab" prior used in the Bayesian variable selection literature (George & McCulloch, 1997).

7. Finally, we generate the data via a series of independent linear regressions:
$$x_{d,n} \sim \mathcal{N}(w_d^T x_{-d,n}, \sigma^2)$$
where $x_{-d,n}$ is the n'th training vector with the d'th component removed and $\sigma^2$ is the overall noise level. (Note that this is just an interpretation of the pseudolikelihood, rather than a proper generative mechanism for the data.)

[Figure 1 about here: left, the graphical model; right, the table of closed-form variational updates (Beta-count updates for $\eta$, Dirichlet-count updates for $\pi$, a softmax update for the cluster responsibilities, a logistic update for the edge posteriors, a Gamma update for $1/\sigma_0^2$, and a ridge-regression solve for each row of $\hat{w}$).]

Figure 1. Summary of the model (left) and the variational updates (right). On the left, small square nodes are fixed hyperparameters. The double-ringed $\sigma_1$ node is a deterministic function of $\sigma_0$ and $v$. Only the X matrix is observed; all other quantities are inferred.
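To make the generative story in steps 1-7 concrete, here is a small numpy sketch that forward-samples a graph and a weight matrix from the block-structured spike-and-slab prior described above. It is an illustration only: the hyperparameter values are placeholders (v ≈ 50 is roughly the value implied by the c = 2 choice in Section 3.3), and the paper never actually samples data this way, since step 7 is only a pseudolikelihood interpretation.

```python
import numpy as np

def sample_block_prior(D=60, K=5, sigma0_sq=0.01, v=50.0, seed=0):
    rng = np.random.default_rng(seed)
    # 1-2: cluster proportions and node memberships (stochastic block model)
    pi = rng.dirichlet(np.full(K, 1.0 / K))          # symmetric Dirichlet, parameter 1/K (Sec. 3.3)
    z = rng.choice(K, size=D, p=pi)
    # 3: edge probabilities between groups; within-group edges favored (Beta(2,1) vs Beta(1,2), Sec. 3.3)
    a = np.where(np.eye(K, dtype=bool), 2.0, 1.0)
    b = np.where(np.eye(K, dtype=bool), 1.0, 2.0)
    eta = rng.beta(a, b)
    eta = (eta + eta.T) / 2                          # keep the group-level matrix symmetric (illustrative)
    # 4: sample a symmetric undirected graph
    G = np.zeros((D, D), dtype=int)
    for d in range(D):
        for dp in range(d + 1, D):
            G[d, dp] = G[dp, d] = rng.binomial(1, eta[z[d], z[dp]])
    # 5-6: spike-and-slab edge weights; present edges use the broad slab N(0, v * sigma0_sq)
    sigma1_sq = v * sigma0_sq
    var = np.where(G == 1, sigma1_sq, sigma0_sq)
    W = rng.normal(0.0, np.sqrt(var))                # w_{d,d'} and w_{d',d} are sampled separately
    np.fill_diagonal(W, 0.0)
    return z, eta, G, W

z, eta, G, W = sample_block_prior()
```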
3.3. Hyperparameters

The model has several hyperparameters, which we now discuss. The key hyperparameters that we have identified are the number of clusters K, the weight variance scale parameter, and the data noise variance $\sigma^2$. The number of clusters K is determined in the learning algorithm using the model free energy and explicit cluster splitting steps, as described in Section 3.5. Preliminary cross validation experiments showed that setting the weight variance scale parameter to 0.01 and $\sigma^2 = 0.1$ led to good performance in terms of the final precision matrix estimate on both of the data sets we consider, and these settings are used throughout the experiments.

We set the Beta hyperparameters on $\eta$ to $a_{k,k'} = 2$, $b_{k,k'} = 1$ if $k = k'$, and $a_{k,k'} = 1$, $b_{k,k'} = 2$ if $k \ne k'$. This encodes a weak prior favoring a graph structure with more edges between nodes in the same cluster than between nodes in different clusters. We set the shape parameter of the Gamma prior on $1/\sigma_0^2$ to 1, and note that it is overwhelmed in the posterior update by the $D(D-1)/2$ factor, as described in Section 3.5. We set the Dirichlet parameter $\alpha = 1/K$.

The $\sigma_1^2$ parameter is defined to be a constant multiple $v$ of $\sigma_0^2$. This construction for $\sigma_0^2$ and $\sigma_1^2$ is important, since the value of $v$, in conjunction with the other variance parameters, determines the sparsity of the inferred graph. Intuitively, we determine whether an edge $G_{d,d'}$ is present or absent by classifying its corresponding weight $w_{d,d'}$ under the two Gaussian distributions $\mathcal{N}(0, \sigma_0^2)$ and $\mathcal{N}(0, \sigma_1^2)$. We want to choose the variances so that this decision boundary occurs at some "reasonable" distance from the origin, to avoid turning on edges whose weights are too small.

We choose $v$ such that the two pdfs, $\mathcal{N}(0, \sigma_0^2)$ and $\mathcal{N}(0, \sigma_1^2)$, intersect at $c\sigma_0$, i.e., we solve

$$\mathcal{N}(c\sigma_0 \mid 0, \sigma_0^2) = \mathcal{N}(c\sigma_0 \mid 0, v\sigma_0^2)$$

for $v$, which yields

$$v = \exp\!\big(\mathrm{LW}(-c^2 \exp(-c^2)) + c^2\big)$$

where LW is the Lambert W function (the inverse of $f(x) = x e^x$). We require $c > 1$, otherwise the broad $\mathcal{N}(0, \sigma_1^2)$ pdf will not dominate $\mathcal{N}(0, \sigma_0^2)$ in the tails. We chose $c = 2$, which results in a fairly sharp decision boundary, and hence a low entropy posterior on G. (Other approaches for setting $\sigma_0^2$ and $\sigma_1^2$ are discussed in (George & McCulloch, 1997).)
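The closed-form expression for $v$ can be checked numerically. The following sketch uses scipy's Lambert W on the principal branch (an implementation detail the paper does not specify) to compute $v$ for c = 2 and to verify that the two densities do intersect at $c\sigma_0$:

```python
import numpy as np
from scipy.special import lambertw
from scipy.stats import norm

def slab_multiplier(c):
    """v such that N(0, sigma0^2) and N(0, v * sigma0^2) cross at c * sigma0 (requires c > 1)."""
    return np.exp(lambertw(-c**2 * np.exp(-c**2), k=0).real + c**2)

c, sigma0 = 2.0, 0.1               # sigma0 is an arbitrary scale for the check
v = slab_multiplier(c)             # approximately 50.4 for c = 2
spike = norm.pdf(c * sigma0, loc=0.0, scale=sigma0)
slab = norm.pdf(c * sigma0, loc=0.0, scale=np.sqrt(v) * sigma0)
print(v, spike, slab)              # the two densities agree at the crossing point
```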
3.4. Variational Bayes Approximation

The full posterior is proportional to the joint distribution $p(Z, \pi, \eta, G, W, \sigma_0, X \mid K, \sigma^2, a, b)$, conditioned on the fixed hyperparameters of Section 3.3. We use variational Bayes (Ghahramani & Beal, 2000) to approximate the posterior. In particular, we use the following fully factorized approximation:

$$Q(Z, \pi, \eta, G, W, \sigma_0) = Q(Z)\,Q(\pi)\,Q(\eta)\,Q(G)\,Q(W)\,Q(\sigma_0)$$

The individual variational distributions are as follows: each $Q(Z_d)$ is a multinomial, $Q(\pi)$ is a Dirichlet, each $Q(\eta_{k,k'})$ is a Beta, each $Q(G_{d,d'})$ is a Bernoulli, $Q(1/\sigma_0^2)$ is a Gamma, and $Q(W) = \delta(W - \hat{w})$ is a point mass at the current weight estimate $\hat{w}$.

It is important that the quantities that change dimensionality with the number of clusters, namely $\pi$ and $\eta$, have non-degenerate distributions; otherwise we cannot use the free energy for model selection (see Section 3.5). In particular, if we perform MAP estimation of $\pi$, there is nothing in the model to encourage a small number of clusters, but by putting a distribution on $\pi$, we get an automatic form of complexity control (see e.g. (Bishop, 2006, p480) for discussion). Finally, note that we choose to use point estimation for W, since representing the uncertainty in W using a full covariance Gaussian would be intractable (requiring $O(D^4)$ space), and using a diagonal approximation would not provide much benefit over just using a point estimate.

3.5. Learning

We learn the model parameters by optimizing the free energy. The inner loop of the learning algorithm consists of updating the variational parameters of each Q distribution (and several auxiliary parameters) in turn until the free energy converges. The required parameter updates are given in Figure 1 (right) and are derived following the usual free-form optimization method (see e.g. (Bishop, 2006, p466)).

The strategy we use to select the number of clusters K consists of starting with all nodes in a single cluster (K = 1) and running the variational updates until the free energy converges to within a specified tolerance ($10^{-6}$). Each time the variational updates converge for a given value of K, we consider splitting each of the current clusters. We accept the first split that results in an increase in the variational free energy, using a 20 iteration look-ahead. We then continue iterating the variational updates with K + 1 clusters. We terminate the algorithm if no split is found that increases the free energy, or if the learning algorithm exceeds 1000 iterations of the variational updates.

The split proposal mechanism is an important component of the learning method. We propose a split for a given cluster k by deriving a similarity matrix H from the current estimate of the regression weights $\hat{w}$ and applying spectral clustering to H to partition it into two clusters. More precisely, if $S = \{i : z_i = k\}$ is the set of nodes belonging to cluster k, and $\bar{S}$ are the other nodes, we compute the similarity matrix $H = |\hat{w}(S, S)| + 0.5\,|\hat{w}(S, \bar{S})|\,|\hat{w}(S, \bar{S})|^T$. Intuitively, two data dimensions d and d' are similar under this measure if they are useful for predicting each other (the first term), and if there is significant overlap in the subsets of dimensions outside of cluster k that are useful for predicting both d and d' (the second term).
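A hedged sketch of this split proposal, using scikit-learn's spectral clustering on a precomputed affinity matrix; the explicit symmetrization step is an assumption on my part (the weight matrix is not symmetric, and the paper does not say how this is handled), and the function name is mine.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def propose_split(w_hat, z, k):
    """Propose a 2-way split of cluster k from the current regression weight estimate w_hat."""
    in_k = np.where(z == k)[0]
    out_k = np.where(z != k)[0]
    W_in = np.abs(w_hat[np.ix_(in_k, in_k)])     # |w(S, S)|: mutual predictiveness within the cluster
    W_out = np.abs(w_hat[np.ix_(in_k, out_k)])   # |w(S, S_bar)|: predictiveness from outside nodes
    H = W_in + 0.5 * (W_out @ W_out.T)
    H = 0.5 * (H + H.T)                          # symmetrize, since w_hat need not be symmetric (assumption)
    labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                random_state=0).fit_predict(H)
    return in_k, labels                          # node indices of cluster k and their proposed sub-cluster
```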
3.6. Estimating the precision matrix

Once our main learning algorithm has run to completion, we compute the marginal MAP assignment of nodes to groups, and then use this known grouping structure as input to the group L1 method to infer the precision matrix. Specifically, we optimize Equation 3 (using the algorithm and code of (Schmidt et al., 2009)), where we use the groups $S_{k,k'} = \{(i,j) : z_i = k, z_j = k'\}$. Following (Duchi et al., 2008), we set the penalty parameter for each group to $\lambda_g = \lambda |S_g|$, where $\lambda$ is an overall scale parameter. This ensures that all the blocks have a comparable level of sparsity, regardless of their size.

3.7. Complexity and Fast Update Schedule

The computational bottleneck in our method is solving the D independent ridge regression problems required to update the weight matrix $\hat{w}$, each at a cost of $O((D-1)^3)$. This takes $O(D^4)$ time per iteration of the inner loop of the learning algorithm. However, by using intelligent scheduling of the updates, one can reduce this cost substantially. In particular, we only perform a full update of the $\hat{w}$ matrix every 10 iterations. On the remaining iterations, we sort the nodes d by the L1 norm of the change in their variational edge posteriors and only update the top 10% of nodes. The intuition is that, since the edge posteriors are the only quantity that changes in the variational update for $\hat{w}_d$, we do not need to update a node if this change is small relative to the last time the node was updated. This fast update schedule gives roughly a 10-fold speed improvement, without significantly affecting the quality of the estimates (see Figure 2(a-d)). The fast update schedule leads to a total training time (including precision matrix estimation) of approximately 5 hours on current single CPU hardware for the genes data discussed in Section 4.2, where D = 667.

[Figure 2 about here: top row, relative test log likelihood vs regularization strength for N = 25, 50, 75, 100; bottom row, panels titled Known Blocks, Clustering E(z), Block connectivity E($\eta$), and Graph E(G).]

Figure 2. Results on CMU mocap data (D = 60). Figures 2(a) to 2(d) show the test set log likelihood (relative to Tikhonov) vs $\lambda$ for Tikhonov (T), Independent L1 (IL1), Known Group L1 (KGL1), Unknown Group L1 (UGL1) and its fast version (UGL1F), for training set sizes N = 25, 50, 75, 100. Figures 2(e) to 2(h) show the known block structure, the inferred block structure, the stochastic block model parameters, and the pairwise interaction probabilities for the first fold of the CMU data set with 50 training cases. The color axis in all four figures is scaled from zero (black) to one (white).

4. Experimental results

In this section, we apply our method to two different datasets, and compare its performance with three other methods: Tikhonov regularization, independent L1 regularization, and group L1 regularization with known groups. Since the datasets are small, we use 5-fold cross validation (CV), and compute the mean and standard error of the unpenalized log likelihood on each test fold. We also consider varying the size of the training set. All the data is preprocessed by standardizing it. To select the strength of the Tikhonov regularizer, we use 5-fold CV within each training fold. We use the same value of this parameter for all the methods, and report performance relative to the Tikhonov baseline. Rather than picking $\lambda$ for the L1 methods, we plot performance vs $\lambda$, as in (Duchi et al., 2008), to better illustrate the differences between the methods.
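The comparison metric is the unpenalized log likelihood of held-out data under each estimated precision matrix. A minimal sketch of that score for zero-mean (standardized) data follows; the cross-validation loop, the standard-error computation, and the relative-to-Tikhonov normalization are omitted, and the function name is mine.

```python
import numpy as np

def gaussian_test_loglik(X_test, Lambda):
    """Average unpenalized log likelihood of zero-mean Gaussian data under precision Lambda."""
    N, D = X_test.shape
    sign, logdet = np.linalg.slogdet(Lambda)
    assert sign > 0, "Lambda must be positive definite"
    quad = np.einsum("nd,de,ne->n", X_test, Lambda, X_test)  # x_n^T Lambda x_n for each test case
    return 0.5 * logdet - 0.5 * quad.mean() - 0.5 * D * np.log(2 * np.pi)
```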
4.1. Motion capture data

The data set used in this section is based on the "Dance" data set in the CMU motion capture library (available at http://mocap.cs.cmu.edu/). This is more expressive than simple walking sequences. The joint angles for all sequences were extracted from the raw motion capture files. The root position and orientation were kept fixed and combined with the remaining joint angles to obtain skeleton positions using the Matlab Motion Capture Toolbox. There are a total of 93 variables defined by the skeleton, consisting of 31 marker (x, y, z) triples; however, not all markers vary when the root node is fixed, and some, like the thumb and front foot markers, are overly correlated with their parent parts in the skeleton. We use a subset of the markers, including two markers for the head and neck, and four markers for the trunk, each arm, and each leg. This gives a total of 20 markers and 60 variables.

Skeleton positions within the same motion capture sequence were selected by considering a threshold of 0.1 on the average distance between the current skeleton position and the previously selected skeleton position. This processing was performed to eliminate portions of the sequence where the motion capture subject is not moving.

To enhance the block structure in the data set, the skeleton was manually partitioned into five parts: head and neck, left arm, right arm, left leg, and right leg. A new data set was constructed where each data case is created by sampling parts independently from the empirical distribution over positions for that part. This is roughly equivalent to recording someone as they move their arms and legs independently.
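A minimal sketch of this part-shuffling construction, assuming the 60 variables are arranged as contiguous blocks of column indices per body part (the grouping into parts is the paper's, but the index layout, sample count, and function name here are made up):

```python
import numpy as np

def shuffle_parts(frames, part_cols, n_cases, seed=0):
    """Build a synthetic data set by sampling each body part independently
    from the empirical distribution of frames for that part."""
    rng = np.random.default_rng(seed)
    n_frames, D = frames.shape
    X = np.zeros((n_cases, D))
    for cols in part_cols:                        # e.g., five blocks: head/neck, arms, legs
        picks = rng.integers(0, n_frames, size=n_cases)
        X[:, cols] = frames[np.ix_(picks, cols)]
    return X

# Hypothetical layout: five parts of 12 columns each covering the 60 variables.
part_cols = [np.arange(i * 12, (i + 1) * 12) for i in range(5)]
# X = shuffle_parts(selected_frames, part_cols, n_cases=100)
```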
Figure 2 shows the results on this dataset. In the top row, we plot average test set log likelihood vs $\lambda$ for the different methods, for training set sizes of N in {25, 50, 75, 100}. (Recall that D = 60.) We draw the following conclusions from these graphs: all the L1 methods are better than Tikhonov for a reasonably broad range of $\lambda$; the known groups method is the best; and our method for handling unknown groups achieves performance that is almost as good as known groups, especially for larger sample sizes. (Regular L1 penalized GGMs without grouping were previously applied to mocap data in (Gu et al., 2007), but no quantitative performance measures were given.)

In the bottom row, we examine the model which was learned when N = 50. (Results are for one particular test fold; results are similar for the other folds.) In Figure 2(e), we plot the "true" clustering, corresponding to the 5 body parts. In Figure 2(f), we show the estimated clustering. We see that the method chose the correct number of clusters, and assigned most of the nodes to the correct groups, except for a few ambiguous points. (Note that due to label switching, Figure 2(f) need not look identical to Figure 2(e) even if the assignment is perfect.) In Figure 2(g), we plot the $\eta$ matrix; we see that each cluster is fairly densely connected within itself, but there are essentially no connections between clusters. This is due to the way that the data was created. Finally, in Figure 2(h), we plot the expected graph E(G). We see that there are some non-zero off-diagonal elements, even though the prior says that this is unlikely to occur.

4.2. Gene expression data

In this section, we apply our method to the gene expression dataset used in (Duchi et al., 2008), which consists of 174 samples of 667 genes. In (Duchi et al., 2008), the GGM was estimated using a known grouping, where the genes were partitioned into 86 different groups based on prior biological knowledge. In Figure 3, we see that our method results in similar predictive performance to the method which uses the known grouping, and both grouping methods do better than independent L1 on average. However, we note that the latent structure inferred by our method (not shown) contains approximately 30 groups with no obvious relationship to the known structure based on 86 groups. Nevertheless, both groupings improve the regularized estimate of $\Lambda$.

[Figure 3 about here: relative test log likelihood vs regularization strength at N = 139 for T, IL1, KGL1, and UGL1F.]

Figure 3. Log likelihood vs $\lambda$ results on Gene data (N = 174, D = 667).

5. Conclusions

We have shown how to learn sparse GGMs where the sparsity pattern has a block structure, which we estimate simultaneously with the graph itself. There are several possibilities for future work. One is to try to eliminate the two-step process, by replacing the dependency network with a Cholesky decomposition of $\Lambda$, which is always positive definite, perhaps combined with a search over node orderings, as in (Dobra et al., 2004). Another possibility is to consider discrete data. This introduces the usual computational difficulty of evaluating the likelihood and its gradient, but standard approximations exist for this task (Wainwright et al., 2007; Lee et al., 2006).

Acknowledgments

We would like to thank John Duchi for sharing data and answering questions about his method, Mark Schmidt for useful discussions and sharing code, and Francois Caron for help on an earlier version of this model.

References

Banerjee, O., Ghaoui, L. E., & d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. of Machine Learning Research, 9, 485-516.

Banerjee, O., Ghaoui, L. E., d'Aspremont, A., & Natsoulis, G. (2006). Convex optimization techniques for fitting sparse Gaussian graphical models. Intl. Conf. on Machine Learning (pp. 89-96).

Bishop, C. (2006). Pattern recognition and machine learning. Springer.

Dahl, J., Vandenberghe, L., & Roychowdhury, V. (2008). Covariance selection for non-chordal graphs via chordal embedding. Optimization Methods and Software, 23, 501-502.

Dempster, A. (1972). Covariance selection. Biometrics, 28, 157-175.

Dobra, A., Hans, C., Jones, B., Nevins, J., Yao, G., & West, M. (2004). Sparse graphical models for exploring gene expression data. J. Multivariate Analysis, 90, 196-212.

Drton, M., & Perlman, M. D. (2004). Model selection for Gaussian concentration graphs. Biometrika, 91, 591-602.

Duchi, J., Gould, S., & Koller, D. (2008). Projected subgradient methods for learning sparse Gaussians. Proc. of the Conf. on Uncertainty in AI.

Friedman, J., Hastie, T., & Tibshirani, R. (2007). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 432-441.

George, E., & McCulloch, R. (1997). Approaches for Bayesian variable selection. Statistica Sinica, 7, 339-373.

Ghahramani, Z., & Beal, M. (2000). Propagation algorithms for variational Bayesian learning. Advances in Neural Info. Proc. Systems (pp. 507-513).

Gu, L., Xing, E., & Kanade, T. (2007). Learning GMRF structures for spatial priors. Proc. IEEE Conf. on Computer Vision and Pattern Recognition.

Heckerman, D., Chickering, D., Meek, C., Rounthwaite, R., & Kadie, C. (2000). Dependency networks for density estimation, collaborative filtering, and data visualization. J. of Machine Learning Research, 1, 49-75.

Huang, J., Liu, N., Pourahmadi, M., & Liu, L. (2006). Covariance selection and estimation via penalized normal likelihood. Biometrika, 93, 85-98.

Ledoit, O., & Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. J. of Multivariate Analysis, 88, 365-411.

Lee, S.-I., Ganapathi, V., & Koller, D. (2006). Efficient structure learning of Markov networks using L1-regularization. Advances in Neural Info. Proc. Systems (pp. 817-824).

Lenkoski, A., & Dobra, A. (2008). Bayesian structural learning and estimation in Gaussian graphical models (Technical Report 545). Department of Statistics, University of Washington.

Mansinghka, V., Kemp, C., Tenenbaum, J., & Griffiths, T. (2006). Structured priors for structure learning. Proc. of the Conf. on Uncertainty in AI.

Meinshausen, N., & Buhlmann, P. (2006). High dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34, 1436-1462.

Nowicki, K., & Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96, 1077-1087.

Rajarratnam, B., Massam, H., & Carvahlo, C. (2008). Flexible covariance estimation in graphical Gaussian models. Annals of Statistics, 36, 2818-2849.

Rothman, A., Bickel, P., Levina, E., & Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2, 494-515.

Schmidt, M., van den Berg, E., Friedlander, M., & Murphy, K. (2009). Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. AI & Statistics.

Shachter, R., & Kenley, C. R. (1989). Gaussian influence diagrams. Management Science, 35, 527-550.

Smith, M., & Kohn, R. (2002). Parsimonious covariance matrix estimation for longitudinal data. J. of the Am. Stat. Assoc., 1141-1153.

Speed, T., & Kiiveri, H. (1986). Gaussian Markov distributions over finite graphs. Annals of Statistics, 14, 138-150.

Wainwright, M., Ravikumar, P., & Lafferty, J. (2007). High-dimensional graphical model selection using L1-regularized logistic regression. Advances in Neural Info. Proc. Systems (pp. 1465-1472).

Yuan, M., & Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94, 19-35.