Multi-Task Learning of Gaussian Graphical Models

Jean Honorio (jhonorio@cs.sunysb.edu) and Dimitris Samaras (samaras@cs.sunysb.edu)
Stony Brook University, Stony Brook, NY 11794, USA

Abstract

We present multi-task structure learning for Gaussian graphical models. We discuss uniqueness and boundedness of the optimal solution of the maximization problem. A block coordinate descent method leads to a provably convergent algorithm that generates a sequence of positive definite solutions. Thus, we reduce the original problem to a sequence of strictly convex regularized quadratic minimization subproblems. We further show that this subproblem leads to the continuous quadratic knapsack problem, for which very efficient methods exist. Finally, we show promising results in a dataset that captures brain function of cocaine addicted and control subjects under conditions of monetary reward.

1. Introduction

Structure learning aims to discover the topology of a probabilistic network of variables such that this network represents a given dataset accurately while maintaining low complexity. Accuracy of representation is measured by the likelihood that the model explains the observed data, while complexity of a graphical model is measured by its number of parameters. Structure learning faces several challenges: the number of possible structures is super-exponential in the number of variables, and the required sample size might be even exponential. Therefore, finding good regularization techniques is very important in order to avoid overfitting and to achieve better generalization.

For Gaussian graphical models, the number of parameters, the number of edges in the structure, and the number of non-zero elements in the inverse covariance or precision matrix are equivalent measures of complexity. Therefore, several techniques focus on enforcing sparseness of the precision matrix. An approximation method proposed in (Meinshausen & Bühlmann, 2006) relies on a sequence of sparse regressions. Maximum likelihood estimation with an $\ell_1$-norm penalty for encouraging sparseness is proposed in (Banerjee et al., 2006; Friedman et al., 2007; Yuan & Lin, 2007).

Suppose that we want to learn the structure of brain region interactions for one person. We can expect that the interaction patterns of two persons are not exactly the same. On the other hand, when learning the structure for one person, we would like to use evidence from other persons as side information in our learning process. This becomes more important in settings with a limited amount of data, such as functional magnetic resonance imaging (fMRI) studies. Multi-task learning allows for a more efficient use of training data that is available for multiple related tasks. In this paper, we consider the computational aspect of multi-task structure learning, which generalizes the learning of sparse Gaussian graphical models to the multi-task setting by replacing the $\ell_1$-norm regularization with an $\ell_{1,\infty}$-norm.

Our contribution in this paper is three-fold. First, we present a block coordinate descent method which is provably convergent and yields sparse and positive definite estimates. Second, we show the connection between our multi-task structure learning problem and the continuous quadratic knapsack problem, which allows us to use existing efficient methods (Helgason et al., 1980; Brucker, 1984; Kiwiel, 2007). Finally, we experimentally show that the cross-validated log-likelihood of our method is more stable and statistically significantly higher than that of the competing methods in an fMRI dataset that captures brain function of cocaine addicted and control subjects under conditions of monetary reward.

Section 2 introduces Gaussian graphical models as well as techniques for learning such structures from data. Section 3 sets up the problem and discusses some of its properties. Section 4 describes our block coordinate descent method. Section 5 shows the connection to the continuous quadratic knapsack problem. Experimental results are shown and explained in Section 6. Main contributions and results are summarized in Section 7.
Table 1. Notation used in this paper.

$\|c\|_1$ : $\ell_1$-norm of $c \in \mathbb{R}^N$, i.e. $\sum_n |c_n|$
$\|c\|_\infty$ : $\ell_\infty$-norm of $c \in \mathbb{R}^N$, i.e. $\max_n |c_n|$
$\mathrm{diag}(c)$ : matrix with the elements of $c \in \mathbb{R}^N$ on its diagonal
$A \succeq 0$ : $A \in \mathbb{R}^{N \times N}$ is symmetric and positive semidefinite
$A \succ 0$ : $A \in \mathbb{R}^{N \times N}$ is symmetric and positive definite
$\|A\|_1$ : $\ell_1$-norm of $A \in \mathbb{R}^{M \times N}$, i.e. $\sum_{mn} |a_{mn}|$
$\|A\|_\infty$ : $\ell_\infty$-norm of $A \in \mathbb{R}^{M \times N}$, i.e. $\max_{mn} |a_{mn}|$
$\|A\|_2$ : spectral norm of $A \in \mathbb{R}^{N \times N}$, i.e. the maximum eigenvalue of $A \succ 0$
$\|A\|_F$ : Frobenius norm of $A \in \mathbb{R}^{M \times N}$, i.e. $\sqrt{\sum_{mn} a_{mn}^2}$
$\langle A, B \rangle$ : scalar product of $A, B \in \mathbb{R}^{M \times N}$, i.e. $\sum_{mn} a_{mn} b_{mn}$

2. Background

In this paper, we use the notation in Table 1. A Gaussian graphical model is a graph in which all random variables are continuous and jointly Gaussian. This model corresponds to the multivariate normal distribution for $N$ variables with covariance matrix $\Sigma \in \mathbb{R}^{N \times N}$. Conditional independence in a Gaussian graphical model is simply reflected in the zero entries of the precision matrix $\Omega = \Sigma^{-1}$ (Lauritzen, 1996). Let $\Omega = \{\omega_{n_1 n_2}\}$; two variables $n_1$ and $n_2$ are conditionally independent if and only if $\omega_{n_1 n_2} = 0$.

The concept of robust estimation by performing covariance selection was first introduced in (Dempster, 1972), where the number of parameters to be estimated is reduced by setting some elements of the precision matrix to zero. Since finding the sparsest precision matrix which fits a dataset is an NP-hard problem (Banerjee et al., 2006), several $\ell_1$-regularization methods have been proposed for learning Gaussian graphical models from data. Given a dense sample covariance matrix $\widehat{\Sigma} \succeq 0$, the problem of finding a sparse precision matrix $\Omega$ by regularized maximum likelihood estimation is given by:

$$\max_{\Omega \succ 0} \; \ell(\widehat{\Sigma}, \Omega) - \rho \|\Omega\|_1 \quad (1)$$

for $\rho > 0$. The term $\rho\|\Omega\|_1$ encourages sparseness of the precision matrix or conditional independence among variables, while the term $\ell(\widehat{\Sigma}, \Omega)$ is the Gaussian log-likelihood, defined as:

$$\ell(\widehat{\Sigma}, \Omega) = \log\det \Omega - \langle \widehat{\Sigma}, \Omega \rangle \quad (2)$$

Several optimization techniques have been proposed for eq.(1): a sequence of box-constrained quadratic programs in covariance selection (Banerjee et al., 2006), solution of the dual problem by sparse regression in the graphical lasso (Friedman et al., 2007), or an approximation via standard determinant maximization with linear inequality constraints in (Yuan & Lin, 2007). Instead of solving eq.(1), the Meinshausen-Bühlmann approximation (Meinshausen & Bühlmann, 2006) obtains the conditional dependencies by performing a sparse linear regression for each variable, using lasso regression (Tibshirani, 1996).

Besides sparseness, several regularizers have been proposed for Gaussian graphical models in single-task learning: for enforcing diagonal structure (Levina et al., 2008), block structure for known block-variable assignments (Duchi et al., 2008a) and unknown block-variable assignments (Marlin & Murphy, 2009; Marlin et al., 2009), or spatial coherence (Honorio et al., 2009).

Multi-task learning has been applied to very diverse problems, such as linear regression (Liu et al., 2009), classification (Jebara, 2004), compressive sensing (Qi et al., 2008), reinforcement learning (Wilson et al., 2007) and structure learning of Bayesian networks (Niculescu-Mizil & Caruana, 2007).
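To make eq.(1) and eq.(2) concrete, the following is a minimal NumPy sketch, not the implementation used in this paper, that evaluates the Gaussian log-likelihood and the $\ell_1$-penalized objective for a given positive definite $\Omega$. The function names and the Cholesky-based computation of $\log\det$ are our own choices.

```python
import numpy as np

def gaussian_log_likelihood(Sigma_hat, Omega):
    """Eq.(2): l(Sigma_hat, Omega) = log det(Omega) - <Sigma_hat, Omega>.
    Illustrative sketch, not the paper's implementation."""
    # Cholesky factorization succeeds only for symmetric positive definite Omega.
    L = np.linalg.cholesky(Omega)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return log_det - np.sum(Sigma_hat * Omega)

def penalized_objective(Sigma_hat, Omega, rho):
    """Objective of eq.(1): l(Sigma_hat, Omega) - rho * ||Omega||_1."""
    return gaussian_log_likelihood(Sigma_hat, Omega) - rho * np.sum(np.abs(Omega))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 5))                    # 200 samples, N = 5 variables
    Sigma_hat = np.cov(X, rowvar=False, bias=True)       # dense sample covariance
    Omega = np.linalg.inv(Sigma_hat + 0.1 * np.eye(5))   # some positive definite point
    print(penalized_objective(Sigma_hat, Omega, rho=0.1))
```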
3. Preliminaries

In this section, we set up the problem and discuss some of its properties.

3.1. Problem Setup

We propose a prior that is motivated by the multi-task learning literature. Given $K$ arbitrary tasks, our goal is to learn one structure for each task, and to promote a consistent sparseness pattern across tasks. For a given task $k$, we learn a precision matrix $\Omega^{(k)} \in \mathbb{R}^{N \times N}$ for $N$ variables. Our multi-task regularizer penalizes corresponding edges across tasks (i.e. $\omega^{(1)}_{n_1 n_2}, \ldots, \omega^{(K)}_{n_1 n_2}$) with the same strength whether the edge appears in $1, 2, \ldots$ or $K$ tasks. As a result, only edges that help to explain the observed data for almost every task will appear in the learnt structures.

Let $\widehat{\Sigma}^{(k)} \succeq 0$ be the dense sample covariance matrix for task $k$, and $T^{(k)} > 0$ be the number of samples in task $k$. The multi-task structure learning problem is defined as:

$$\max_{(\forall k)\, \Omega^{(k)} \succ 0} \; \sum_k T^{(k)} \, \ell(\widehat{\Sigma}^{(k)}, \Omega^{(k)}) - \rho \|\Omega\|_{1,\infty} \quad (3)$$

for $\rho > 0$. The term $\ell(\widehat{\Sigma}^{(k)}, \Omega^{(k)})$ is the Gaussian log-likelihood defined in eq.(2), while the term $\|\Omega\|_{1,\infty}$ is our multi-task regularizer, defined as:

$$\|\Omega\|_{1,\infty} = \sum_{n_1 n_2} \max_k |\omega^{(k)}_{n_1 n_2}| \quad (4)$$

The number of samples is a term that is usually dropped in covariance selection and the graphical lasso, as in eq.(1). For the multi-task structure learning problem, it is important to keep this term when adding the log-likelihoods of several tasks into a single objective function.

3.2. Bounds

In what follows, we discuss uniqueness and boundedness of the optimal solution of the multi-task structure learning problem.

Lemma 1. For $\rho > 0$, the multi-task structure learning problem in eq.(3) is a maximization problem with a concave (but not strictly concave) objective function and convex constraints.

Proof. The Gaussian log-likelihood defined in eq.(2) is concave, since $\log\det$ is concave on the space of symmetric positive definite matrices and $\langle \cdot , \cdot \rangle$ is a linear operator. The multi-task regularizer defined in eq.(4) is a non-smooth convex function. Finally, $\Omega^{(k)} \succ 0$ is a convex constraint.

Theorem 2. For $\rho > 0$, the optimal solution to the multi-task structure learning problem in eq.(3) is unique and bounded as follows:

$$\left( \frac{1}{\|\widehat{\Sigma}^{(k)}\|_2 + \frac{N\rho}{T^{(k)}}} \right) I \;\preceq\; \Omega^{(k)*} \;\preceq\; \frac{NK}{\rho}\, I \quad (5)$$

Proof. By using the identity $\|c\|_1 = \max_{\|a\|_\infty \le 1} a^{\rm T} c$ in eq.(3), we get:

$$\max_{(\forall k)\, \Omega^{(k)} \succ 0} \;\; \min_{(\forall n_1 n_2)\, \|a_{n_1 n_2}\|_1 \le \rho} \; \sum_k T^{(k)} \left( \log\det \Omega^{(k)} - \left\langle \widehat{\Sigma}^{(k)} + \frac{A^{(k)}}{T^{(k)}}, \, \Omega^{(k)} \right\rangle \right) \quad (6)$$

where $a_{n_1 n_2} = (a^{(1)}_{n_1 n_2}, \ldots, a^{(K)}_{n_1 n_2})^{\rm T}$ and $A^{(k)} \in \mathbb{R}^{N \times N}$. By virtue of Sion's minimax theorem, we can swap the order of max and min. Furthermore, note that the optimal solution of the inner problem is independent for each $k$ and is given by $\Omega^{(k)} = (\widehat{\Sigma}^{(k)} + \frac{A^{(k)}}{T^{(k)}})^{-1}$. By replacing this solution in eq.(6), we get the dual problem of eq.(3):

$$\min_{(\forall n_1 n_2)\, \|a_{n_1 n_2}\|_1 \le \rho} \; -\sum_k T^{(k)} \log\det\left( \widehat{\Sigma}^{(k)} + \frac{A^{(k)}}{T^{(k)}} \right) - NK \quad (7)$$

In order to find a lower bound for the minimum eigenvalue of $\Omega^{(k)*}$, note that $\|\Omega^{(k)*-1}\|_2 = \|\widehat{\Sigma}^{(k)} + \frac{A^{(k)}}{T^{(k)}}\|_2 \le \|\widehat{\Sigma}^{(k)}\|_2 + \frac{\|A^{(k)}\|_2}{T^{(k)}} \le \|\widehat{\Sigma}^{(k)}\|_2 + \frac{N \|A^{(k)}\|_\infty}{T^{(k)}}$. Since $\|a_{n_1 n_2}\|_1 \le \rho$, it follows that $|a^{(k)}_{n_1 n_2}| = \rho$ only in the extreme case in which $(\forall k_1 \ne k)\, a^{(k_1)}_{n_1 n_2} = 0$, and therefore $\|A^{(k)}\|_\infty \le \rho$.

In order to find an upper bound for the maximum eigenvalue of $\Omega^{(k)*}$, note that, at the optimum, the primal-dual gap is zero:

$$-NK + \sum_k T^{(k)} \langle \widehat{\Sigma}^{(k)}, \Omega^{(k)*} \rangle + \rho \|\Omega^*\|_{1,\infty} = 0 \quad (8)$$

The upper bound is found as follows: $\|\Omega^{(k)*}\|_2 \le \|\Omega^{(k)*}\|_F \le \|\Omega^{(k)*}\|_1 \le \|\Omega^*\|_{1,\infty} = \left( NK - \sum_k T^{(k)} \langle \widehat{\Sigma}^{(k)}, \Omega^{(k)*} \rangle \right)/\rho$, and since $\widehat{\Sigma}^{(k)} \succeq 0$ and $\Omega^{(k)*} \succ 0$, it follows that $\langle \widehat{\Sigma}^{(k)}, \Omega^{(k)*} \rangle \ge 0$.
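Before turning to the optimization method, the following is a short hedged sketch (our own code, not the authors') that evaluates the multi-task objective of eq.(3) with the $\ell_{1,\infty}$ regularizer of eq.(4); it assumes each $\Omega^{(k)}$ is positive definite.

```python
import numpy as np

def l1_inf_norm(Omegas):
    """Eq.(4): sum over entries (n1, n2) of max_k |omega^(k)_{n1 n2}|."""
    stacked = np.stack(Omegas)                   # shape (K, N, N)
    return np.sum(np.max(np.abs(stacked), axis=0))

def multitask_objective(Sigma_hats, Omegas, T, rho):
    """Eq.(3): sum_k T^(k) l(Sigma_hat^(k), Omega^(k)) - rho * ||Omega||_{1,inf}.
    Illustrative sketch, not the paper's implementation."""
    total = 0.0
    for Sigma_hat, Omega, T_k in zip(Sigma_hats, Omegas, T):
        L = np.linalg.cholesky(Omega)            # requires Omega^(k) positive definite
        log_det = 2.0 * np.sum(np.log(np.diag(L)))
        total += T_k * (log_det - np.sum(Sigma_hat * Omega))
    return total - rho * l1_inf_norm(Omegas)
```

Note how, unlike eq.(1), each task's log-likelihood is weighted by its number of samples $T^{(k)}$, as discussed above.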
4. Block Coordinate Descent Method

In this section, we develop a block coordinate descent method for our multi-task structure learning problem, and discuss some of its properties. We apply the block coordinate descent method to the primal problem, unlike covariance selection (Banerjee et al., 2006) and the graphical lasso (Friedman et al., 2007), which optimize the dual. Our choice of optimizing the primal follows from the fact that the dual formulation in eq.(7) leads to a sum of $K$ $\log\det$ terms which cannot be simplified to a quadratic problem unless $K = 1$.

For clarity of exposition, we assume that the diagonals of $\Omega^{(1)}, \ldots, \Omega^{(K)}$ are not penalized by our multi-task regularizer defined in eq.(4). In case they are penalized, an additional continuous logarithmic knapsack problem needs to be solved. We point out that all the following theorems and lemmas still hold in this case.

Lemma 3. The solution sequence generated by the block coordinate descent method is bounded and every cluster point is a solution of the multi-task structure learning problem in eq.(3).

Proof. The non-smooth regularizer $\|\Omega\|_{1,\infty}$ is separable into a sum of $O(N^2)$ individual functions of the form $\max_k |\omega^{(k)}_{n_1 n_2}|$. These functions are defined over blocks of $K$ variables, i.e. $\omega^{(1)}_{n_1 n_2}, \ldots, \omega^{(K)}_{n_1 n_2}$. The objective function in eq.(3) is continuous on a compact level set. By virtue of Theorem 4.1 in (Tseng, 2001), we prove our claim.

Theorem 4. The block coordinate descent method for the multi-task structure learning problem in eq.(3) generates a sequence of positive definite solutions.

Proof. Maximization can be performed with respect to one row and column of all precision matrices $\Omega^{(k)}$ at a time. Without loss of generality, we use the last row and column in our derivation, since permutation of rows and columns is always possible. Let:

$$\Omega^{(k)} = \begin{bmatrix} W^{(k)} & y^{(k)} \\ y^{(k){\rm T}} & z^{(k)} \end{bmatrix}, \qquad \widehat{\Sigma}^{(k)} = \begin{bmatrix} S^{(k)} & u^{(k)} \\ u^{(k){\rm T}} & v^{(k)} \end{bmatrix} \quad (9)$$

where $W^{(k)}, S^{(k)} \in \mathbb{R}^{(N-1) \times (N-1)}$ and $y^{(k)}, u^{(k)} \in \mathbb{R}^{N-1}$. In terms of the variables $y^{(k)}, z^{(k)}$ and the constant matrix $W^{(k)}$, the multi-task structure learning problem in eq.(3) can be reformulated as:

$$\max_{(\forall k)\, \Omega^{(k)} \succ 0} \; \sum_k T^{(k)} \left( \log\!\left(z^{(k)} - y^{(k){\rm T}} W^{(k)-1} y^{(k)}\right) - 2 u^{(k){\rm T}} y^{(k)} - v^{(k)} z^{(k)} \right) - 2\rho \sum_n \max_k |y^{(k)}_n| \quad (10)$$

If $\Omega^{(k)}$ is a symmetric matrix, according to the Haynsworth inertia formula, $\Omega^{(k)} \succ 0$ if and only if its Schur complement $z^{(k)} - y^{(k){\rm T}} W^{(k)-1} y^{(k)} > 0$ and $W^{(k)} \succ 0$. By maximizing eq.(10) with respect to $z^{(k)}$, we get:

$$z^{(k)} - y^{(k){\rm T}} W^{(k)-1} y^{(k)} = \frac{1}{v^{(k)}} \quad (11)$$

and since $v^{(k)} > 0$, this implies that the Schur complement in eq.(11) is positive. Finally, in an iterative optimization algorithm, it suffices to initialize $\Omega^{(k)}$ to a matrix that is known to be positive definite, e.g. a diagonal matrix with positive elements.

Theorem 5. The block coordinate descent method for the multi-task structure learning problem in eq.(3) is equivalent to solving a sequence of strictly convex $\ell_{1,\infty}$ regularized quadratic subproblems:

$$\min_{(\forall k)\, y^{(k)} \in \mathbb{R}^{N-1}} \; \sum_k T^{(k)} \left( \frac{1}{2} v^{(k)} y^{(k){\rm T}} W^{(k)-1} y^{(k)} + u^{(k){\rm T}} y^{(k)} \right) + \rho \sum_n \max_k |y^{(k)}_n| \quad (12)$$

Proof. By replacing the optimal $z^{(k)}$ given by eq.(11) into the objective function in eq.(10), we get eq.(12). Since $W^{(k)} \succ 0 \Rightarrow W^{(k)-1} \succ 0$, eq.(12) is strictly convex.

Lemma 6. If $\max_n \sum_k T^{(k)} |u^{(k)}_n| \le \rho$, the $\ell_{1,\infty}$ regularized quadratic problem in eq.(12) has the minimizer $(\forall k)\, y^{(k)*} = 0$.

Proof. The problem in eq.(12) has the minimizer $(\forall k)\, y^{(k)*} = 0$ if and only if $0$ belongs to the subdifferential set of the non-smooth objective function at $(\forall k)\, y^{(k)} = 0$, i.e. $(\exists A \in \mathbb{R}^{(N-1) \times K})\; (T^{(1)} u^{(1)}, \ldots, T^{(K)} u^{(K)}) + A = 0 \,\wedge\, \max_n \sum_k |a_{nk}| \le \rho$. This condition is true for $\max_n \sum_k |T^{(k)} u^{(k)}_n| \le \rho$, and since $(\forall k)\, T^{(k)} > 0$, we prove our claim.

Remark 7. By using Lemma 6, we can reduce the size of the original problem by removing variables for which this condition holds, since the condition only depends on the dense sample covariance matrices.
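The test of Lemma 6 can be used to screen an entire row and column across all tasks before solving eq.(12). A minimal sketch of this check, under our own naming, is given below; u_list holds the vectors $u^{(k)}$ of eq.(9) for the row/column currently being optimized.

```python
import numpy as np

def can_zero_out(u_list, T, rho):
    """Lemma 6 (illustrative sketch): if max_n sum_k T^(k) |u^(k)_n| <= rho,
    the minimizer of eq.(12) is y^(k)* = 0 for every task k, so the
    corresponding variable can be removed from the subproblem (Remark 7)."""
    weighted = np.stack([T_k * np.abs(u_k) for u_k, T_k in zip(u_list, T)])
    return float(np.max(np.sum(weighted, axis=0))) <= rho
```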
Theorem 8. The coordinate descent method for the $\ell_{1,\infty}$ regularized quadratic problem in eq.(12) is equivalent to solving a sequence of strictly convex $\ell_\infty$ regularized separable quadratic subproblems:

$$\min_{x \in \mathbb{R}^K} \; \frac{1}{2} x^{\rm T} \mathrm{diag}(q)\, x - c^{\rm T} x + \rho \|x\|_\infty \quad (13)$$

Proof. Without loss of generality, we use the last row and column in our derivation, since permutation of rows and columns is always possible. Let:

$$W^{(k)-1} = \begin{bmatrix} H^{(k)}_{11} & h^{(k)}_{12} \\ h^{(k){\rm T}}_{12} & h^{(k)}_{22} \end{bmatrix}, \qquad y^{(k)} = \begin{bmatrix} y^{(k)}_1 \\ x_k \end{bmatrix}, \qquad u^{(k)} = \begin{bmatrix} u^{(k)}_1 \\ u^{(k)}_2 \end{bmatrix} \quad (14)$$

where $H^{(k)}_{11} \in \mathbb{R}^{(N-2) \times (N-2)}$ and $h^{(k)}_{12}, y^{(k)}_1, u^{(k)}_1 \in \mathbb{R}^{N-2}$. In terms of the variable $x$ and the constants $q_k = T^{(k)} v^{(k)} h^{(k)}_{22}$ and $c_k = -T^{(k)} \big( v^{(k)} h^{(k){\rm T}}_{12} y^{(k)}_1 + u^{(k)}_2 \big)$, the $\ell_{1,\infty}$ regularized quadratic problem in eq.(12) can be reformulated as in eq.(13). Moreover, since $(\forall k)\, T^{(k)} > 0 \wedge v^{(k)} > 0 \wedge h^{(k)}_{22} > 0 \Rightarrow q > 0$, eq.(13) is strictly convex.
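For concreteness, the constants $q_k$ and $c_k$ of eq.(14) can be assembled as in the following sketch (our own code and naming); it assumes the coordinate currently being updated has been permuted to the last position of each $y^{(k)}$, so that the blocks of $W^{(k)-1}$ split as in eq.(14).

```python
import numpy as np

def separable_quadratic_data(Winv_list, y_list, u_list, v_list, T):
    """Constants of eq.(14): q_k = T^(k) v^(k) h22^(k) and
    c_k = -T^(k) (v^(k) h12^(k)^T y1^(k) + u2^(k)).
    Illustrative sketch, not the paper's implementation."""
    q, c = [], []
    for Winv, y, u, v, T_k in zip(Winv_list, y_list, u_list, v_list, T):
        h12, h22 = Winv[:-1, -1], Winv[-1, -1]   # last column of W^(k)^{-1}
        y1, u2 = y[:-1], u[-1]
        q.append(T_k * v * h22)
        c.append(-T_k * (v * (h12 @ y1) + u2))
    return np.array(q), np.array(c)
```

The resulting vectors $q$ and $c$ feed directly into the $\ell_\infty$ regularized problem of eq.(13).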
5. Continuous Quadratic Knapsack Problem

In this section, we show the connection between the multi-task structure learning problem and the continuous quadratic knapsack problem, for which very efficient methods exist.

The continuous quadratic knapsack problem has been solved in several areas. (Helgason et al., 1980) provides an $O(K \log K)$ algorithm which initially sorts the breakpoints. (Brucker, 1984) and later (Kiwiel, 2007) provide deterministic linear-time algorithms by using medians of breakpoint subsets. In the context of machine learning, (Duchi et al., 2008b) provides a randomized linear-time algorithm, while (Liu et al., 2009) provides an $O(K \log K)$ algorithm. We point out that (Duchi et al., 2008b; Liu et al., 2009) assume that the weights of the quadratic term are all equal, i.e. $(\forall k)\, q_k = 1$. In this paper, we assume arbitrary positive weights, i.e. $(\forall k)\, q_k > 0$.

We point out to the reader that the variables $y, z$ used in this section have a different meaning with respect to the previous sections. We prefer to use them, since those are the variables regularly used as unknowns.

Theorem 9. For $q > 0$, $\rho > 0$, the $\ell_\infty$ regularized separable quadratic problem in eq.(13) is equivalent to the separable quadratic problem with one $\ell_1$ constraint:

$$\min_{\|y\|_1 \le \rho} \; \frac{1}{2} (y - c)^{\rm T} \mathrm{diag}(q)^{-1} (y - c) \quad (15)$$

Furthermore, their optimal solutions are related by $x^* = \mathrm{diag}(q)^{-1}(c - y^*)$.

Proof. By Lagrangian duality, the problem in eq.(15) is the dual of the problem in eq.(13). Furthermore, strong duality holds in this case.

Remark 10. In eq.(15), we can assume that $(\forall k)\, c_k \ne 0$. If $c_k = 0$, the partial optimal solution is $y^*_k = 0$, and since this assignment does not affect the constraint, we can safely remove $y_k$ from the optimization problem.

Remark 11. In what follows, we assume that $\|c\|_1 > \rho$. If $\|c\|_1 \le \rho$, the unconstrained optimal solution of eq.(15) is also its optimal solution, since $y^* = c$ is inside the feasible region given that $\|y^*\|_1 \le \rho$.

Lemma 12. For $q > 0$, $(\forall k)\, c_k \ne 0$, $\|c\|_1 > \rho$, the optimal solution $y^*$ of the separable quadratic problem with one $\ell_1$ constraint in eq.(15) belongs to the same orthant as the unconstrained optimal solution $c$, i.e. $(\forall k)\, y^*_k c_k \ge 0$.

Proof. We prove this by contradiction. Assume $(\exists k_1)\, y^*_{k_1} c_{k_1} < 0$. Let $\widetilde{y}$ be a vector such that $\widetilde{y}_{k_1} = 0$ and $(\forall k_2 \ne k_1)\, \widetilde{y}_{k_2} = y^*_{k_2}$. The solution $\widetilde{y}$ is feasible, since $\|y^*\|_1 \le \rho$ and $\|\widetilde{y}\|_1 = \|y^*\|_1 - |y^*_{k_1}| \le \rho$. The difference in the objective function between $y^*$ and $\widetilde{y}$ is $\frac{1}{2}(y^* - c)^{\rm T} \mathrm{diag}(q)^{-1} (y^* - c) - \frac{1}{2}(\widetilde{y} - c)^{\rm T} \mathrm{diag}(q)^{-1} (\widetilde{y} - c) = \frac{1}{2 q_{k_1}} \big( y^{*2}_{k_1} - 2 c_{k_1} y^*_{k_1} \big) > \frac{y^{*2}_{k_1}}{2 q_{k_1}} > 0$. Thus, the objective function for $\widetilde{y}$ is smaller than for $y^*$ (the assumed optimal solution), which is a contradiction.

Theorem 13. For $q > 0$, $(\forall k)\, c_k \ne 0$, $\|c\|_1 > \rho$, the separable quadratic problem with one $\ell_1$ constraint in eq.(15) is equivalent to the continuous quadratic knapsack problem:

$$\min_{z \ge 0,\; 1^{\rm T} z = \rho} \; \sum_k \frac{1}{2 q_k} (z_k - |c_k|)^2 \quad (16)$$

Furthermore, their optimal solutions are related by $(\forall k)\, y^*_k = \mathrm{sgn}(c_k)\, z^*_k$.

Proof. By invoking Lemma 12, we can replace $(\forall k)\, y_k = \mathrm{sgn}(c_k) z_k,\; z_k \ge 0$ in eq.(15). Finally, we change the inequality constraint $1^{\rm T} z \le \rho$ to an equality constraint, since $\|c\|_1 > \rho$ and therefore the optimal solution must be on the boundary of the constraint set.

Lemma 14. For $q > 0$, $(\forall k)\, c_k \ne 0$, $\|c\|_1 > \rho$, the continuous quadratic knapsack problem in eq.(16) has the solution $z_k(\nu) = \max(0, |c_k| - q_k \nu)$ for some $\nu$, and furthermore:

$$z^* = z(\nu^*) \;\wedge\; 1^{\rm T} z(\nu^*) = \rho \quad (17)$$

Proof. The Lagrangian of eq.(16) is:

$$\min_{z \ge 0} \; \sum_k \frac{1}{2 q_k} (z_k - |c_k|)^2 + \nu \,(1^{\rm T} z - \rho) \quad (18)$$

Both results can be obtained by invoking the Karush-Kuhn-Tucker optimality conditions on eq.(18).

Remark 15. Note that $z_k(\nu) = \max(0, |c_k| - q_k \nu)$ is a decreasing piecewise linear function with breakpoint $\nu = \frac{|c_k|}{q_k} > 0$. By Lemma 14, finding the optimal $z^*$ is equivalent to finding the $\nu$ at which the piecewise linear function $1^{\rm T} z(\nu)$ produces $\rho$.

Lemma 16. For $q > 0$, $(\forall k)\, c_k \ne 0$, $\|c\|_1 > \rho$, the continuous quadratic knapsack problem in eq.(16) has the optimal solution $z^*_k = \max(0, |c_k| - q_k \nu^*)$ for:

$$\frac{|c_{\pi_{k^*}}|}{q_{\pi_{k^*}}} \;\ge\; \nu^* = \frac{\sum_{k=1}^{k^*} |c_{\pi_k}| - \rho}{\sum_{k=1}^{k^*} q_{\pi_k}} \;\ge\; \frac{|c_{\pi_{k^*+1}}|}{q_{\pi_{k^*+1}}} \quad (19)$$

where the breakpoints are sorted in decreasing order by a permutation $\pi$ of the indices $1, 2, \ldots, K$, i.e. $\frac{|c_{\pi_1}|}{q_{\pi_1}} \ge \frac{|c_{\pi_2}|}{q_{\pi_2}} \ge \cdots \ge \frac{|c_{\pi_K}|}{q_{\pi_K}} \ge \frac{|c_{\pi_{K+1}}|}{q_{\pi_{K+1}}} \equiv 0$.

Proof. Given $k^*$, $\nu^*$ can be found straightforwardly by using the equation of the line. In order to find $k^*$, note that we want to find the range in which $1^{\rm T} z\!\left(\frac{|c_{\pi_{k^*}}|}{q_{\pi_{k^*}}}\right) \le \rho \;\wedge\; 1^{\rm T} z\!\left(\frac{|c_{\pi_{k^*+1}}|}{q_{\pi_{k^*+1}}}\right) \ge \rho$.
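The following is an $O(K \log K)$ sketch in the spirit of Lemma 16 and (Helgason et al., 1980): sort the breakpoints $|c_k|/q_k$, locate the interval where $1^{\rm T} z(\nu) = \rho$, and recover $z^*$. This is our own illustrative code under the assumptions of Lemma 16 ($q > 0$, $c_k \ne 0$, $\|c\|_1 > \rho$), not the authors' implementation; a linear-time median-based method (Kiwiel, 2007) could be used instead.

```python
import numpy as np

def knapsack_nu(q, c, rho):
    """nu* of Lemma 16, found by sorting the breakpoints |c_k| / q_k.
    Illustrative sketch; assumes q > 0, c_k != 0, ||c||_1 > rho."""
    b = np.abs(c) / q
    order = np.argsort(-b)                       # breakpoints in decreasing order
    cum_c = np.cumsum(np.abs(c)[order])
    cum_q = np.cumsum(q[order])
    for j in range(len(c)):
        nu = (cum_c[j] - rho) / cum_q[j]         # equation of the line on this segment
        lower = b[order[j + 1]] if j + 1 < len(c) else 0.0
        if lower <= nu <= b[order[j]]:
            return nu
    raise ValueError("no valid breakpoint interval found")

def knapsack_solution(q, c, rho):
    """z*_k = max(0, |c_k| - q_k nu*) solves the knapsack problem of eq.(16)."""
    nu = knapsack_nu(q, c, rho)
    return np.maximum(0.0, np.abs(c) - q * nu)
```

By Theorem 13, the solution of eq.(15) is then recovered as $y^*_k = \mathrm{sgn}(c_k)\, z^*_k$.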
Theorem 17. For $q > 0$, $\rho > 0$, the $\ell_\infty$ regularized separable quadratic problem in eq.(13) has the optimal solution:

$$x^*_k = \begin{cases} 0 & \|c\|_1 \le \rho \\[4pt] \dfrac{c_k}{q_k} & \|c\|_1 > \rho \;\wedge\; \nu^* > \dfrac{|c_k|}{q_k} \\[8pt] \mathrm{sgn}(c_k)\, \nu^* & \|c\|_1 > \rho \;\wedge\; \nu^* \le \dfrac{|c_k|}{q_k} \end{cases} \quad (20)$$

where $\nu^* = \frac{\sum_{k=1}^{k^*} |c_{\pi_k}| - \rho}{\sum_{k=1}^{k^*} q_{\pi_k}}$ as in Lemma 16.

Proof. For $\|c\|_1 \le \rho$, from Remark 11 we know that $y^* = c$. By Theorem 9, the optimal solution of eq.(13) is $x^* = \mathrm{diag}(q)^{-1}(c - y^*) = 0$, and we prove the first claim.

For $\|c\|_1 > \rho$, by Theorem 9, the optimal solution of eq.(13) is $x^*_k = \frac{1}{q_k}(c_k - y^*_k)$. By Theorem 13, $x^*_k = \frac{1}{q_k}(c_k - \mathrm{sgn}(c_k) z^*_k)$. By Lemma 16, $x^*_k = \frac{1}{q_k}\big(c_k - \mathrm{sgn}(c_k)\max(0, |c_k| - q_k \nu^*)\big)$. If $\nu^* > \frac{|c_k|}{q_k}$, then $\max(0, |c_k| - q_k \nu^*) = 0$ and $x^*_k = \frac{c_k}{q_k}$, and we prove the second claim. If $\nu^* \le \frac{|c_k|}{q_k}$, then $x^*_k = \frac{1}{q_k}\big(c_k - \mathrm{sgn}(c_k)(|c_k| - q_k \nu^*)\big) = \mathrm{sgn}(c_k)\, \nu^*$, and we prove the third claim.

Algorithm 1 shows the block coordinate descent method in detail. A careful implementation of the algorithm allows obtaining a time complexity of $O(LN^3K)$ for $L$ iterations, $N$ variables and $K$ tasks. In our experiments, the algorithm converges quickly, usually in $L = 10$ iterations. The polynomial dependence $O(N^3)$ on the number of variables is expected, since we cannot produce an algorithm faster than computing the inverse of the sample covariance in the case of an infinite sample. The linear-time dependence $O(K)$ on the number of tasks can be accomplished by using a deterministic linear-time method for solving the continuous quadratic knapsack problem, based on medians of breakpoint subsets (Kiwiel, 2007). A very easy-to-implement $O(K \log K)$ algorithm is obtained by initially sorting the breakpoints and searching the range for which Lemma 16 holds.

Algorithm 1  Block Coordinate Descent
  Input: $\rho > 0$, for each $k$: $\widehat{\Sigma}^{(k)} \succeq 0$, $T^{(k)} > 0$
  Initialize for each $k$: $\Omega^{(k)} = \mathrm{diag}(\widehat{\Sigma}^{(k)})^{-1}$
  for each iteration $1, \ldots, L$ and each variable $1, \ldots, N$ do
    Split for each $k$: $\Omega^{(k)}$ into $W^{(k)}, y^{(k)}, z^{(k)}$ and $\widehat{\Sigma}^{(k)}$ into $S^{(k)}, u^{(k)}, v^{(k)}$ as described in eq.(9)
    Update for each $k$: $W^{(k)-1}$ by using the Sherman-Morrison-Woodbury formula (note that when iterating from one variable to the next, only one row and column change in matrix $W^{(k)}$)
    for each variable $1, \ldots, N-1$ do
      Split for each $k$: $W^{(k)-1}, y^{(k)}, u^{(k)}$ as in eq.(14)
      Solve the $\ell_\infty$ regularized separable quadratic problem by eq.(20), either by sorting the breakpoints or by using medians of breakpoint subsets
    end for
    Update for each $k$: $z^{(k)} \leftarrow \frac{1}{v^{(k)}} + y^{(k){\rm T}} W^{(k)-1} y^{(k)}$
  end for
  Output: for each $k$: $\Omega^{(k)} \succ 0$
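Putting Lemma 16 and Theorem 17 together, the inner step of Algorithm 1 admits the closed form of eq.(20). The sketch below is our own code (the breakpoint search repeats the one shown after Lemma 16); entries with $c_k = 0$ simply yield $x^*_k = 0$, consistent with Remark 10.

```python
import numpy as np

def solve_scalar_subproblem(q, c, rho):
    """Eq.(20): minimizer of 0.5 x^T diag(q) x - c^T x + rho * max_k |x_k|,
    assuming q > 0 and rho > 0.  Illustrative sketch."""
    q = np.asarray(q, dtype=float)
    c = np.asarray(c, dtype=float)
    if np.sum(np.abs(c)) <= rho:
        return np.zeros_like(c)                  # first case of eq.(20)
    # nu* from Lemma 16: breakpoints |c_k| / q_k sorted in decreasing order
    b = np.abs(c) / q
    order = np.argsort(-b)
    cum_c = np.cumsum(np.abs(c)[order])
    cum_q = np.cumsum(q[order])
    nu = None
    for j in range(len(c)):
        cand = (cum_c[j] - rho) / cum_q[j]
        lower = b[order[j + 1]] if j + 1 < len(c) else 0.0
        if lower <= cand <= b[order[j]]:
            nu = cand
            break
    assert nu is not None
    x = c / q                                    # second case: nu* > |c_k| / q_k
    clipped = b >= nu                            # third case: nu* <= |c_k| / q_k
    x[clipped] = np.sign(c[clipped]) * nu
    return x
```

Each call costs $O(K \log K)$; replacing the sort by a median-based linear-time search (Kiwiel, 2007) yields the $O(K)$ dependence per subproblem discussed above.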
6. Experimental Results

For experimental validation, we used an fMRI dataset that captures brain function of cocaine addicted and control subjects under conditions of monetary reward. The dataset collected by (Goldstein et al., 2007) contains 28 subjects: 16 cocaine addicted and 12 control. Six sessions were acquired for each subject. Each session contains 87 scans taken every 3.5 seconds. Registration of the dataset to the same spatial reference template (Talairach space) and spatial smoothing were performed in SPM2 (http://www.fil.ion.ucl.ac.uk/spm/). We extracted voxels from the gray matter only, and grouped them into 157 regions by using standard labels given by the Talairach Daemon (http://www.talairach.org/). These regions span the entire brain (cerebellum, cerebrum and brainstem). In order to capture laterality effects, we have regions for the left and right side of the brain.

First, we test the idea of learning one Gaussian graphical model for each of the six sessions, i.e. each session is a task. We performed five-fold cross-validation on the subjects, and report the log-likelihood on the testing set (scaled for visualization purposes). In Figure 1, we can observe that the log-likelihood of our method is higher than that of the competing methods.

Figure 1. Cross-validated log-likelihood of structures learnt for each of the six sessions on cocaine addicted subjects (left) and control subjects (right). Our multi-task method (MT) outperforms Meinshausen-Bühlmann with AND-rule (MA), OR-rule (MO), graphical lasso (GL), covariance selection (CS) and Tikhonov regularization (TR).

Second, we test the idea of learning one Gaussian graphical model for each subject, i.e. each subject is a task. It is well known that fMRI datasets have more variability across subjects than across sessions of the same subject. Therefore, our cross-validation setting works as follows: we use one session as the training set, and the remaining five sessions as the testing set. We repeat this procedure for all six sessions and report the log-likelihood (scaled for visualization purposes). In Figure 2, similar to the previous results, we can observe that the log-likelihood of our method is higher than that of the competing methods. Moreover, our method is more stable for low regularization levels than the other methods in our evaluation, which perform very poorly.

Figure 2. Cross-validated log-likelihood of structures learnt for each subject on cocaine addicted subjects (left) and control subjects (right). Our multi-task method (MT) is more stable for low regularization levels and outperforms Meinshausen-Bühlmann with AND-rule (MA), OR-rule (MO), graphical lasso (GL), covariance selection (CS) and Tikhonov regularization (TR).

In order to measure the statistical significance of our previously reported log-likelihoods, we further compared the best parameter setting for each of the techniques. In Table 2, we report the two-sample Z-statistic for the difference of our technique minus each competing method. Except for a few subjects, the cross-validated log-likelihood of our method is statistically significantly higher (95%, Z > 1.65).

Table 2. Z-statistic for the difference of log-likelihoods between our technique and each competing method, for 16 cocaine addicted subjects. Except for a few cases (marked with an asterisk), our method is statistically significantly better (95%, Z > 1.65) than Meinshausen-Bühlmann with AND-rule (MA), OR-rule (MO), graphical lasso (GL), covariance selection (CS) and Tikhonov regularization (TR).

Method   S1    S2    S3    S4    S5    S6    S7    S8
MA      27.4  14.7   9.1  12.0  18.9  10.4   9.0  19.6
MO      25.6  17.0  10.4  13.7  19.4  10.4  10.7  20.4
GL       2.0   2.7   1.9   1.5*  0.7*  1.8   2.7   2.2
CS       2.0   2.6   1.9   1.5*  0.7*  1.8   2.7   2.1
TR      15.4   5.1   3.6   6.3  10.3   6.7   3.5  12.0

Method   S9   S10   S11   S12   S13   S14   S15   S16
MA       9.6   8.1   8.5  17.2  23.2  19.2  19.5  10.3
MO      13.5  11.2  10.1  18.5  22.5  17.6  21.1  13.9
GL       4.3   2.9   1.8   1.8   1.8   2.2   4.7   2.9
CS       4.3   2.9   1.8   1.8   1.8   2.2   4.7   2.9
TR       3.2   2.2   3.7   8.8   8.8  11.4   8.0   5.0

We show a subgraph of learnt structures for three randomly selected cocaine addicted subjects in Figure 3. We can observe that the sparseness pattern of the structures produced by our multi-task method is consistent across subjects.

Figure 3. Subgraph of ten brain regions from learnt structures for three randomly selected cocaine addicted subjects, for our multi-task method (top) and graphical lasso (bottom). Regularization parameter $\rho = 0.027$. Positive interactions are shown in blue, negative interactions are shown in red. Notice that sparseness of our structures is consistent across subjects.
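For reference, the cross-validated score reported above is the Gaussian log-likelihood of eq.(2) evaluated on held-out scans. A minimal sketch of such an evaluation is shown below; this is our own illustrative code, and the scaling applied for visualization in the figures is not reproduced.

```python
import numpy as np

def held_out_log_likelihood(X_test, Omega):
    """Eq.(2) on a held-out session: l(Sigma_test, Omega), where Sigma_test
    is the dense sample covariance of the test scans.  Illustrative sketch."""
    Sigma_test = np.cov(X_test, rowvar=False, bias=True)
    L = np.linalg.cholesky(Omega)                # Omega must be positive definite
    return 2.0 * np.sum(np.log(np.diag(L))) - np.sum(Sigma_test * Omega)
```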
7. Conclusions and Future Work

In this paper, we generalized the learning of sparse Gaussian graphical models to the multi-task setting by replacing the $\ell_1$-norm regularization with an $\ell_{1,\infty}$-norm. We presented a block coordinate descent method which is provably convergent and yields sparse and positive definite estimates. We showed the connection between our multi-task structure learning problem and the continuous quadratic knapsack problem. Finally, we experimentally showed that the cross-validated log-likelihood of our method is more stable and statistically significantly higher than that of the competing methods in a brain fMRI dataset.

There are several ways of extending this research. Methods for selecting the regularization parameter need to be further investigated. In practice, our technique converges in a small number of iterations, but a more precise analysis of the rate of convergence needs to be performed. Finally, model selection consistency when the number of samples grows to infinity needs to be proved.

Acknowledgments

We thank Rita Goldstein for providing us the fMRI dataset, and Luis Ortiz for his guidance and helpful comments. This work was supported in part by NIDA Grants 1 R01 DA020949 and 1 R01 DA023579.

References

Banerjee, O., El Ghaoui, L., d'Aspremont, A., and Natsoulis, G. Convex optimization techniques for fitting sparse Gaussian graphical models. International Conference on Machine Learning, 2006.
Brucker, P. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 1984.

Dempster, A. Covariance selection. Biometrics, 1972.

Duchi, J., Gould, S., and Koller, D. Projected subgradient methods for learning sparse Gaussians. Uncertainty in Artificial Intelligence, 2008a.

Duchi, J., Shalev-Shwartz, S., Singer, Y., and Chandra, T. Efficient projections onto the $\ell_1$-ball for learning in high dimensions. International Conference on Machine Learning, 2008b.

Friedman, J., Hastie, T., and Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2007.

Goldstein, R., Tomasi, D., Alia-Klein, N., Zhang, L., Telang, F., and Volkow, N. The effect of practice on a sustained attention task in cocaine abusers. NeuroImage, 2007.

Helgason, K., Kennington, J., and Lall, H. A polynomially bounded algorithm for a singly constrained quadratic program. Mathematical Programming, 1980.

Honorio, J., Ortiz, L., Samaras, D., Paragios, N., and Goldstein, R. Sparse and locally constant Gaussian graphical models. Neural Information Processing Systems, 2009.

Jebara, T. Multi-task feature and kernel selection for SVMs. International Conference on Machine Learning, 2004.

Kiwiel, K. On linear-time algorithms for the continuous quadratic knapsack problem. Journal of Optimization Theory and Applications, 2007.

Lauritzen, S. Graphical Models. Oxford Press, 1996.

Levina, E., Rothman, A., and Zhu, J. Sparse estimation of large covariance matrices via a nested lasso penalty. The Annals of Applied Statistics, 2008.

Liu, H., Palatucci, M., and Zhang, J. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. International Conference on Machine Learning, 2009.

Marlin, B. and Murphy, K. Sparse Gaussian graphical models with unknown block structure. International Conference on Machine Learning, 2009.

Marlin, B., Schmidt, M., and Murphy, K. Group sparse priors for covariance estimation. Uncertainty in Artificial Intelligence, 2009.

Meinshausen, N. and Bühlmann, P. High dimensional graphs and variable selection with the lasso. The Annals of Statistics, 2006.

Niculescu-Mizil, A. and Caruana, R. Inductive transfer for Bayesian network structure learning. Artificial Intelligence and Statistics, 2007.

Qi, Y., Liu, D., Carin, L., and Dunson, D. Multi-task compressive sensing with Dirichlet process priors. International Conference on Machine Learning, 2008.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 1996.

Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 2001.

Wilson, A., Fern, A., Ray, S., and Tadepalli, P. Multi-task reinforcement learning: A hierarchical Bayesian approach. International Conference on Machine Learning, 2007.

Yuan, M. and Lin, Y. Model selection and estimation in the Gaussian graphical model. Biometrika, 2007.