Group Lasso with Overlap and Graph Lasso

Laurent Jacob                          laurent.jacob@mines-paristech.fr
Mines ParisTech - CBIO, INSERM U900, Institut Curie, 26 rue d'Ulm, Paris cedex 05, F-75248 France

Guillaume Obozinski (1)                gobo@stat.berkeley.edu
INRIA - Willow Project-Team, Ecole Normale Supérieure, 45 rue d'Ulm, Paris cedex 05, F-75230 France

Jean-Philippe Vert                     jean-philippe.vert@mines-paristech.fr
Mines ParisTech - CBIO, INSERM U900, Institut Curie, 26 rue d'Ulm, Paris cedex 05, F-75248 France

(1) This work was undertaken while Guillaume Obozinski was affiliated with UC Berkeley, Department of Statistics.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We propose a new penalty function which, when used as regularization for empirical risk minimization procedures, leads to sparse estimators. The support of the sparse vector is typically a union of potentially overlapping groups of covariates defined a priori, or a set of covariates which tend to be connected to each other when a graph of covariates is given. We study theoretical properties of the estimator and illustrate its behavior on simulated and breast cancer gene expression data.

1. Introduction

Estimation of sparse linear models by the minimization of an empirical error penalized by a regularization term is a very popular and successful approach in statistics and machine learning. By controlling the trade-off between data fitting and regularization, one can obtain estimators with good statistical properties, even in very high dimension. Moreover, sparse models lend themselves particularly well to interpretation, which is often of primary importance in applications such as biology or the social sciences. A popular example is the penalization of an l2 criterion by the l1 norm of the estimator, known as the lasso (Tibshirani, 1996) or basis pursuit (Chen et al., 1998). Interestingly, the lasso is able to recover the exact support of a sparse model from data generated by this model if the covariates are not too correlated (Zhao & Yu, 2006; Wainwright, 2006).

While the l1 penalty leads to sparse models, it does not encode any prior information about, e.g., possible groups of covariates that one may wish to see selected jointly. Several authors have recently proposed new penalties to enforce the estimation of models with specific sparsity patterns. For example, when the covariates are partitioned into groups, the group lasso leads to the selection of groups of covariates (Yuan & Lin, 2006). The group lasso penalty, also called the l1/l2 penalty, is the sum (i.e., l1 norm) of the l2 norms of the restrictions of the model to the different groups of covariates. It recovers the support of a model if this support is a union of groups and if covariates of different groups are not too correlated. It can be generalized to an infinite-dimensional setting (Bach, 2008). Other variants of the group lasso include joint selection of covariates for multi-task learning (Obozinski et al., 2009) and penalties enforcing hierarchical selection of covariates, e.g., when one has a hierarchy over the covariates and wants to select a covariate only if its ancestors in the hierarchy are also selected (Zhao et al., 2009; Bach, 2009).

In this paper we are interested in a more general situation.
We assume that either (i) groups of covariates are given, potentially with overlaps between the groups, and we wish to estimate a model whose support is a union of groups, or (ii) a graph with covariates as vertices is given, and we wish to estimate a model whose support contains covariates which tend to be connected to each other on the graph. Although quite general, this framework is motivated in particular by applications in bioinformatics, where one has to solve classification or regression problems with few samples in high dimension, such as predicting the class of a tumour from gene expression measurements with microarrays, while simultaneously selecting a few genes to establish a predictive signature (Roth, 2002). Selecting a few genes that either belong to the same functional groups, where the groups are given a priori and may overlap, or tend to be connected to each other in a given biological network, could then lead to increased interpretability of the signature and potentially to better performance (Rapaport et al., 2007). To reach this goal, we propose and study a new penalty which, for the first case, generalizes the l1/l2 norm to overlapping groups and, for the second case, casts the problem of selecting connected covariates in a graph as the problem of selecting a union of overlapping groups, for an adequate definition of the groups. We establish several properties of this penalty and provide conditions for the consistency of support estimation in the regression setting. Finally, we report promising results on both simulated and real data.

2. Problem and notations

For any vector w ∈ R^p, ||w|| denotes the Euclidean norm of w, and supp(w) ⊆ [1,p] denotes the support of w, i.e., the set of covariates i ∈ [1,p] such that w_i ≠ 0. A group of covariates is a subset g ⊆ [1,p]. The set of all possible groups is therefore P([1,p]), the power set of [1,p]. Throughout the paper, G ⊆ P([1,p]) denotes a set of groups, usually fixed in advance for each application. We say that two groups overlap if they have at least one covariate in common. For any vector w ∈ R^p and any group g ∈ G, we denote by w_g ∈ R^p the vector whose entries are the same as those of w for the covariates in g and are 0 for the other covariates. We use a different convention for the elements of V_G ⊆ (R^p)^G, the set of |G|-tuples of vectors v = (v^g)_{g∈G}: each v^g is this time a separate vector in R^p, which satisfies supp(v^g) ⊆ g for each g ∈ G. For any differentiable function f : R^p → R, we denote by ∇f(w) ∈ R^p the gradient of f at w ∈ R^p and by ∇_g f(w) ∈ R^{|g|} the partial gradient of f with respect to the covariates in g.

Figure 1. Balls of Ω_group^G(·) (left) and Ω_overlap^G(·) (right) for the groups G = {{1,2},{2,3}}, where w_2 is represented as the vertical coordinate.
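As a concrete illustration of the notation above, the following short Python/NumPy sketch (ours, not part of the original paper) builds the restriction w_g of a vector w to a group g and computes supp(w); the function names restrict_to_group and support are ours.

    import numpy as np

    def restrict_to_group(w, g):
        """Return w_g: the vector equal to w on the covariates in g and 0 elsewhere."""
        idx = list(g)
        wg = np.zeros_like(w)
        wg[idx] = w[idx]
        return wg

    def support(w, tol=1e-12):
        """Return supp(w): the indices i with w_i != 0 (up to a numerical tolerance)."""
        return set(np.flatnonzero(np.abs(w) > tol))

    w = np.array([1.0, -0.5, 0.0, 2.0])
    g = {0, 3}                        # a group of covariates (0-based indices)
    print(restrict_to_group(w, g))    # [1. 0. 0. 2.]
    print(support(w))                 # {0, 1, 3}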
3. Group lasso with overlapping groups

When the groups in G do not overlap, the group lasso penalty (Yuan & Lin, 2006) is defined as:

    for all w ∈ R^p,   Ω_group^G(w) = Σ_{g∈G} ||w_g|| .    (1)

When the groups in G form a partition of the set of covariates, Ω_group^G(·) is a norm whose balls have singularities when some w_g are equal to zero. Minimizing a smooth convex risk functional over such a ball often leads to a solution lying on a singularity, i.e., to a vector w such that w_g = 0 for some of the groups g in G. When some of the groups in G overlap, the penalty (1) is still a norm (provided all covariates belong to at least one group) whose ball has singularities when some w_g are equal to zero. Indeed, for a vector w, if we denote by G0 ⊆ G the set of groups such that w_g = 0, then

    supp(w) ⊆ ( ∪_{g∈G0} g )^c .

We see that this penalty induces the estimation of sparse vectors whose support is typically the complement of a union of groups. Although this may be relevant for some applications, with appropriately designed families of groups, as considered by Jenatton et al. (2009), we are interested in this paper in penalties which induce the opposite effect: that the support of w be a union of groups. For that purpose, we propose instead the following penalty:

    Ω_overlap^G(w) = inf_{ v ∈ V_G , Σ_{g∈G} v^g = w }  Σ_{g∈G} ||v^g|| .    (2)

When the groups do not overlap and form a partition of [1,p], there exists a unique decomposition of w ∈ R^p as w = Σ_{g∈G} v^g with supp(v^g) ⊆ g, namely v^g = w_g for all g ∈ G. In that case, penalties (1) and (2) coincide. If some groups overlap, we show below that this penalty induces the selection of vectors w that can be decomposed as w = Σ_{g∈G} v^g where some of the v^g are equal to 0. If we denote by G1 ⊆ G the set of groups g with v^g ≠ 0, then we immediately get w = Σ_{g∈G1} v^g, and therefore

    supp(w) ⊆ ∪_{g∈G1} g .

In other words, the penalty (2) leads to sparse solutions whose support is typically a union of groups, matching the setting of the applications that motivate this work. In the rest of this paper, we therefore investigate Ω_overlap^G(·) in more detail, both theoretically and empirically. Figure 1 shows the unit balls of both norms in R^3 for the groups G = {{1,2},{2,3}}. The pillow-shaped ball of Ω_group^G(·) has four singularities, corresponding to cases where only w_1 or only w_3 is non-zero. By contrast, Ω_overlap^G(·) has two circular sets of singularities, corresponding to cases where only (w_1, w_2) or only (w_2, w_3) is non-zero.
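To make the difference between (1) and (2) concrete, here is a small numerical sketch (our illustration, not from the paper) that evaluates both penalties on the toy groups G = {{1,2},{2,3}} of Figure 1. It assumes the cvxpy modeling package is available to solve the convex program (2); any convex solver could be substituted.

    import numpy as np
    import cvxpy as cp

    p = 3
    groups = [[0, 1], [1, 2]]          # G = {{1,2},{2,3}} with 0-based indices
    w = np.array([1.0, 1.0, 0.0])      # w is supported on the first group only

    # Group lasso penalty (1): sum of the norms of the restrictions w_g.
    omega_group = sum(np.linalg.norm(w[g]) for g in groups)

    # Overlap penalty (2): infimum over decompositions w = sum_g v^g with supp(v^g) in g.
    v = [cp.Variable(p) for _ in groups]
    constraints = [sum(v) == w]
    for vg, g in zip(v, groups):
        for i in range(p):
            if i not in g:
                constraints.append(vg[i] == 0)
    problem = cp.Problem(cp.Minimize(sum(cp.norm(vg, 2) for vg in v)), constraints)
    omega_overlap = problem.solve()

    print(omega_group, omega_overlap)

For this w, the overlap penalty equals ||w_{{1,2}}|| = sqrt(2), since w is supported on a single group, whereas the penalty (1) also pays for the overlapping group {2,3} and is strictly larger.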
4. Some properties of Ω_overlap^G(·)

We first analyze the decomposition of a vector w ∈ R^p as Σ_{g∈G} v^g induced by (2). For that purpose, let V(w) ⊆ V_G be the set of |G|-tuples of vectors v = (v^g)_{g∈G} which reach the minimum in (2), i.e., which satisfy

    w = Σ_{g∈G} v^g   and   Ω_overlap^G(w) = Σ_{g∈G} ||v^g|| .

The optimization problem (2) defining Ω_overlap^G(w) is a convex problem and its objective is coercive, so that the set of solutions V(w) is non-empty and convex. Moreover,

Lemma 1. w ↦ Ω_overlap^G(w) is a norm.

Proof. Positive homogeneity and positive definiteness hold trivially (recall that every covariate is assumed to belong to at least one group). We show the triangle inequality. Consider w, w' ∈ R^p, and let (v^g)_{g∈G} and (v'^g)_{g∈G} be optimal decompositions of w and w', respectively, so that Ω_overlap^G(w) = Σ_g ||v^g|| and Ω_overlap^G(w') = Σ_g ||v'^g||. Since (v^g + v'^g)_{g∈G} is an (a priori non-optimal) decomposition of w + w', we clearly have

    Ω_overlap^G(w + w') ≤ Σ_{g∈G} ||v^g + v'^g|| ≤ Σ_{g∈G} ( ||v^g|| + ||v'^g|| ) = Ω_overlap^G(w) + Ω_overlap^G(w') .

Using the conic dual of (2), we now give another formulation of the norm Ω_overlap^G(·), yielding some important properties.

Lemma 2.
1. It holds that

    Ω_overlap^G(w) = sup { α^T w : α ∈ R^p, ||α_g|| ≤ 1 for all g ∈ G } .    (3)

2. A vector α ∈ R^p is a solution of (3) if and only if there exists v = (v^g)_{g∈G} ∈ V(w) such that

    for all g ∈ G:   α_g = v^g / ||v^g||  if  v^g ≠ 0 ,   and   ||α_g|| ≤ 1  otherwise.    (4)

3. Conversely, a G-tuple of vectors v = (v^g)_{g∈G} ∈ V_G such that w = Σ_g v^g is a solution of (2) if and only if there exists a vector α ∈ R^p such that (4) holds.

Proof. Let us introduce slack variables t = (t_g)_{g∈G} ∈ R^G and rewrite the optimization problem (2) as

    min_{ t ∈ R^G , v ∈ V_G }  Σ_{g∈G} t_g   s.t.   Σ_{g∈G} v^g = w   and   ||v^g|| ≤ t_g for all g ∈ G .    (5)

We can form a Lagrangian for this problem with dual variables α ∈ R^p for the constraint Σ_{g∈G} v^g = w, and (β, γ) ∈ V_G × R^G with ||β^g|| ≤ γ_g for the conic constraints ||v^g|| ≤ t_g:

    L = Σ_{g∈G} t_g + α^T ( w − Σ_{g∈G} v^g ) − Σ_{g∈G} ( β^{gT} v^g + γ_g t_g ) .

The minimum of L with respect to the primal variables t and v is non-trivial only if γ_g = 1 and β^g = −α_g for every g ∈ G. Therefore, we get the dual function

    min_{t,v} L  =  α^T w   if γ_g = 1 and β^g = −α_g for all g ∈ G ,   and   −∞ otherwise.

By strong duality (since, e.g., Slater's condition is fulfilled), the optimal value Ω_overlap^G(w) of the primal is equal to the maximum of the dual problem. Maximizing this dual function over γ_g = 1, ||β^g|| ≤ γ_g and β^g = −α_g is equivalent to maximizing α^T w over the vectors α ∈ R^p such that ||α_g|| ≤ 1 for all g ∈ G, which proves (3).

To prove the second point, we note that the variables (t, v, α, β, γ) are primal/dual optimal for this convex optimization problem if and only if the Karush-Kuhn-Tucker (KKT) conditions are satisfied, i.e., if and only if, for all g ∈ G:

    supp(v^g) ⊆ g ,  ||v^g|| ≤ t_g         (primal feasibility)
    supp(β^g) ⊆ g ,  ||β^g|| ≤ γ_g         (dual feasibility)
    β^g = −α_g   and   γ_g = 1             (stationarity)
    β^{gT} v^g + γ_g t_g = 0               (complementary slackness)
    w = Σ_{g∈G} v^g .

Eliminating β and γ with the stationarity conditions, all conditions are fulfilled if and only if w = Σ_{g∈G} v^g and, for all g ∈ G, (i) either v^g = 0 and ||α_g|| ≤ 1, or (ii) v^g ≠ 0 and α_g = v^g / ||v^g||. If a pair (α, v) fulfills these conditions, then we obtain a primal/dual solution by taking t_g = ||v^g||, β^g = −α_g and γ_g = 1. This proves points 2 and 3.

Denote by G1 the group-support of w, i.e., the set of groups belonging to the support of at least one optimal decomposition of w,

    G1 = { g ∈ G : there exists v = (v^g)_{g∈G} ∈ V(w) with v^g ≠ 0 } ,

and by J1 the corresponding set of variables, J1 = ∪_{g∈G1} g.

Lemma 3. Let α be an optimum of the formulation (3) of the norm Ω_overlap^G(·); then α_{J1} is uniquely defined.

Proof. Consider any solution v = (v^g)_{g∈G} of (2) and any optimal solution α of (3). Since (v, α) form a primal/dual pair, they must satisfy the KKT conditions. In particular, for all g such that v^g ≠ 0, α_g is defined uniquely by α_g = v^g / ||v^g||. Since this is true for all solutions v ∈ V(w), α_{J1} is uniquely defined.

Corollary 1. For any v, v' ∈ V(w) and for any g ∈ G, either v^g = 0 or v'^g = 0, or there exists λ_g ≥ 0 such that v^g = λ_g v'^g.

Proof. If v^g ≠ 0 and v'^g ≠ 0, let α be a solution of (3); by the previous lemma α_g is unique and α_g = v^g / ||v^g|| = v'^g / ||v'^g||, so v^g and v'^g are positively proportional.

5. Using Ω_overlap^G(·) as a penalty

We now consider a learning scenario where we use Ω_overlap^G(w) as a regularization term for the minimization of an objective function R(w), typically an empirical risk. We assume that R(w) is convex and differentiable in w, and consider the optimization problem

    min_{w ∈ R^p}  R(w) + λ Ω_overlap^G(w) ,    (6)

where λ > 0 is a regularization parameter. We first derive optimality conditions for any solution of (6). For that purpose, let us denote by A_G(w) the set of vectors α ∈ R^p solutions of (3).

Lemma 4. A vector w ∈ R^p is a solution of (6) if and only if −∇R(w)/λ ∈ A_G(w).

Proof. The proof follows from the same Lagrangian-based derivation as for Lemma 2, adding only the loss term.

Remark 1. By point 2 of Lemma 2, an equivalent formulation is the following: a vector w ∈ R^p is a solution of (6) if and only if it can be decomposed as w = Σ_{g∈G} v^g where, for any g ∈ G, v^g ∈ R^p, supp(v^g) ⊆ g, and if v^g = 0 then ||∇_g R(w)|| ≤ λ, while ∇_g R(w) = −λ v^g / ||v^g|| otherwise.
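Remark 1 gives a condition that can be checked numerically for a candidate solution of (6). The sketch below is our illustration: it assumes that the gradient ∇R(w) and a candidate decomposition (v^g)_{g∈G} are available, and it tests the two cases of Remark 1 group by group.

    import numpy as np

    def check_optimality(grad, v_list, groups, lam, tol=1e-6):
        """grad: gradient of R at w (length p); v_list: candidate decomposition, one v^g per group.
        Returns True if the conditions of Remark 1 hold up to the tolerance tol."""
        for vg, g in zip(v_list, groups):
            idx = list(g)
            grad_g = grad[idx]
            norm_vg = np.linalg.norm(vg[idx])
            if norm_vg <= tol:
                # inactive group: the partial gradient must have norm at most lambda
                if np.linalg.norm(grad_g) > lam + tol:
                    return False
            else:
                # active group: the partial gradient must equal -lambda * v^g / ||v^g||
                target = -lam * vg[idx] / norm_vg
                if np.linalg.norm(grad_g - target) > tol:
                    return False
        return True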
6. Consistency

Before presenting a consistency result for Ω_overlap^G(·), we will need the following lemma.

Lemma 5. Assume that, for all w in a small neighborhood U of w̄, w admits a unique decomposition (v^g)_{g∈G} of minimal norm, supported by the same set of groups G1 as w̄. Writing μ_g = ||v^g||, there exist a neighborhood U0 of w̄_{J1} in R^{|J1|} and a neighborhood U0' of (α_{J1}(w̄), μ_{G1}(w̄)) in R^{|J1|} × R^{|G1|} such that there exists a unique continuous function φ : w_{J1} ↦ (α_{J1}(w), μ_{G1}(w)) from U0 to U0'.

Proof. The dual problem (3) is equivalent to the saddle-point problem

    min_α max_{μ} L(α, μ, w)   s.t.   μ_g ∈ R_+ ,

with Lagrangian

    L(α, μ, w) = −α^T w + Σ_{g∈G} (μ_g / 2) ( ||α_g||^2 − 1 ) ,

and KKT conditions:

    for all g ∈ G,  ||α_g||^2 ≤ 1                          (primal feasibility)
    for all g ∈ G,  μ_g ≥ 0                                (dual feasibility)
    for all i ∈ [1,p],  −w_i + Σ_{g ∋ i} μ_g α_i = 0        (stationarity)
    for all g ∈ G,  μ_g ( ||α_g||^2 − 1 ) = 0              (complementary slackness)

By stationarity, the tuple (v^g)_{g∈G} defined by v^g = μ_g α_g is a decomposition of w; it is optimal because it satisfies property 3 of Lemma 2; finally, we have μ_g = ||v^g||, consistently with our definition of μ_g(w). For any w with the same set of supporting groups G1, we have ||α_g(w)|| = 1 for all g ∈ G1 and μ_g = 0 for all g ∈ G\G1. For all w_{J1} with group-support no smaller than G1, the corresponding pair (α_{J1}(w), μ_{G1}(w)) is therefore a solution of the set of non-linear equations:

    for all i ∈ J1,  −w_i + Σ_{g∈G1, g ∋ i} μ_g α_i = 0 ,
    for all g ∈ G1,  ||α_g||^2 − 1 = 0 .    (7)

In other words, consider the function

    F : R^{|J1|} × ( R^{|J1|} × R^{|G1|} ) → R^{|J1|} × R^{|G1|} ,
    F( w_{J1}, (α_{J1}, μ_{G1}) ) = ( ( −w_i + Σ_{g∈G1, g ∋ i} μ_g α_i )_{i∈J1} , ( ||α_g||^2 − 1 )_{g∈G1} ) ;

then (7) is equivalent to F(w_{J1}, (α_{J1}, μ_{G1})) = 0. We use the implicit function theorem for non-differentiable functions of Kumagai (1980). The theorem states that, for a continuous function F : R^{|J1|} × (R^{|J1|} × R^{|G1|}) → R^{|J1|} × R^{|G1|} such that F(w_0, (α_0, μ_0)) = 0, if there exist open neighborhoods U and U' of w_0 and (α_0, μ_0), respectively, such that, for all w ∈ U, F(w, ·) : U' → R^{|J1|} × R^{|G1|} is locally one-to-one, then there exist open neighborhoods U0 and U0' of w_0 and (α_0, μ_0) such that, for all w ∈ U0, the equation F(w, (α, μ)) = 0 has a unique solution (α, μ) = φ(w) ∈ U0', where φ is a continuous function from U0 into U0'. By continuity of the addition, the product and the Euclidean norm, the function F defined above is continuous. For each fixed w, F(w, ·) is one-to-one, because of the assumption that the decomposition is unique in a neighborhood of w̄. Applying the theorem of Kumagai (1980) then yields the desired result.

We are now ready to prove the consistency of Ω_overlap^G(·). Consider the linear regression model

    Y = X w̄ + ε ,

where X ∈ R^{n×p} is a design matrix, Y ∈ R^n is the response vector, w̄ ∈ R^p is the true regression vector, and ε ∈ R^n is a vector of i.i.d. random variables with mean 0 and finite variance. We assume that:

(H1) Σ := (1/n) X^T X is positive definite;
(H2) there exists a neighborhood of w̄ in which (2) has a unique solution.

If G1 is the set of groups supporting the unique decomposition of w̄, we denote G2 = G \ G1 and J2 = [1,p] \ J1. For convenience, for any group of covariates g we write X_g for the n × |g| design matrix restricted to the predictors in g, and for any two groups g, g' we write Σ_{gg'} for the corresponding block of Σ. We can then provide a condition under which minimizing the least-squares error penalized by Ω_overlap^G(w) leads to an estimator with the correct support. Consider the two conditions:

    for all g ∈ G2 ,  || Σ_{gJ1} Σ_{J1J1}^{-1} α_{J1}(w̄) || ≤ 1     (C1)
    for all g ∈ G2 ,  || Σ_{gJ1} Σ_{J1J1}^{-1} α_{J1}(w̄) || < 1     (C2)

Lemma 6. Under assumptions (H1)-(H2), for λ_n → 0 and λ_n n^{1/2} → ∞, conditions (C1) and (C2) are respectively necessary and sufficient for the solution of (6) to estimate the group-support of w̄ consistently.

Proof. We follow the line of proof of Bach (2008), but consider a fixed design for simplicity of notation. Let us first consider the subproblem of estimating a vector only on the support of w̄, using only the groups in G1 in the penalty, i.e., consider ŵ_1 ∈ R^{J1} a solution of

    min_{ w_{J1} ∈ R^{J1} }  (1/2n) || Y − X_{J1} w_{J1} ||^2 + λ_n Ω_overlap^{G1}(w_{J1}) .

By standard arguments, we can prove that ŵ_1 converges in Euclidean norm to the restriction of w̄ to J1 as n tends to infinity (Fu & Knight, 2000). In the rest of the proof we show how to construct from ŵ_1 a vector ŵ ∈ R^p which, under condition (C2), is with high probability a solution of (6). By adding null components to ŵ_1, we obtain a vector ŵ ∈ R^p whose support is also contained in J1, and u = ŵ − w̄ therefore satisfies supp(u) ⊆ J1. A direct computation of the gradient of the risk R(w) = (1/2n)||Y − Xw||^2 gives ∇R(ŵ) = Σ u − W, where W = (1/n) X^T ε. From this we deduce that u_{J1} = Σ_{J1J1}^{-1} ( ∇_{J1}R(ŵ) + W_{J1} ), and since ∇_{J1}R(ŵ) = −λ_n α_{J1}(ŵ), we have

    ∇_{J2}R(ŵ) = Σ_{J2J1} Σ_{J1J1}^{-1} ( W_{J1} − λ_n α_{J1}(ŵ) ) − W_{J2} .

To show that ŵ is a solution of (6), it is enough to show that ||∇_g R(ŵ)|| ≤ λ_n for all g ∈ G2. Moreover, since the noise has bounded variance,

    Σ_{J2J1} Σ_{J1J1}^{-1} W_{J1} − W_{J2} = (1/n) X_{J2}^T ( X_{J1} Σ_{J1J1}^{-1} (1/n) X_{J1}^T − I ) ε

is of order O_p(n^{-1/2}), and

    (1/λ_n) ||∇_g R(ŵ)|| ≤ || Σ_{gJ1} Σ_{J1J1}^{-1} α_{J1}(ŵ) || + O_p( λ_n^{-1} n^{-1/2} ) .

By Lemma 5, α_{J1} is a continuous function of w in a neighborhood of w̄, so that ŵ_{J1} → w̄_{J1} in probability implies α_{J1}(ŵ) → α_{J1}(w̄) in probability. Since we chose λ_n such that λ_n^{-1} n^{-1/2} → 0, we have

    (1/λ_n) ||∇_g R(ŵ)|| ≤ || Σ_{gJ1} Σ_{J1J1}^{-1} α_{J1}(w̄) || + o_p(1) .

Hence the result for the sufficient condition. Symmetrically, for the necessary condition we have

    (1/λ_n) ||∇_g R(ŵ)|| ≥ || Σ_{gJ1} Σ_{J1J1}^{-1} α_{J1}(w̄) || − o_p(1) ,

which concludes the proof.
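The quantities appearing in conditions (C1)-(C2) can be computed directly from the design matrix once the dual vector α_{J1}(w̄) is known. The following sketch is ours; it assumes α_{J1}(w̄) has already been obtained, e.g., from an optimal decomposition of w̄ via Lemma 2, and returns the largest value of ||Σ_{gJ1} Σ_{J1J1}^{-1} α_{J1}(w̄)|| over g ∈ G2, which must be strictly below 1 for (C2) to hold.

    import numpy as np

    def c2_statistic(X, J1, groups_G2, alpha_J1):
        """X: n x p design matrix; J1: list of indices in the group-support;
        groups_G2: list of groups outside the group-support; alpha_J1: dual vector on J1.
        Returns max over g in G2 of ||Sigma_{g J1} Sigma_{J1 J1}^{-1} alpha_{J1}||."""
        n = X.shape[0]
        X1 = X[:, J1]
        Sigma_11 = X1.T @ X1 / n
        base = np.linalg.solve(Sigma_11, alpha_J1)   # Sigma_{J1 J1}^{-1} alpha_{J1}
        vals = []
        for g in groups_G2:
            Xg = X[:, list(g)]
            Sigma_g1 = Xg.T @ X1 / n
            vals.append(np.linalg.norm(Sigma_g1 @ base))
        return max(vals)

    # Usage sketch: c2_statistic(X, J1, groups_G2, alpha_J1) < 1 corresponds to (C2).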
7. Graph lasso

We now consider the situation where we are given a simple undirected graph (I, E), where the set of vertices I = [1, p] is the set of covariates and E ⊆ I × I is a set of edges that connect covariates. We suppose that we wish to estimate a sparse model such that the selected covariates tend to be connected to each other, i.e., form a limited number of connected components of the graph. An obvious approach is to use the penalty Ω_overlap^G(·), where G is a set of groups that generates the connected components by union. For example, we may take for G the set of edges, of cliques, or of small linear subgraphs. As an example, considering all edges, i.e., G = E, leads to

    Ω_graph(w) = min_{ v ∈ V_E }  Σ_{e∈E} ||v^e||   s.t.   Σ_{e∈E} v^e = w ,  supp(v^e) ⊆ e .

Alternatively, we will consider in the experiments the set of all linear subgraphs of length k ≥ 1. Although we have no formal statement on how to choose k, it intuitively controls the size of the groups of connected variables which are selected, and should therefore typically be chosen slightly smaller than the size of the smallest connected component expected in the support of the model.

8. Implementation

A simple way to implement empirical risk minimization with Ω_overlap^G(·) as the regularizer is to explicitly duplicate the variables of the design matrix, i.e., to replace X ∈ R^{n×p} by X̃ ∈ R^{n×Σ_g|g|}, defined as the concatenation of copies of the design matrix, each restricted to a certain group g: X̃ = [X_{g1}, X_{g2}, ..., X_{g|G|}], where G = {g1, ..., g|G|}. To see this, denote ṽ^g = (v^g_i)_{i∈g} and ṽ = (ṽ^{g1}, ..., ṽ^{g|G|}), and note that, for an empirical risk of the form R(w) = R̃(Xw), we can eliminate w from (6) to get R(w) = R̃( X (Σ_g v^g) ) = R̃( X̃ ṽ ), and thus for the full objective

    R̃( X̃ ṽ ) + λ Σ_{g∈G} ||ṽ^g|| .

That way the vector ṽ ∈ R^{Σ_g|g|} can be estimated directly from X̃ with a classical group lasso for non-overlapping groups. We implemented the approach of Meier et al. (2008) to estimate the group lasso in the expanded space. Note that Roth & Fischer (2008) provide a faster algorithm for the group lasso. When there are many groups with large overlaps, however, an alternative implementation without explicit data duplication, e.g., with a variational formulation similar to that of Rakotomamonjy et al. (2008), might be more scalable.
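The duplication construction described above is straightforward to implement. The sketch below (ours) builds the expanded design matrix X̃ = [X_{g1}, ..., X_{g|G|}] and folds an estimate of ṽ back into w = Σ_g v^g; the non-overlapping group-lasso solver itself (e.g., the algorithm of Meier et al., 2008) is only referenced as a placeholder.

    import numpy as np

    def expand_design(X, groups):
        """Return (X_tilde, blocks): the column-duplicated design matrix and, for each
        group, the column range of X_tilde corresponding to it together with the group."""
        cols, blocks, start = [], [], 0
        for g in groups:
            idx = list(g)
            cols.append(X[:, idx])
            blocks.append((start, start + len(idx), idx))
            start += len(idx)
        return np.hstack(cols), blocks

    def fold_back(v_tilde, blocks, p):
        """Map the expanded coefficient vector v_tilde back to w = sum_g v^g in R^p."""
        w = np.zeros(p)
        for start, stop, idx in blocks:
            w[idx] += v_tilde[start:stop]
        return w

    # Usage sketch (group_lasso_solver is a placeholder for any non-overlapping
    # group-lasso implementation):
    # X_tilde, blocks = expand_design(X, groups)
    # v_tilde = group_lasso_solver(X_tilde, Y, groups=[range(s, e) for s, e, _ in blocks])
    # w_hat = fold_back(v_tilde, blocks, X.shape[1])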
9. Experiments

9.1. Synthetic data: given overlapping groups

To assess the performance of our method when overlapping groups are given a priori, we simulated data with p = 82 variables, covered by 10 groups of 10 variables with 2 variables of overlap between two successive groups: {1,...,10}, {9,...,18}, ..., {73,...,82}. We chose the support of w̄ to be the union of groups 4 and 5 and sampled both the support weights and the offset from i.i.d. Gaussian variables. Note that in this setting the support can be expressed as a union of groups, but not as the complement of a union. Therefore, Ω_overlap^G(·) can recover the right support, whereas by construction Ω_group^G(·) using the same groups would be unable to recover it. The model is learned from n data points (x_i, y_i), with y_i = w̄^T x_i + b + ε_i, ε_i ~ N(0, σ^2), σ = |E(Xw̄ + b)|. Using an l2 loss R(w) = ||Y − Xw − b||^2, we learn models from 50 such training sets. In Figure 2, for each variable (on the vertical axis), we plot its frequency of selection in levels of gray as a function of the regularization parameter λ, both for the lasso penalty and for Ω_overlap^G(·).

Figure 2. Frequency of selection of each variable with the lasso (left) and Ω_overlap^G(·) (right) for n = 50 (top) and n = 100 (bottom).

For any choice of λ, the lasso frequently misses some variables of the support, while Ω_overlap^G(·) never misses any variable of the support on a large part of the regularization path. Besides, we observed that over the replicates the lasso never selected the exact correct pattern for n < 100. For n = 100, the right pattern was selected with low frequency on a small part of the regularization path. Ω_overlap^G(·), on the other hand, selected it up to 92% of the time for n = 50, and more than 99% of the time on more than one third of the path for n = 100. We repeated the same experiment for various n and, as long as n was too small for the lasso to recover the right support, the group regularization always helped.

Figure 3. Root mean squared error of the overlapping group lasso and of the lasso as a function of the number of training points.

Figure 3 shows the root mean squared error of both methods for various n. For both methods, the full regularization path is computed and tested on three replicates of n training and 100 test points. The best average parameter is selected and used to train and test a model on a fourth replicate. On a large range of n, Ω_overlap^G(·) not only helps to recover the right pattern, but also improves the regression performance. A possible explanation is that, if several variables of the support are correlated in the design matrix X, the lasso selects only one of them and is less robust than Ω_overlap^G(·), which uses all the variables. Note that when enough training points become available (last point of Figure 3), Figure 2 shows that the model selected by the lasso is generally better but still not correct, whereas Ω_overlap^G(·) selects the right model, even if it does not give a much lower error anymore.
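For completeness, here is one way to generate data following the protocol of Section 9.1 (our sketch; the paper does not fully specify the random number conventions, so the particular distributions and the empirical approximation of E(Xw̄ + b) below are assumptions).

    import numpy as np

    rng = np.random.default_rng(0)
    p = 82
    # 10 groups of 10 variables with 2 variables of overlap: {1..10}, {9..18}, ..., {73..82}
    groups = [list(range(8 * j, 8 * j + 10)) for j in range(10)]
    support = sorted(set(groups[3] + groups[4]))        # union of groups 4 and 5 (1-based numbering)

    w_true = np.zeros(p)
    w_true[support] = rng.standard_normal(len(support)) # support weights (assumed standard Gaussian)
    b_true = rng.standard_normal()                       # offset

    n = 50
    X = rng.standard_normal((n, p))
    signal = X @ w_true + b_true
    sigma = np.abs(signal.mean())                        # empirical stand-in for |E(Xw + b)|
    y = signal + sigma * rng.standard_normal(n)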
9.2. Synthetic data: given linear graph structure

We now consider that the prior given on the variables is a graph structure and that we are interested in solutions which form connected components of this graph. As a first simple illustration, we consider a chain: we use w̄ ∈ R^p, p = 100, supp(w̄) = [20, 40]. The nodes of the graph are the variables w_i and the edges are all the pairs (w_i, w_{i+1}), i = 1, ..., p−1. The model weights, the offset and the 50 training examples (x, y) are drawn using the same protocol as in the previous experiment. We take for the groups all the sub-chains of length k. We present the results for various choices of k and compare to the lasso (k = 1).

Figure 4. Variable selection frequency with Ω_overlap^G(·) using the chains of length k as groups, for k = 1, 2, 4, 8.

Figure 4 shows the selection frequency of each variable over 20 replications. Here again, using a group prior helps pattern recovery. We also observe, as expected, that the choice of k plays a role in the improvement.

9.3. Synthetic data: given non-linear graph structure

Here we consider the same setting as in the linear case, except that instead of a chain we are given a grid structure on the variables: each node is connected to the four nodes above, below, left and right of it. The support is a 20-variable region in the center of the grid (x-axis 4 to 7, y-axis 4 to 8). As groups, we use all the 4-cycles of the grid, which is a natural prior given the graph topology and the expected pattern.

Figure 5. Grid view of the variable selection frequencies in the graph setting. Left: lasso; right: Ω_overlap^G(·) using the 4-cycles as groups. n = 30 training points, λ is arbitrarily fixed.

Figure 5 shows the selection frequency of each variable for both methods at a fixed λ (chosen in both cases to give the best behavior). Ω_overlap^G(·) generally gives better selection performance than the lasso. Besides, we observed that on each run the variables incorrectly selected by our method were always unions of groups, whereas the lasso selected variables that are disconnected on the graph. We made the same observation in the linear graph case. This is an expected property of our method, and implies that even when variables which are not in the model are selected, they enter the model as large connected components, whereas the false positives of the lasso are more randomly distributed on the graph, often as isolated variables. This is an interesting property for real applications, because it may be easier to manually discard a few large connected components of false positives than many isolated variables (assuming of course that the right variables are selected as well).
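The graph-derived groups used in the two experiments above are easy to enumerate explicitly. The sketch below (ours) lists all sub-chains of length k of a chain graph and all 4-cycles of a rectangular grid; the 10 x 10 grid size is an assumption made for illustration.

    def chain_subgroups(p, k):
        """All sub-chains of length k of the chain 1 - 2 - ... - p (0-based indices)."""
        return [list(range(i, i + k)) for i in range(p - k + 1)]

    def grid_four_cycles(rows, cols):
        """All 4-cycles (unit squares) of a rows x cols grid, with node (r, c) -> r*cols + c."""
        cycles = []
        for r in range(rows - 1):
            for c in range(cols - 1):
                cycles.append([r * cols + c, r * cols + c + 1,
                               (r + 1) * cols + c, (r + 1) * cols + c + 1])
        return cycles

    # Example: the chain of Section 9.2 and a 10 x 10 grid for Section 9.3.
    groups_chain = chain_subgroups(100, 4)
    groups_grid = grid_four_cycles(10, 10)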
9.4. Breast cancer data: pathway analysis

An important motivation for our method is the possibility to perform gene selection from microarray data using overlapping groups as prior information. For example, one may want to analyse microarrays in terms of biologically meaningful gene sets. In most such analyses, genes discriminating the classes (e.g., tumors leading to metastasis versus non-metastasis) are selected in a first step, then enrichment analysis is performed by looking for gene sets in which the selected genes are over-represented (Subramanian et al., 2005). Several organizations of the genes into gene sets are available in various databases. We use the canonical pathways from MSigDB (Subramanian et al., 2005), containing 639 groups of genes, 637 of which involve genes from our study. We use the breast cancer dataset compiled by Van de Vijver et al. (2002), which consists of gene expression data for 8,141 genes in 295 breast cancer tumors (78 metastatic and 217 non-metastatic). We restrict the analysis to the 3,510 genes which are in at least one pathway. Since the dataset is very unbalanced, we balance it by using 3 replicates of each metastasis patient (keeping all replicates of a patient in the same fold during cross-validation). We estimate by 3-fold cross-validation the accuracy of a logistic regression with the l1 and Ω_overlap^G(·) penalties, using the pathways as groups. As a pre-processing, we keep the 300 genes most correlated with the output (on each training set). λ is selected by cross-validation on each training set.

Table 1. Classification error, number and proportion of pathways selected by the l1 and Ω_overlap^G(·) penalties on the 3 folds.

    Method            Error          Pathways        Prop. pathways
    l1                0.38 ± 0.04    148, 58, 183    0.32, 0.14, 0.41
    Ω_overlap^G(·)    0.36 ± 0.03    6, 5, 78        0.01, 0.01, 0.17

Table 1 shows the results of both methods. Using Ω_overlap^G(·) instead of the l1 penalty leads to a slight improvement in prediction performance and to much sparser solutions at the pathway level, which makes the selected model easier to interpret.

9.5. Breast cancer data: graph analysis

Another important application in microarray data analysis is the search for potential drug targets. In order to identify genes related to a disease, one would like to find groups of genes forming connected components of a graph carrying biological information, such as regulation, involvement in the same chain of metabolic reactions, or protein-protein interaction. Similarly to what is done in pathway analysis, Chuang et al. (2007) built a network by compiling several biological networks, and performed such a graph analysis by identifying discriminant subnetworks in one step and using these subnetworks to learn a classifier in a separate step. We use this network and the approach described in Section 7, taking all the edges of the network as the groups, on the breast cancer dataset. Here again, we restrict the data to the 7,910 genes which are present in the network and use the same correlation-based preprocessing as for the pathway analysis.

Table 2. Classification error and average size of the connected components selected by the l1 and Ω_overlap^G(·) penalties on the 3 folds.

    Method            Error          Av. size of connected components
    l1                0.39 ± 0.04    1.1, 1, 1.0
    Ω_overlap^G(·)    0.36 ± 0.01    1.3, 1.4, 1.2

Table 2 shows the results of the logistic regression with the l1 and Ω_overlap^G(·) penalties. Here again, both methods give similar performances, with a slight advantage for Ω_overlap^G(·). On the other hand, while the l1 penalty mostly selects variables that are disconnected on the graph, Ω_overlap^G(·) tends to select variables which are grouped into larger connected components of the graph. This would make the interpretation and the search for new drug targets easier.
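A possible implementation of the preprocessing shared by the two breast-cancer experiments is sketched below (our code, under the protocol described in the text): keep the 300 genes most correlated with the labels on each training fold, and triplicate the metastatic patients while keeping all copies of a patient in the same fold. Function names and defaults are ours.

    import numpy as np

    def top_correlated_genes(X, y, n_genes=300):
        """Indices of the n_genes columns of X most correlated (in absolute value) with y."""
        yc = y - y.mean()
        Xc = X - X.mean(axis=0)
        corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
        return np.argsort(corr)[::-1][:n_genes]

    def balanced_folds(y, n_replicates=3, n_folds=3, seed=0):
        """Assign each patient to a fold, then replicate positive (metastatic) patients;
        replicates inherit their patient's fold so that duplicates never cross folds."""
        rng = np.random.default_rng(seed)
        fold_of_patient = rng.integers(0, n_folds, size=len(y))
        idx, folds = [], []
        for i, label in enumerate(y):
            reps = n_replicates if label == 1 else 1
            idx.extend([i] * reps)
            folds.extend([fold_of_patient[i]] * reps)
        return np.array(idx), np.array(folds)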
10. Discussion

We have presented a generalization of the group lasso penalty which leads to sparse models with sparsity patterns that are unions of pre-defined groups of covariates or, given a graph of covariates, groups of covariates that are connected in the graph. We obtained promising results on both simulated and real data. From a theoretical point of view, we gave both sufficient and necessary conditions for the correct recovery of the same union of groups as in the decomposition induced by Ω_overlap^G(·) on the true parameter vector. It remains to characterize when the latter decomposition has the smallest number of groups, and the situation where several decompositions exist should be analyzed. The construction of an adaptive version of the group lasso with overlap, possibly generalizing the scheme proposed by Bach (2008), would also be of interest. From a practical point of view, although algorithms for the standard group lasso can be used to implement Ω_overlap^G(·), more dedicated and scalable algorithms could be designed for cases with large overlaps. Future work should also compare Ω_overlap^G(·) and Ω_group^G(·) more systematically, both empirically and theoretically.

Acknowledgments

This work was supported by ANR grant ANR-07-BLAN-0311 and the France-Berkeley fund. The authors thank Bin Yu and Michael Jordan for useful discussions.

References

Bach, F. (2008). Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res., 9, 1179-1225.

Bach, F. (2009). Exploring large feature spaces with hierarchical multiple kernel learning. Adv. Neural Inform. Process. Syst., 105-112.

Chen, S. S., Donoho, D. L., & Saunders, M. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20, 33-61.

Chuang, H.-Y., Lee, E., Liu, Y.-T., Lee, D., & Ideker, T. (2007). Network-based classification of breast cancer metastasis. Mol. Syst. Biol., 3, 140.

Fu, W., & Knight, K. (2000). Asymptotics for lasso-type estimators. Ann. Stat., 28, 1356-1378.

Jenatton, R., Audibert, J.-Y., & Bach, F. (2009). Structured variable selection with sparsity-inducing norms. Technical report, INRIA - Ecole Normale Supérieure de Paris.

Kumagai, S. (1980). An implicit function theorem: Comment. J. Optim. Theor. Appl., 31, 285-288.

Meier, L., van de Geer, S., & Bühlmann, P. (2008). The group lasso for logistic regression. J. Roy. Stat. Soc. B, 70, 53-71.

Obozinski, G., Taskar, B., & Jordan, M. (2009). Joint covariate selection and joint subspace selection for multiple classification problems. Stat. Comput. To appear.

Rakotomamonjy, A., Bach, F., Canu, S., & Grandvalet, Y. (2008). SimpleMKL. J. Mach. Learn. Res., 9, 2491-2521.

Rapaport, F., Zynoviev, A., Dutreix, M., Barillot, E., & Vert, J.-P. (2007). Classification of microarray data using gene networks. BMC Bioinformatics, 8, 35.

Roth, V. (2002). The generalized lasso: a wrapper approach to gene selection for microarray data. Proc. Conference on Automated Deduction 14, 252-255.

Roth, V., & Fischer, B. (2008). The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. Int. Conf. Mach. Learn., 848-855.

Subramanian, A., et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA, 102, 15545-15550.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B, 58, 267-288.

Van de Vijver, M. J., et al. (2002). A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med., 347, 1999-2009.

Wainwright, M. J. (2006). Sharp thresholds for high-dimensional and noisy recovery of sparsity. Technical Report 709, UC Berkeley, Department of Statistics.

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. Roy. Stat. Soc. B, 68, 49-67.

Zhao, P., Rocha, G., & Yu, B. (2009). Grouped and hierarchical model selection through composite absolute penalties. Ann. Stat. To appear.

Zhao, P., & Yu, B. (2006). On model selection consistency of lasso. J. Mach. Learn. Res., 7, 2541-2563.