Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞-regularization

Sahand Negahban
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
Berkeley, CA 94720-1770
sahand_n@eecs.berkeley.edu

Martin J. Wainwright
Department of Statistics, and Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
Berkeley, CA 94720-1770
wainwrig@eecs.berkeley.edu

Abstract

Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This set-up suggests the use of ℓ1/ℓ∞-regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the high-dimensional scaling of ℓ1/ℓ∞-regularized quadratic programming, considering both consistency rates in ℓ∞-norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the error as well as sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. These results show that the high-dimensional scaling of ℓ1/ℓ∞-regularization is qualitatively similar to that of ordinary ℓ1-regularization. Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the ℓ1/ℓ∞-regularized method undergoes a phase transition characterized by the rescaled sample size θ_{1,∞}(n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α)s)}. More precisely, for any δ > 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ_{1,∞} ≥ 1 + δ, and converges to 0 for scalings for which θ_{1,∞} ≤ 1 − δ. An implication of this threshold is that use of ℓ1,∞-regularization yields improved statistical efficiency if the overlap parameter is large enough (α > 2/3), but performs worse than a naive Lasso-based approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions and the actual behavior in simulations.

1 Introduction

The area of high-dimensional statistical inference is concerned with the behavior of models and algorithms in which the dimension p is comparable to, or possibly even larger than, the sample size n. In the absence of additional structure, it is well known that many standard procedures--among them linear regression and principal component analysis--are not consistent unless the ratio p/n converges to zero. Since this scaling precludes having p comparable to or larger than n, an active line of research is based on imposing structural conditions on the data--for instance, sparsity, manifold constraints, or graphical model structure--and then studying conditions under which various polynomial-time methods are either consistent, or conversely inconsistent.

This paper deals with high-dimensional scaling in the context of solving multiple regression problems, where the regression vectors are assumed to have shared sparse structure. More specifically, suppose that we are given a collection of r different linear regression models in p dimensions, with regression vectors β^i ∈ R^p, for i = 1, . . . , r. We let S(β^i) = {j | β^i_j ≠ 0} denote the support set of β^i.
In many applications--among them sparse approximation, graphical model selection, and image reconstruction--it is natural to impose a sparsity constraint, corresponding to restricting the cardinality |S(β^i)| of each support set. Moreover, one might expect some amount of overlap between the sets S(β^i) and S(β^j) for indices i ≠ j, since they correspond to the sets of active regression coefficients in each problem. For instance, consider the problem of image denoising or reconstruction, using wavelets or some other type of multiresolution basis. It is well known that natural images tend to have sparse representations in such bases. Moreover, similar images--say the same scene taken from multiple cameras--would be expected to share a similar subset of active features in the reconstruction. Similarly, in analyzing the genetic underpinnings of a given disease, one might have results from different subjects and/or experiments, meaning that the covariate realizations and regression vectors would differ in their numerical values, but one expects the same subsets of genes to be active in controlling the disease, which translates to a condition of shared support in the regression coefficients. Given these structural conditions of shared sparsity in these and other applications, it is reasonable to consider how this common structure can be exploited so as to increase the statistical efficiency of estimation procedures.

In this paper, we study the high-dimensional scaling of block ℓ1/ℓ∞ regularization. Our main contribution is to obtain some precise--and arguably surprising--insights into the benefits and dangers of using block ℓ1/ℓ∞ regularization, as compared to simpler ℓ1-regularization (a separate Lasso for each regression problem). We begin by providing a general set of sufficient conditions for consistent support recovery for both fixed design matrices and random Gaussian design matrices. In addition to these basic consistency results, we then seek to characterize rates, for the particular case of standard Gaussian designs, in a manner precise enough to address the following questions.

(a) First, under what structural assumptions on the data does the use of ℓ1/ℓ∞ block-regularization provide a quantifiable reduction in the scaling of the sample size n, as a function of the problem dimension p and other structural parameters, required for consistency?

(b) Second, are there any settings in which ℓ1/ℓ∞ block-regularization can be harmful relative to computationally less expensive procedures?

Answers to these questions yield useful insight into the tradeoff between computational and statistical efficiency. Indeed, the convex programs that arise from using block-regularization typically require a greater computational cost to solve. Accordingly, it is important to understand under what conditions this increased computational cost guarantees that fewer samples are required for achieving a fixed level of statistical accuracy. As a representative instance of our theory, consider the special case of standard Gaussian design matrices and two regression problems (r = 2), with the supports S(β^1) and S(β^2) each of size s and overlapping in a fraction α ∈ [0, 1] of their entries. For this problem, we prove that block ℓ1/ℓ∞ regularization undergoes a phase transition in terms of the rescaled sample size
θ_{1,∞}(n, p, s, α) := n / [(4 − 3α) s log(p − (2 − α)s)].        (1)

In words, for any δ > 0 and for scalings of the quadruple (n, p, s, α) such that θ_{1,∞} ≥ 1 + δ, the probability of successfully recovering both S(β^1) and S(β^2) converges to one, whereas for scalings such that θ_{1,∞} ≤ 1 − δ, the probability of success converges to zero. By comparison to previous theory on the behavior of the Lasso (ordinary ℓ1-regularized quadratic programming), the scaling (1) has two interesting implications. For the s-sparse regression problem with standard Gaussian designs, the Lasso has been shown [10] to undergo a phase transition as a function of the rescaled sample size

θ_Las(n, p, s) := n / [2 s log(p − s)],        (2)

so that solving two separate Lasso problems, one for each regression problem, would recover both supports for problem sequences (n, p, s) such that θ_Las > 1. Thus, one consequence of our analysis is to provide a precise confirmation of the natural intuition: if the data is well-aligned with the regularizer, then block-regularization increases statistical efficiency. On the other hand, our analysis also conveys a cautionary message: if the overlap is too small--more precisely, if α < 2/3--then block ℓ1,∞-regularization is actually harmful relative to the naive Lasso-based approach. This fact illustrates that some care is required in the application of block regularization schemes.

The remainder of this paper is organized as follows. In Section 2, we provide a precise description of the problem. Section 3 is devoted to the statement of our main results, some discussion of their consequences, and illustration by comparison to empirical simulations.

2 Problem set-up

We begin by setting up the problem to be studied in this paper, including the multivariate regression model and the family of block-regularized programs for estimating sparse vectors.

2.1 Multivariate regression

In this problem, we consider the following form of multivariate regression. For each i = 1, . . . , r, let β^i ∈ R^p be a regression vector, and consider the family of linear observation models

y^i = X^i β^i + w^i,   i = 1, 2, . . . , r.        (3)

Here each X^i ∈ R^{n×p} is a design matrix, possibly different for each vector β^i, and w^i ∈ R^n is a noise vector. We assume that the noise vectors w^i and w^j are independent for different regression problems i ≠ j. In this paper, we assume that each w^i has a multivariate Gaussian N(0, σ² I_{n×n}) distribution; however, we note that qualitatively similar results hold for any noise distribution with sub-Gaussian tails (see the book [1] for more background).

2.2 Block-regularization schemes

For compactness in notation, we frequently use B to denote the p × r matrix with β^i ∈ R^p as its ith column. Given a parameter q ∈ [1, ∞], we define the ℓ1/ℓq block-norm as follows:

‖B‖_{ℓ1/ℓq} := Σ_{k=1}^p ‖(β^1_k, β^2_k, . . . , β^r_k)‖_q,        (4)

corresponding to applying the ℓq norm to each row of B, and the ℓ1-norm across all of these blocks. We note that all of these block norms are special cases of the CAP family of penalties [12].

This family of block-regularizers (4) suggests a natural family of M-estimators for estimating B, based on solving the block-ℓ1/ℓq-regularized quadratic program

B̂ ∈ arg min_{B ∈ R^{p×r}} { (1/(2n)) Σ_{i=1}^r ‖y^i − X^i β^i‖²_2 + λ_n ‖B‖_{ℓ1/ℓq} },        (5)

where λ_n > 0 is a user-defined regularization parameter. Note that the data term is separable across the different regression problems i = 1, . . . , r, due to our assumption of independence on the noise vectors. Any coupling between the different regression problems is induced by the block-norm regularization.
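To make the estimator concrete, the following is a minimal sketch of the observation model (3) and the block ℓ1/ℓ∞ program (the q = ∞ instance of (5)), written with numpy and the cvxpy modeling package; the use of cvxpy, the problem sizes, and the particular choice of λ_n are illustrative assumptions rather than the paper's construction.

```python
# Sketch (not from the paper): simulate model (3) and solve the block l1/l_infty
# program (5) with q = infinity using cvxpy. Sizes and lambda_n are illustrative.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p, r, s, sigma = 100, 64, 2, 8, 0.5            # illustrative problem sizes

# True coefficient matrix B with a fully shared support of size s for simplicity.
support = rng.choice(p, size=s, replace=False)
B_star = np.zeros((p, r))
B_star[support, :] = rng.choice([-1.0, 1.0], size=(s, r))

# Observation model (3): one design matrix and one noisy response per problem.
X = [rng.standard_normal((n, p)) for _ in range(r)]
y = [X[i] @ B_star[:, i] + sigma * rng.standard_normal(n) for i in range(r)]

lam = 2 * sigma * np.sqrt(np.log(p) / n)           # illustrative choice of lambda_n
B = cp.Variable((p, r))
data_term = sum(cp.sum_squares(y[i] - X[i] @ B[:, i]) for i in range(r)) / (2 * n)
block_norm = cp.sum(cp.max(cp.abs(B), axis=1))     # ||B||_{l1/l_infty}: sum of row-wise maxima
cp.Problem(cp.Minimize(data_term + lam * block_norm)).solve()

B_hat = B.value
union_hat = np.nonzero(np.max(np.abs(B_hat), axis=1) > 1e-3)[0]
print("true union support:     ", np.sort(support))
print("estimated union support:", union_hat)
```

The penalty is assembled as the ℓ1 norm of the row-wise maxima of |B|, which matches definition (4) with q = ∞.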
In the special case of univariate regression (r = 1), the parameter q plays no role, and the block-regularized scheme (5) reduces to the Lasso [7, 3]. If q = 1 and r ≥ 2, the block-regularization function (like the data term) is separable across the different regression problems i = 1, . . . , r, and so the scheme (5) reduces to solving r separate Lasso problems. For r ≥ 2 and q = 2, the program (5) is frequently referred to as the group Lasso [11, 6]. Another important case [9, 8], and the focus of this paper, is block ℓ1/ℓ∞ regularization.

The motivation for using block ℓ1/ℓ∞ regularization is to encourage shared sparsity among the columns of the regression matrix B. Geometrically, like the ℓ1 norm that underlies the ordinary Lasso, the ℓ1/ℓ∞ block norm has a polyhedral unit ball. However, the block norm captures potential interactions between the columns β^i of the matrix B. Intuitively, taking the maximum encourages the elements (β^1_k, β^2_k, . . . , β^r_k) in any given row k = 1, . . . , p to be zero simultaneously, or to be non-zero simultaneously. Indeed, if β^i_k ≠ 0 for at least one i ∈ {1, . . . , r}, then there is no additional penalty in having β^j_k ≠ 0 as well, as long as |β^j_k| ≤ |β^i_k|.

2.3 Support recovery

For a given λ_n > 0, suppose that we solve the block ℓ1/ℓ∞ program, thereby obtaining an estimate

B̂ ∈ arg min_{B ∈ R^{p×r}} { (1/(2n)) Σ_{i=1}^r ‖y^i − X^i β^i‖²_2 + λ_n ‖B‖_{ℓ1/ℓ∞} }.        (6)

We note that under high-dimensional scaling (p ≫ n), this convex program (6) is not necessarily strictly convex, since the quadratic term is rank deficient and the block ℓ1/ℓ∞ norm is polyhedral. However, a consequence of our analysis is that, under appropriate conditions, the optimal solution B̂ is in fact unique.

In this paper, we study the accuracy of the estimate B̂ as a function of the sample size n, regression dimensions p and r, and the sparsity index s := max_{i=1,...,r} |S(β^i)|. There are various metrics with which to assess the "closeness" of the estimate B̂ to the truth B, including predictive risk, various types of norm-based bounds on the difference B̂ − B, and variable selection consistency. In this paper, we prove results on support recovery criteria. Recall that for each vector β^i ∈ R^p, we use S(β^i) = {k | β^i_k ≠ 0} to denote its support set. The problem of union support recovery corresponds to recovering the set

J := ∪_{i=1}^r S(β^i),        (7)

corresponding to the subset J ⊆ {1, . . . , p} of indices that are active in at least one regression problem. Note that the cardinality |J| is upper bounded by rs, but can be substantially smaller (as small as s) if there is overlap among the different supports.

In some results, we also study the more refined criterion of recovering the individual signed supports, meaning the signed quantities sign(β^i_k), where the sign function is given by

sign(t) = +1 if t > 0,  0 if t = 0,  −1 if t < 0.        (8)

There are multiple ways in which the support (or signed support) can be estimated, depending on whether we use primal or dual information from an optimal solution.

ℓ1/ℓ∞ primal recovery: Solve the block-regularized program (6), thereby obtaining a (primal) optimal solution B̂ ∈ R^{p×r}, and estimate the signed support vectors

[Ŝ_pri(β^i)]_k = sign(β̂^i_k).        (9)

ℓ1/ℓ∞ dual recovery: Solve the block-regularized program (6), thereby obtaining a primal solution B̂ ∈ R^{p×r}. For each row k = 1, . . . , p, compute the set M_k := arg max_{i=1,...,r} |β̂^i_k|. Estimate the signed support via

[Ŝ_dua(β^i)]_k = sign(β̂^i_k) if i ∈ M_k, and 0 otherwise.        (10)

As our development will clarify, this procedure corresponds to estimating the signed support on the basis of a dual optimal solution associated with the optimal primal solution.
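As a concrete illustration of the primal estimator (9) and the dual estimator (10), the following numpy sketch (not from the paper) extracts both signed-support estimates from a solved matrix B̂; the numerical tolerance tol, used to decide which entries are treated as zero and which attain the row maximum, is an illustrative assumption.

```python
# Sketch (not from the paper): primal (9) and dual (10) signed-support estimates
# from an estimate B_hat in R^{p x r}. The tolerance tol is an illustrative stand-in
# for exact zeros and exact ties in the solution of the convex program.
import numpy as np

def primal_signed_support(B_hat, tol=1e-6):
    """[S_pri(beta^i)]_k = sign(beta_hat^i_k), with tiny entries treated as zero."""
    S = np.sign(B_hat)
    S[np.abs(B_hat) <= tol] = 0
    return S.astype(int)

def dual_signed_support(B_hat, tol=1e-6):
    """Keep sign(beta_hat^i_k) only when |beta_hat^i_k| attains the row maximum (the set M_k)."""
    S = primal_signed_support(B_hat, tol)
    row_max = np.max(np.abs(B_hat), axis=1, keepdims=True)
    in_Mk = np.abs(B_hat) >= row_max - tol     # indices achieving the (possibly tied) maximum
    S[~in_Mk] = 0
    return S

# Example: on a shared row the dual estimate keeps both entries only if they tie in magnitude.
B_hat = np.array([[0.9, -0.9], [0.7, 0.2], [0.0, 0.0]])
print(primal_signed_support(B_hat))   # [[ 1 -1] [ 1  1] [ 0  0]]
print(dual_signed_support(B_hat))     # [[ 1 -1] [ 1  0] [ 0  0]]
```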
2.4 Notational conventions

Throughout this paper, we use the index i ∈ {1, . . . , r} as a superscript indexing the different regression problems, or equivalently the columns of the matrix B ∈ R^{p×r}. Given a design matrix X ∈ R^{n×p} and a subset S ⊆ {1, . . . , p}, we use X_S to denote the n × |S| sub-matrix obtained by extracting those columns indexed by S. For a pair of matrices A ∈ R^{m×ℓ} and B ∈ R^{m×n}, we use the notation ⟨A, B⟩ := A^T B for the resulting ℓ × n matrix. We use the following standard asymptotic notation: for functions f, g, the notation f(n) = O(g(n)) means that there exists a fixed constant 0 < C < +∞ such that f(n) ≤ C g(n); the notation f(n) = Ω(g(n)) means that f(n) ≥ C g(n); and f(n) = Θ(g(n)) means that f(n) = O(g(n)) and f(n) = Ω(g(n)).

3 Main results and their consequences

In this section, we provide precise statements of the main results of this paper. Our first main result (Theorem 1) provides sufficient conditions for deterministic design matrices X^1, . . . , X^r. This result allows for an arbitrary number r of regression problems. Not surprisingly, these results show that the high-dimensional scaling of block ℓ1/ℓ∞ regularization is qualitatively similar to that of ordinary ℓ1-regularization: for instance, in the case of random Gaussian designs and bounded r, our sufficient conditions in [5] ensure that n = Ω(s log p) samples are sufficient to recover the union of supports correctly with high probability, which matches known results on the Lasso [10]. As discussed in the introduction, we are also interested in the more refined question: can we provide necessary and sufficient conditions that are sharp enough to reveal quantitative differences between ordinary ℓ1-regularization and block regularization? In order to provide precise answers to this question, our final two results concern the special case of r = 2 regression problems, both with supports of size s that overlap in a fraction α of their entries, and with design matrices drawn randomly from the standard Gaussian ensemble. In this setting, our final two results (Theorems 2 and 3) show that block ℓ1/ℓ∞ regularization undergoes a phase transition specified by the rescaled sample size. We then discuss some consequences of these results, and illustrate their sharpness with some simulation results.

3.1 Sufficient conditions for deterministic designs

In addition to the sample size n, problem dimensions p and r, and sparsity index s, our results are stated in terms of the minimum eigenvalue C_min of the |J| × |J| matrices (1/n)⟨X^i_J, X^i_J⟩--that is, λ_min((1/n)⟨X^i_J, X^i_J⟩) ≥ C_min for all i = 1, . . . , r--as well as the ℓ∞-operator norm of their inverses: |||((1/n)⟨X^i_J, X^i_J⟩)^{-1}|||_∞ ≤ D_max for all i = 1, . . . , r. It is natural to think of these quantities as being constants (independent of p and s), although our results do allow them to scale. We assume that the columns of each design matrix X^i, i = 1, . . . , r, are normalized so that

‖X^i_k‖²_2 ≤ 2n   for all k = 1, 2, . . . , p.        (11)

The choice of the factor 2 in this bound is for later technical convenience.
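The design-dependent quantities above are straightforward to evaluate numerically; the following numpy sketch (not from the paper) computes C_min, D_max, and the column bound (11) for illustrative design matrices and a hypothetical union support J.

```python
# Sketch (not from the paper): compute the quantities entering Theorem 1 for given
# designs X^1,...,X^r and a candidate union support J. The designs, sizes, and J
# below are illustrative assumptions.
import numpy as np

def theorem1_design_quantities(X_list, J):
    """Return (C_min, D_max, column_bound_ok) for the given designs and support J."""
    n = X_list[0].shape[0]
    C_min, D_max, col_ok = np.inf, 0.0, True
    for X in X_list:
        G = X[:, J].T @ X[:, J] / n                              # (1/n) <X^i_J, X^i_J>
        C_min = min(C_min, np.linalg.eigvalsh(G).min())          # smallest eigenvalue over i
        D_max = max(D_max, np.abs(np.linalg.inv(G)).sum(axis=1).max())  # l_infty-operator norm of inverse
        col_ok = col_ok and bool((np.sum(X ** 2, axis=0) <= 2 * n).all())  # column bound (11)
    return C_min, D_max, col_ok

rng = np.random.default_rng(1)
n, p, s = 200, 50, 5
X_list = [rng.standard_normal((n, p)) for _ in range(2)]
print(theorem1_design_quantities(X_list, J=np.arange(s)))
```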
We also require that the following incoherence condition on the design matrices is satisfied: there exists some γ ∈ (0, 1] such that

max_{k ∈ J^c} Σ_{i=1}^r ‖⟨X^i_k, X^i_J⟩ (⟨X^i_J, X^i_J⟩)^{-1}‖_1 ≤ (1 − γ).        (12)

For a parameter ξ > 1 (to be chosen by the user), we define the probability

φ_1(ξ, p, s) := 1 − 2 exp(−(ξ − 1)[r + log p]) − 2 exp(−(ξ² − 1) log(rs)),        (13)

which specifies the precise rate with which the "high probability" statements in Theorem 1 hold.

Theorem 1. Consider the observation model (3) with design matrices X^i satisfying the column bound (11) and incoherence condition (12). Suppose that we solve the block-regularized ℓ1/ℓ∞ convex program (6) with regularization parameter λ²_n ≥ 4σ²ξ²(r + log p)/n for some ξ > 1. Then with probability greater than φ_1(ξ, p, s) → 1, we are guaranteed that:

(a) The block-regularized program has a unique solution B̂ such that ∪_{i=1}^r S(β̂^i) ⊆ J, and it satisfies the elementwise bound

max_{i=1,...,r} max_{k=1,...,p} |β̂^i_k − β^i_k| ≤ √( 4σ² log(rs) / (C_min n) ) + D_max λ_n =: b_1(σ, λ_n, n, s).        (14)

(b) If in addition the support minimum value B_min := min_{i=1,...,r} min_{k∈S(β^i)} |β^i_k| satisfies B_min ≥ 2 b_1(σ, λ_n, n, s), then ∪_{i=1}^r S(β̂^i) = J, so that the solution B̂ correctly specifies the union of supports J. Moreover, using the primal signed support recovery method (9), we have

[Ŝ_pri(β^i)]_k = sign(β^i_k)   for all k ∈ S(β^i).        (15)

Remarks: To clarify the scope of the claims, part (b) guarantees that the estimator recovers the union support J correctly, and that for any given i = 1, . . . , r and k ∈ S(β^i), the sign sign(β̂^i_k) is correct. Note that we are guaranteed that β̂^i_k = 0 for all k ∉ J. However, within the union support J, when using the primal recovery method, it is possible to have false non-zeros--that is, there may be an index k ∈ J \ S(β^i) such that β̂^i_k ≠ 0. Of course, this cannot occur if the support sets S(β^i) are all equal. This phenomenon is related to geometric properties of the block ℓ1/ℓ∞ norm: in particular, for any given index k, when β̂^j_k ≠ 0 for some j ∈ {1, . . . , r}, then there is no further penalty to having β̂^i_k ≠ 0 for other column indices i ≠ j.

The dual signed support recovery method (10) is more conservative in estimating the individual support sets. In particular, for any given i ∈ {1, . . . , r}, it only allows an index k to enter the signed support estimate Ŝ_dua(β^i) when |β̂^i_k| achieves the maximum magnitude (possibly non-unique) across all indices i = 1, . . . , r. Consequently, Theorem 1 guarantees that the dual signed support method will never incorrectly include an index in the individual supports. However, it may incorrectly exclude indices of some supports; like the primal support estimator, though, it is always guaranteed to correctly recover the union of supports J. We note that it is possible to ensure, under some conditions, that the dual support method will correctly recover each of the individual signed supports, without any incorrect exclusions. However, as illustrated by Theorem 2, doing so requires additional assumptions on the size of the gap |β^i_k| − |β^j_k| for indices k ∈ S(β^i) ∩ S(β^j).

Our results thus far show that under standard mutual incoherence or irrepresentability conditions, the block ℓ1/ℓ∞ method produces consistent estimators for n = Ω(s log(p − s)). In qualitative terms, these results match known scaling for the Lasso, or ordinary ℓ1-regularization.
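Before turning to the sharper Gaussian analysis, here is a minimal numpy sketch (not from the paper) that numerically evaluates the left-hand side of the incoherence condition (12) for given design matrices and a candidate union support J; the random designs and the support used below are illustrative assumptions.

```python
# Sketch (not from the paper): evaluate the incoherence quantity on the left-hand
# side of condition (12). The (1/n) normalizations cancel inside the product, so the
# unnormalized Gram matrices can be used directly.
import numpy as np

def incoherence_value(X_list, J):
    p = X_list[0].shape[1]
    Jc = np.setdiff1d(np.arange(p), J)
    total = np.zeros(len(Jc))
    for X in X_list:
        XJ, XJc = X[:, J], X[:, Jc]
        A = XJc.T @ XJ @ np.linalg.inv(XJ.T @ XJ)   # <X^i_k, X^i_J> (<X^i_J, X^i_J>)^{-1}, one row per k in J^c
        total += np.abs(A).sum(axis=1)              # l1 norm of each row, summed over i = 1,...,r
    return total.max()                              # condition (12) asks this to be at most 1 - gamma

rng = np.random.default_rng(2)
n, p, s = 400, 60, 6
X_list = [rng.standard_normal((n, p)) for _ in range(2)]
print("incoherence value:", incoherence_value(X_list, J=np.arange(s)))
```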
3.2 Sharp results for standard Gaussian ensembles

In order to provide keener insight into the (dis)advantages associated with using ℓ1/ℓ∞ block regularization, we specialize the remainder of our analysis to the case of r = 2 regression problems, where the corresponding design matrices X^i, i = 1, 2, are sampled from the standard Gaussian ensemble [2, 4]--i.e., with i.i.d. rows N(0, I_{p×p}). Our goal in studying this special case is to be able to make quantitative comparisons with the Lasso. We consider a sequence of models indexed by the triplet (p, s, α), corresponding to the problem dimension p, support sizes s, and overlap parameter α ∈ [0, 1]. We assume that s ≤ p/2, capturing the intuition of a (relatively) sparse model. Suppose that for a given model, we take n = n(p, s, α) observations according to equation (3). We can then study the probability of successful recovery as a function of the model triplet and the sample size n. In order to state our main results, we define the order parameter, or rescaled sample size, θ_{1,∞}(n, p, s, α) := n/[(4 − 3α)s log(p − (2 − α)s)]. Letting V := S(β^1) ∩ S(β^2), we also define the support minimum value B_min := min_{i=1,2} min_{k∈V} |β^i_k|, and the gap vector B_gap with entries (B_gap)_k := |β^1_k| − |β^2_k|.

3.2.1 Sufficient conditions

We begin with a result that provides sufficient conditions for support recovery using block ℓ1/ℓ∞ regularization.

Theorem 2 (Achievability). Consider the observation model (3) with random designs X^i drawn with i.i.d. standard Gaussian entries, and consider problem sequences (n, p, s, α) for which θ_{1,∞}(n, p, s, α) > 1 + δ for some δ > 0. If we solve the block-regularized program (6) with λ_n = √(log p / n), then with probability greater than 1 − c_1 exp(−c_2 log(p − (2 − α)s)), the following properties hold:

(i) The block ℓ1,∞-program (6) has a unique solution (β̂^1, β̂^2), with supports S(β̂^1) ⊆ J and S(β̂^2) ⊆ J. Moreover, we have the elementwise bound

max_{i=1,2} max_{k=1,...,p} |β̂^i_k − β^i_k| ≤ √( 100 σ² log(s) / n ) + λ_n (√(4s/n) + 1) =: b_3(σ, λ_n, n, s).        (16)

(ii) If the support minimum B_min > b_3(σ, λ_n, n, s), then the primal support method correctly recovers the support union J = S(β^1) ∪ S(β^2).

(iii) If (1/λ_n) ‖B_gap‖_∞ → 0, then the dual support method correctly recovers the individual signed supports.

3.2.2 Necessary conditions

We now turn to the question of finding matching necessary conditions for support recovery.

Theorem 3 (Lower bounds). Consider the observation model (3) with random designs X^i drawn with i.i.d. standard Gaussian entries.

(a) For problem sequences (n, p, s, α) such that θ_{1,∞}(n, p, s, α) < 1 − δ for some δ > 0, and for any non-increasing regularization sequence λ_n > 0, no solution B̂ = (β̂^1, β̂^2) to the block-regularized program (6) has the correct support union S(β^1) ∪ S(β^2).

(b) Recalling the definition of B_gap, define the rescaled gap limit c(λ_n, B_gap) := lim sup_{(n,p,s)} ‖B_gap‖_2/(λ_n s). If the sample size n is bounded as n < (1 − δ)[(4 − 3α) + c(λ_n, B_gap)] s log[p − (2 − α)s] for some δ > 0, then the dual recovery method (10) fails to recover the individual signed supports.

In summary, Theorems 2 and 3(a) show that the block-regularized program (6) exhibits a sharp threshold for recovering the joint support J = S(β^1) ∪ S(β^2), characterized by the order parameter θ_{1,∞}. Theorem 3(b) shows that unless ‖B_gap‖_2/(λ_n s) vanishes as (n, p, s) increase, the sample size required for recovering the individual supports is even larger.
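To make the comparison concrete, here is a small Python sketch of the two order parameters in equations (1) and (2) together with the sample sizes at which they cross the threshold value 1; the specific values of p, s, and α below are illustrative assumptions, and the sketch involves no simulation, only the algebra of the order parameters.

```python
# Sketch (not from the paper): rescaled sample sizes (1) and (2) and the predicted
# minimal sample sizes for support-union recovery. Problem sizes are illustrative.
import numpy as np

def theta_block(n, p, s, alpha):
    """theta_{1,infty}(n, p, s, alpha) = n / [(4 - 3*alpha) * s * log(p - (2 - alpha)*s)]."""
    return n / ((4 - 3 * alpha) * s * np.log(p - (2 - alpha) * s))

def theta_lasso(n, p, s):
    """theta_Las(n, p, s) = n / [2 * s * log(p - s)]."""
    return n / (2 * s * np.log(p - s))

def predicted_sample_sizes(p, s, alpha):
    """Sample sizes at which the two order parameters equal the threshold value 1."""
    n_block = (4 - 3 * alpha) * s * np.log(p - (2 - alpha) * s)
    n_lasso = 2 * s * np.log(p - s)
    return n_block, n_lasso

for alpha in (1.0, 0.7, 0.4, 0.1):
    n_block, n_lasso = predicted_sample_sizes(p=512, s=51, alpha=alpha)
    print(f"alpha={alpha:.1f}: block threshold ~{n_block:.0f}, separate-Lasso threshold ~{n_lasso:.0f}")
```

For large overlap the block threshold falls below the separate-Lasso threshold, and the ordering reverses once the overlap drops below 2/3, in line with the discussion above.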
3.3 Illustrative simulations and some consequences

In this section, we provide simulations that illustrate the phase transitions predicted by Theorems 2 and 3, and show that the theory provides an accurate description of practice even for relatively small problem sizes (e.g., p = 128). Figure 1 plots the probability of successful recovery of the individual signed supports using dual support recovery (10)--namely, P[Ŝ_dua(β^i) = S_±(β^i) for i = 1, 2]--versus the order parameter θ_{1,∞}(n, p, s, α). The plot contains four sets of "stacked" curves, each corresponding to a different choice of the overlap parameter α, ranging from α = 1 (left-most stack) to α = 0.1 (right-most stack). Each stack contains three curves, corresponding to the problem sizes p ∈ {128, 256, 512}. In all cases, we fixed the support size s = 0.1p. The stacking behavior of these curves demonstrates that we have isolated the correct order parameter, and the step-function behavior is consistent with the theoretical predictions of a sharp threshold.

[Figure 1: plot of probability of success versus the control parameter for the ℓ1,∞ relaxation with s = 0.1p and α ∈ {1, 0.7, 0.4, 0.1}, with curves for p ∈ {128, 256, 512}.]

Figure 1. Probability of success in recovering the joint signed supports, plotted against the control parameter n/[2s log(p − (2 − α)s)] for linear sparsity s = 0.1p. Each stack of curves corresponds to a fixed overlap α, as labeled on the figure. The three curves within each stack correspond to problem sizes p ∈ {128, 256, 512}; note how they align with each other and exhibit step-like behavior, consistent with Theorems 2 and 3. The vertical lines correspond to the thresholds predicted by Theorems 2 and 3; note the close agreement between theory and simulation.

Theorems 2 and 3 have some interesting consequences, particularly in comparison to the behavior of the "naive" Lasso-based individual decoding of signed supports--that is, the method that simply applies the Lasso (ordinary ℓ1-regularization) to each column β^i, i = 1, 2, separately. By known results [10] on the Lasso, the performance of this naive approach is governed by the order parameter

θ_Las(n, p, s) = n / [2s log(p − s)],        (17)

meaning that for any δ > 0, it succeeds for sequences such that θ_Las > 1 + δ, and conversely fails for sequences such that θ_Las < 1 − δ. To compare the two methods, we define the relative efficiency coefficient R(θ_{1,∞}, θ_Las) := θ_Las(n, p, s)/θ_{1,∞}(n, p, s, α). A value of R < 1 implies that the block method is more efficient, while R > 1 implies that the naive method is more efficient. With this notation, we have the following:

Corollary 1. The relative efficiency of the block ℓ1,∞ program (6) compared to the Lasso is given by R(θ_{1,∞}, θ_Las) = (4 − 3α) log(p − (2 − α)s) / (2 log(p − s)). Thus, for sublinear sparsity s/p → 0, the block scheme has greater statistical efficiency for all overlaps α ∈ (2/3, 1], but lower statistical efficiency for overlaps α ∈ [0, 2/3).

Remarks: We have demonstrated necessary and sufficient conditions under which ℓ1,∞-regularization is able to perform successful recovery in two regression problems. The results of Theorems 2 and 3 show that the statistical performance of solving (6) is sensitive to the support overlap of the vectors as well as to the difference between the coefficients on the joint support sets. Thus, depending on the structure of the problem, there will be statistical gains or even losses when using ℓ1,∞-regularization for solving the joint regression problem, compared to using the Lasso separately on each regression problem.
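As a closing illustration of Corollary 1, the following sketch (not from the paper) evaluates the relative efficiency coefficient R for a few overlap values; the problem sizes are illustrative assumptions, chosen in the sublinear-sparsity regime where R is approximately (4 − 3α)/2 and crosses 1 at α = 2/3.

```python
# Sketch (not from the paper): relative efficiency coefficient of Corollary 1.
# R < 1 favors the block l1/l_infty method; R > 1 favors separate Lassos.
import numpy as np

def relative_efficiency(p, s, alpha):
    return (4 - 3 * alpha) * np.log(p - (2 - alpha) * s) / (2 * np.log(p - s))

p, s = 10_000, 100                       # illustrative sublinear-sparsity regime
for alpha in (0.0, 0.5, 2 / 3, 0.8, 1.0):
    print(f"alpha={alpha:.2f}: R={relative_efficiency(p, s, alpha):.3f}")
```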
References

[1] V. V. Buldygin and Y. V. Kozachenko. Metric Characterization of Random Variables and Random Processes. American Mathematical Society, Providence, RI, 2000.

[2] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 2006.

[3] S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Computing, 20(1):33–61, 1998.

[4] D. L. Donoho and J. M. Tanner. Counting faces of randomly-projected polytopes when the projection radically lowers dimension. Technical report, Stanford University, 2006. Submitted to Journal of the AMS.

[5] S. Negahban and M. J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞-regularization. Technical report, Department of Statistics, UC Berkeley, November 2008.

[6] G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection for grouped classification. Technical report, Statistics Department, UC Berkeley, 2007.

[7] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[8] J. A. Tropp, A. C. Gilbert, and M. J. Strauss. Algorithms for simultaneous sparse approximation. Signal Processing, 86:572–602, April 2006. Special issue on "Sparse approximations in signal and image processing".

[9] B. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 27:349–363, 2005.

[10] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity using ℓ1-constrained quadratic programs. Technical Report 709, Department of Statistics, UC Berkeley, 2006.

[11] Y. Kim, J. Kim, and Y. Kim. Blockwise sparse regression. Statistica Sinica, 16(2), 2006.

[12] P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Technical report, Statistics Department, UC Berkeley, 2007.