Gradient Descent with Sparsification: An Iterative Algorithm for Sparse Recovery with Restricted Isometry Property

Rahul Garg (grahul@us.ibm.com)
Rohit Khandekar (rohitk@us.ibm.com)
IBM T. J. Watson Research Center, 1101 Kitchawan Road, Route 134, Yorktown Heights, NY 10598

(Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).)

Abstract

We present an algorithm for finding an s-sparse vector x that minimizes the square error ||y - Φx||^2, where Φ satisfies the restricted isometry property (RIP) with isometry constant δ_2s < 1/3. Our algorithm, called GraDeS (Gradient Descent with Sparsification), iteratively updates x as

    x ← H_s( x + (1/γ) Φ^T (y - Φx) )

where γ > 1 and H_s sets all but the s largest-magnitude coordinates to zero. GraDeS converges to the correct solution in a constant number of iterations. The condition δ_2s < 1/3 is the most general for which a near-linear-time algorithm is known. In comparison, the best condition under which a polynomial-time algorithm is known is δ_2s < √2 - 1. Our Matlab implementation of GraDeS outperforms previously proposed algorithms like Subspace Pursuit, StOMP, OMP, and Lasso by an order of magnitude. Curiously, our experiments also uncovered cases where L1-regularized regression (Lasso) fails but GraDeS finds the correct solution.

1. Introduction

Finding a sparse solution to a system of linear equations has been an important problem in multiple domains such as model selection in statistics and machine learning (Golub & Loan, 1996; Efron et al., 2004; Wainwright et al., 2006; Ranzato et al., 2007), sparse principal component analysis (Zou et al., 2006), image deconvolution and de-noising (Figueiredo & Nowak, 2005) and compressed sensing (Candès & Wakin, 2008). The recent results in the area of compressed sensing, especially those relating to the properties of random matrices (Candès & Tao, 2006; Candès et al., 2006), have exploded the interest in this area, which is finding applications in diverse domains such as coding and information theory, signal processing, artificial intelligence, and imaging. Due to these developments, efficient algorithms to find sparse solutions are becoming increasingly important.

Consider a system of linear equations of the form

    y = Φx    (1)

where y ∈ R^m is an m-dimensional vector of "measurements", x ∈ R^n is the unknown signal to be reconstructed, and Φ ∈ R^{m×n} is the measurement matrix. The signal x is represented in a suitable (possibly over-complete) basis and is assumed to be "s-sparse" (i.e., at most s out of the n components of x are non-zero). The sparse reconstruction problem is

    min_{x̃ ∈ R^n} ||x̃||_0   subject to   y = Φx̃    (2)

where ||x̃||_0 denotes the number of non-zero entries in x̃. This problem is not only NP-hard (Natarajan, 1995), but also hard to approximate within a factor O(2^{log^{1-ε} m}) of the optimal solution (Neylon, 2006).

RIP and its implications. The above problem becomes computationally tractable if the matrix Φ satisfies a restricted isometry property. Define the isometry constant of Φ as the smallest number δ_s such that the following holds for all s-sparse vectors x ∈ R^n:

    (1 - δ_s) ||x||_2^2 ≤ ||Φx||_2^2 ≤ (1 + δ_s) ||x||_2^2    (3)

It has been shown (Candès & Tao, 2005; Candès, 2008) that if y = Φx for some s-sparse vector x and δ_2s < √2 - 1 or δ_s + δ_2s + δ_3s < 1, then the solution to the program (2) is equivalent to the solution of the linear program given below:

    min_{x̃ ∈ R^n} ||x̃||_1   subject to   y = Φx̃    (4)

The above program can be solved in polynomial time using standard linear programming techniques.
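Program (4) becomes an ordinary linear program once x̃ is split into its positive and negative parts. The following sketch is not part of the original paper; it is an illustration, assuming NumPy and SciPy are available, and the function and variable names are ours.

```python
# Illustrative sketch: solving the L1 program (4) as a linear program by writing
# x = u - v with u, v >= 0 and minimizing sum(u) + sum(v) subject to Phi (u - v) = y.
import numpy as np
from scipy.optimize import linprog

def l1_min(Phi, y):
    """Return an approximate minimizer of ||x||_1 subject to Phi x = y."""
    m, n = Phi.shape
    c = np.ones(2 * n)                        # objective sum(u) + sum(v) = ||x||_1
    A_eq = np.hstack([Phi, -Phi])             # Phi u - Phi v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:n] - res.x[n:]

# Small usage example: a 5-sparse signal in R^100 from 40 Gaussian measurements.
rng = np.random.default_rng(0)
m, n, s = 40, 100, 5
Phi = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
x_hat = l1_min(Phi, Phi @ x_true)
print(np.allclose(x_hat, x_true, atol=1e-6))  # typically True in this regime
```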
Approaches to solve the sparse regression problem. In order for this theory to be applied in practice, efficient algorithms to solve program (2) or (4) are needed. This is all the more so in applications such as medical imaging, where the image sizes (n) are in the range 10^7 to 10^9. In these applications, almost-linear-time algorithms with small runtime constants are needed.

Algorithms to solve the sparse regression problems may be classified into "L0-minimization algorithms" that directly attempt to solve the program (2), "L1-minimization algorithms" that find sparse solutions by solving the program (4), and joint encoding/recovery techniques that design the measurement matrix Φ along with an efficient recovery algorithm.

Examples of L0-minimization algorithms include the classical Matching Pursuit (MP) (Mallat & Zhang, 1993), Orthogonal Matching Pursuit (OMP) (Tropp & Gilbert, 2007), stagewise OMP (StOMP) (Donoho et al., 2006), regularized OMP (ROMP) (Needell & Vershynin, 2009), subspace pursuit (Dai & Milenkovic, 2008), CoSaMP (Needell & Tropp, 2008), SAMP (Do et al., 2008) and iterative hard thresholding (IHTs) (Blumensath & Davies, 2008). The L1-minimization algorithms include the classical basis pursuit algorithm (Chen et al., 2001), the Lasso modification of LARS (Efron et al., 2004), random projections onto convex sets (Candès & Romberg, 2004), homotopy (Donoho & Tsaig, 2006), weighted least squares (Jung et al., 2007), and iterative algorithms based on gradient thresholding and other gradient-based approaches (Lustig, 2008; Ma et al., 2008). Joint encoding/recovery techniques, in which the measurement matrix is designed along with the recovery algorithm, include Sudocodes (Sarvotham et al., 2006) and unbalanced expander matrices (Berinde et al., 2008), among others.

While comparing these algorithms, the following key properties need to be examined.

Worst-case computational cost. Is there an upper bound on the number of steps needed by the algorithm? For large-scale problems where mn is in the range 10^10 to 10^13 and s is in the range 10^3 to 10^8, algorithms requiring O(mns) time are not useful. In such cases, algorithms with runtime bounded by a constant multiple of mn are needed.

Runtime in practice. A good asymptotic theoretical bound is not very useful if the runtime constants are very big. From a practical perspective, it is very important to quantify the exact number of steps needed by the algorithm (e.g., the number of iterations and the number of floating point operations per iteration).

Guarantees on recovery. It is important to understand the conditions under which the algorithm is guaranteed to find the correct solution. While some algorithms do not offer any such guarantee, others are guaranteed to find the correct solution if certain conditions on the RIP constants δ_s, δ_2s, δ_3s, ... are satisfied.

A comparison of a few key algorithms for sparse recovery is presented in Table 1. For a meaningful comparison, an attempt is made to report the exact constants in the runtime of these algorithms. However, when the constants are large or are not available, the order notation O(·) is used.

The encoding/recovery techniques that design the measurement matrix Φ are fast and guarantee exact recovery. However, the application of these techniques is restricted to compressed sensing domains where good control is available over the measurement process.

The L1-minimization techniques offer tight guarantees on recovery (δ_2s < √2 - 1 follows from (Candès, 2008) and δ_s + δ_2s + δ_3s < 1 from (Candès & Tao, 2005)), provided the algorithm converges to the optimal solution of the program (4). These techniques are generally applicable to a broader class of problems of optimizing a convex objective function with an L1 penalty. For these techniques, the exact bounds on the runtime are typically either very large or not available.

The L0-minimization techniques are most promising because they are often based on greedy approaches that find the solution quickly. Traditional techniques such as MP (Mallat & Zhang, 1993) or OMP (Tropp & Gilbert, 2007) add one variable at a time to the solution, leading to a runtime of O(mns). However, newer techniques such as StOMP, ROMP, subspace pursuit and IHTs add multiple variables to the solution at a time and fall in the class of near-linear-time algorithms with a runtime of O(mn poly(log(mn))). Most of these techniques also have conditions on the RIP constants under which exact recovery is guaranteed.
Table 1. A comparison of algorithms for sparse recovery. Here m and n denote the number of rows and columns of the matrix Φ, and K denotes the time taken to perform the two matrix operations Φx and Φ^T y; K equals 2mn for dense matrices and may be much less for sparse matrices or for special transforms that can be computed efficiently (e.g., Fourier or wavelet transforms). s denotes the sparsity of the solution to be constructed, L denotes the desired bit-precision of the solution, and δ_t denotes the isometry constants of Φ. Each row lists the cost per iteration; the maximum number of iterations; and the recovery condition.

Basis pursuit (Chen et al., 2001): O((m + n)^3); NA; δ_s + δ_2s + δ_3s < 1 or δ_2s < √2 - 1.
LARS (Efron et al., 2004): K + O(s^2); unbounded; same as above.
Homotopy (Donoho & Tsaig, 2006): K; s; |Φ_i^T Φ_j| ≤ 1/(2s - 1) for all i ≠ j.
Sudocodes (Sarvotham et al., 2006): O(s log(s) log(n)); NA; always.
Unbalanced expander (Berinde et al., 2008): O(n log(n/s)); NA; always.
OMP (Tropp & Gilbert, 2007): K + O(msL); s; |Φ_i^T Φ_j| < 1/(2s) for all i ≠ j.
StOMP (Donoho et al., 2006): K + O(msL); O(1); none.
ROMP (Needell & Vershynin, 2009): K + O(msL); s; δ_2s < 0.03/√(log s).
Subspace pursuit (Dai & Milenkovic, 2008): K + O(msL); L/log((1 - δ_3s)/(10 δ_3s)); δ_3s < 0.06.
SAMP (Do et al., 2008): K + O(msL); s; δ_3s < 0.06.
CoSaMP (Needell & Tropp, 2008): K + O(msL); L; δ_4s < 0.1.
IHTs (Blumensath & Davies, 2008): K; L; δ_3s < 1/√32.
GraDeS (this paper): K; 2L/log((1 - δ_2s)/(2δ_2s)); δ_2s < 1/3.

Our results and techniques. We give the first near-linear-time algorithm that is guaranteed to find a solution to the program (2) under the condition δ_2s < 1/3. This is the most general condition under which the problem is known to be solvable in near-linear time (the best condition known so far was δ_3s < 1/√32, needed by the IHTs algorithm (Blumensath & Davies, 2008)), and it is remarkably close to the condition δ_2s < √2 - 1 (Candès, 2008), which requires a computationally expensive linear programming solver. The algorithm is intuitive, easy to implement and has small runtime constants.

The algorithm (given in Algorithm 1) is called GraDeS, for "Gradient Descent with Sparsification". It starts from an arbitrary sparse x ∈ R^n and iteratively moves along the gradient to reduce the error Ψ(x) = ||y - Φx||^2 by a step length 1/γ, and then performs hard-thresholding to restore the sparsity of the current solution.
The gradient descent step reduces the error Ψ(x) by a constant factor, while the RIP of Φ implies that the sparsification step does not increase the error Ψ(x) by too much. An important contribution of the paper is to analyze how the hard-thresholding function H_s acts with respect to the potential Ψ(x) (see Lemma 2.4). We believe that this analysis may be of independent interest. Overall, a logarithmic number of iterations is needed to reduce the error below a given threshold. A similar analysis also holds for recovery with noise, where the "best" sparse vector x* is only approximately related to the observation y, i.e., y = Φx* + e for some error vector e.

We implemented the algorithm in Matlab and found that it outperformed the publicly available implementations of several newly developed near-linear-time algorithms by an order of magnitude. The trends suggest a higher speedup for larger matrices. We also found that while Lasso provides the best theoretical conditions for exact recovery, its LARS-based implementation is the first to fail as the sparsity is increased.

2. Algorithm GraDeS and its properties

We begin with some notation. For a positive integer n, let [n] = {1, 2, ..., n}. For a vector x ∈ R^n, let supp(x) denote the set of coordinates i ∈ [n] such that x_i ≠ 0; thus ||x||_0 = |supp(x)|. For a non-negative integer s, we say that x is s-sparse if ||x||_0 ≤ s. We also use ||x||_1 = Σ_{i=1}^n |x_i| and ||x||_2 = (Σ_{i=1}^n x_i^2)^{1/2} to denote the ℓ1 and ℓ2 norms of x respectively. For brevity, we use ||x|| to denote ||x||_2.

Definition 2.1 Let H_s : R^n → R^n be the function that sets all but the s largest coordinates in absolute value to zero. More precisely, for x ∈ R^n, let π be a permutation of [n] such that |x_{π(1)}| ≥ ··· ≥ |x_{π(n)}|. Then the vector x' = H_s(x) is given by x'_{π(i)} = x_{π(i)} for i ≤ s and x'_{π(i)} = 0 for i ≥ s + 1.

The above operator is called hard-thresholding and gives the best s-sparse approximation of the vector x, i.e., H_s(x) = argmin_{x'} ||x - x'||^2, where the minimum is taken over all s-sparse vectors x' ∈ R^n.

Algorithm 1 GraDeS(γ) for solving (1).
    Initialize x ← 0.
    while Ψ(x) > ε do
        x ← H_s( x + (1/γ) Φ^T (y - Φx) )
    end while

Noiseless recovery. Our main result for noiseless recovery is given below.

Theorem 2.1 Suppose x* is an s-sparse vector satisfying (1) and the isometry constants of the matrix Φ satisfy δ_2s < 1/3. Algorithm 1 (GraDeS) with γ = 1 + δ_2s computes an s-sparse vector x ∈ R^n such that ||y - Φx||^2 ≤ ε in

    ⌈ log(||y||^2/ε) / log((1 - δ_2s)/(2δ_2s)) ⌉

iterations. Each iteration computes one multiplication of Φ and one multiplication of Φ^T with a vector.

The following corollary follows immediately by setting ε = 2^{-2L}(1 - δ_2s), since ||y - Φx||^2 ≤ ε together with RIP gives ||x - x*||^2 ≤ ε/(1 - δ_2s) = 2^{-2L}.

Corollary 2.2 Suppose there exists an s-sparse vector x* satisfying (1) and the isometry constants of the matrix Φ satisfy δ_2s < 1/3. The vector x* can be approximated up to L bits of accuracy in

    ⌈ (2(L + log ||y||) + log(1/(1 - δ_2s))) / log((1 - δ_2s)/(2δ_2s)) ⌉

iterations of Algorithm 1 with γ = 1 + δ_2s.
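For concreteness, Algorithm 1 can be transcribed almost line for line into NumPy. The sketch below is ours and only illustrative (the paper's own implementation is in Matlab); it uses γ = 4/3 as the default step parameter, the choice suggested later for the case where δ_2s is unknown, and adds an iteration cap of our own as a safeguard.

```python
# Illustrative NumPy transcription of Algorithm 1 (GraDeS); names and the max_iter
# safeguard are ours, not the paper's.
import numpy as np

def hard_threshold(z, s):
    """H_s(z): keep the s largest-magnitude coordinates of z and zero out the rest."""
    out = np.zeros_like(z)
    keep = np.argpartition(np.abs(z), -s)[-s:]
    out[keep] = z[keep]
    return out

def grades(Phi, y, s, gamma=4.0 / 3.0, eps=1e-10, max_iter=1000):
    """GraDeS: iterate x <- H_s(x + (1/gamma) Phi^T (y - Phi x)) while ||y - Phi x||^2 > eps."""
    x = np.zeros(Phi.shape[1])
    for _ in range(max_iter):
        r = y - Phi @ x                      # residual; Psi(x) = ||r||^2
        if r @ r <= eps:
            break
        x = hard_threshold(x + (Phi.T @ r) / gamma, s)
    return x
```

Each pass performs exactly one multiplication by Φ and one by Φ^T plus linear extra work, matching the per-iteration cost K listed for GraDeS in Table 1.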
Recovery with noise. Our result for recovery with noise is as follows.

Theorem 2.3 Suppose x* is an s-sparse vector satisfying y = Φx* + e for an error vector e ∈ R^m, and the isometry constant of the matrix Φ satisfies δ_2s < 1/3. There exists a constant D > 0 that depends only on δ_2s such that Algorithm 1 with γ = 1 + δ_2s computes an s-sparse vector x ∈ R^n satisfying ||x - x*|| ≤ D·||e|| in at most

    ⌈ log(||y||^2/||e||^2) / log((1 - δ_2s)/(4δ_2s)) ⌉

iterations.

2.1. Proof of Theorem 2.1

For x ∈ R^n, let Ψ(x) = ||y - Φx||^2 be a potential function. Algorithm 1 starts with some initial value of x that is s-sparse, say x = 0, and iteratively reduces Ψ(x), while maintaining s-sparsity, until Ψ(x) ≤ ε. In each iteration, we compute the gradient of Ψ(x) = ||y - Φx||^2,

    ∇Ψ(x) = -2 Φ^T (y - Φx).

Then we move in the direction opposite to the gradient by a step length given by the parameter γ, and perform hard-thresholding in order to preserve sparsity. Note that each iteration needs just two multiplications, by Φ and by Φ^T, and a linear amount of extra work.

We now prove Theorem 2.1. Fix an iteration and let x be the current solution. Let x' = H_s(x + (1/γ) Φ^T (y - Φx)) be the solution computed at the end of this iteration. Let Δx' = x' - x and let Δx = x* - x. Note that both Δx' and Δx are 2s-sparse. Now the reduction in the potential in this iteration is given by

    Ψ(x') - Ψ(x) = -2 (y - Φx)^T Φ Δx' + ||Φ Δx'||^2
                 ≤ -2 (y - Φx)^T Φ Δx' + (1 + δ_2s) ||Δx'||^2    (5)
                 ≤ -2 (y - Φx)^T Φ Δx' + γ ||Δx'||^2              (6)

The inequality (5) follows from RIP. Let g = -2 Φ^T (y - Φx). Now we prove an important component of our analysis.

Lemma 2.4 The vector Δx' achieves the minimum of g·v + γ||v||^2 over all vectors v such that x + v is s-sparse, i.e.,

    Δx' = argmin_{v ∈ R^n : ||x + v||_0 ≤ s} ( g·v + γ ||v||^2 ).

Proof. Let v* denote the vector such that x + v* is s-sparse and F(v) = g·v + γ||v||^2 is minimized at v = v*. Let S_O = supp(x) \ supp(x + v*) be the old coordinates of x that are set to zero, let S_N = supp(x + v*) \ supp(x) be the new coordinates of x + v*, and let S_I = supp(x) ∩ supp(x + v*) be the common coordinates. Note that F(v*) = Σ_{i ∈ [n]} ( g_i v*_i + γ (v*_i)^2 ), and that the value of g_i v_i + γ v_i^2 is minimized at v_i = -g_i/(2γ). Therefore v*_i = -g_i/(2γ) for all i ∈ supp(x + v*) and v*_i = -x_i for all i ∈ S_O. Thus we have

    F(v*) = Σ_{i ∈ S_O} ( -g_i x_i + γ x_i^2 ) + Σ_{i ∈ S_I ∪ S_N} ( -g_i^2/(2γ) + g_i^2/(4γ) )
          = Σ_{i ∈ S_O} ( -g_i x_i + γ x_i^2 ) - Σ_{i ∈ S_I ∪ S_N} g_i^2/(4γ)
          = Σ_{i ∈ S_O} γ ( x_i - g_i/(2γ) )^2 - Σ_{i ∈ S_O ∪ S_I ∪ S_N} γ ( g_i/(2γ) )^2.

Thus, in order to minimize F(v*), we have to pick the coordinates i ∈ S_O to be those with the least values of |x_i - g_i/(2γ)|, and the coordinates j ∈ S_N to be those with the largest values of |g_j/(2γ)| = |x_j - g_j/(2γ)| (since x_j = 0 for j ∉ supp(x)). Note that x - g/(2γ) = x + (1/γ) Φ^T (y - Φx), which is exactly the expression used in the while loop of Algorithm 1. Hence the proof is complete from the definition of H_s.
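Lemma 2.4 can be sanity-checked numerically on a tiny instance by brute force over all candidate supports of x + v. The script below is ours and purely illustrative, assuming NumPy; all names are invented for the example.

```python
# Brute-force check of Lemma 2.4 on a small random instance: the v induced by the
# hard-thresholding step attains the minimum of F(v) = g.v + gamma*||v||^2 over all v
# such that x + v is s-sparse.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
m, n, s, gamma = 8, 6, 2, 4.0 / 3.0
Phi = rng.standard_normal((m, n)) / np.sqrt(m)
y = rng.standard_normal(m)
x = np.zeros(n)
x[rng.choice(n, s, replace=False)] = rng.standard_normal(s)   # current s-sparse iterate

g = -2.0 * Phi.T @ (y - Phi @ x)                               # gradient of Psi at x
F = lambda v: g @ v + gamma * (v @ v)

def hard_threshold(z, k):
    out = np.zeros_like(z)
    keep = np.argpartition(np.abs(z), -k)[-k:]
    out[keep] = z[keep]
    return out

dx_alg = hard_threshold(x - g / (2.0 * gamma), s) - x          # the v chosen by GraDeS

best = np.inf
for S in combinations(range(n), s):                            # allowed supports of x + v
    v = -x.copy()                                              # outside S, v must cancel x
    for i in S:
        v[i] = -g[i] / (2.0 * gamma)                           # per-coordinate minimizer
    best = min(best, F(v))

print(np.isclose(F(dx_alg), best))                             # expected: True
```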
From inequality (6), Lemma 2.4, and the fact that x + Δx = x* is s-sparse, we get

    Ψ(x') - Ψ(x) ≤ -2 (y - Φx)^T Φ Δx' + γ ||Δx'||^2
                 ≤ -2 (y - Φx)^T Φ Δx + γ ||Δx||^2
                 = -2 (y - Φx)^T Φ Δx + (1 - δ_2s) ||Δx||^2 + (γ - 1 + δ_2s) ||Δx||^2
                 ≤ -2 (y - Φx)^T Φ Δx + ||Φ Δx||^2 + (γ - 1 + δ_2s) ||Δx||^2    (7)
                 = Ψ(x*) - Ψ(x) + (γ - 1 + δ_2s) ||Δx||^2
                 ≤ -Ψ(x) + ((γ - 1 + δ_2s)/(1 - δ_2s)) ||Φ Δx||^2    (8)
                 = -Ψ(x) + ((γ - 1 + δ_2s)/(1 - δ_2s)) Ψ(x)           (9)

The inequalities (7) and (8) follow from RIP, while (9) follows from the fact that ||Φ Δx||^2 = ||y - Φx||^2 = Ψ(x). Thus we get Ψ(x') ≤ Ψ(x) · (γ - 1 + δ_2s)/(1 - δ_2s).

Now note that δ_2s < 1/3 implies that (γ - 1 + δ_2s)/(1 - δ_2s) < 1 for γ = 1 + δ_2s, and hence the potential decreases by a multiplicative factor in each iteration. If we start with x = 0, the initial potential is ||y||^2. Thus after

    ⌈ log(||y||^2/ε) / log((1 - δ_2s)/(γ - 1 + δ_2s)) ⌉

iterations, the potential becomes less than ε. Setting γ = 1 + δ_2s, we get the desired bounds. If the value of δ_2s is not known, then by setting γ = 4/3 the same algorithm computes the above solution in

    ⌈ log(||y||^2/ε) / log((1 - δ_2s)/(1/3 + δ_2s)) ⌉

iterations.

2.2. Proof of Theorem 2.3

Recovery under noise corresponds to the case where there exists an s-sparse vector x* ∈ R^n such that

    y = Φx* + e    (10)

for some error vector e ∈ R^m. In order to prove Theorem 2.3, we prove the following stronger lemma.

Lemma 2.5 Let C > 0 be a constant satisfying δ_2s < C^2/(3C^2 + 4C + 2). Algorithm 1 with γ = 1 + δ_2s computes an s-sparse vector x ∈ R^n such that ||y - Φx||^2 ≤ C^2 ||e||^2 in

    ⌈ log( ||y||^2/(C^2 ||e||^2) ) / log( ((1 - δ_2s)/(2δ_2s)) · C^2/(C + 1)^2 ) ⌉

iterations.

Note that the output x of the above lemma satisfies

    ||Φx* - Φx||^2 = ||y - Φx - e||^2 = ||y - Φx||^2 - 2 e^T (y - Φx) + ||e||^2
                   ≤ C^2 ||e||^2 + 2C ||e||·||e|| + ||e||^2 = (C + 1)^2 ||e||^2.

This, combined with RIP, implies that ||x* - x||^2 ≤ (C + 1)^2 ||e||^2 / (1 - δ_2s), thereby proving Theorem 2.3.

Proof of Lemma 2.5. We work with the same notation as in Section 2.1. Let the current solution x satisfy Ψ(x) = ||y - Φx||^2 > C^2 ||e||^2, let x' = H_s(x + (1/γ) Φ^T (y - Φx)) be the solution computed at the end of an iteration, and let Δx' = x' - x and Δx = x* - x. The initial part of the analysis is the same as in Section 2.1. Using (10), we get

    ||Φ Δx||^2 = ||y - Φx - e||^2 = Ψ(x) - 2 e^T (y - Φx) + ||e||^2
               ≤ Ψ(x) + (2/C) Ψ(x) + (1/C^2) Ψ(x) = (1 + 1/C)^2 Ψ(x).

Using the above inequality in inequality (8), we get

    Ψ(x') ≤ ((γ - 1 + δ_2s)/(1 - δ_2s)) · (1 + 1/C)^2 · Ψ(x) = (2δ_2s/(1 - δ_2s)) · (1 + 1/C)^2 · Ψ(x).

Our assumption δ_2s < C^2/(3C^2 + 4C + 2) implies that (2δ_2s/(1 - δ_2s)) · (1 + 1/C)^2 < 1. This implies that as long as Ψ(x) > C^2 ||e||^2, the potential Ψ(x) decreases by a multiplicative factor. Lemma 2.5 thus follows.
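To make the constants in Lemma 2.5 concrete, the short computation below (ours, following the formulas above with arbitrary sample values of δ_2s, C, ||y||^2 and ||e||^2) evaluates the per-iteration contraction factor and the resulting iteration bound.

```python
# Numeric illustration of Lemma 2.5: the potential shrinks by
# rho = (2*delta_2s/(1 - delta_2s)) * (1 + 1/C)^2 per iteration, which is < 1 exactly
# when delta_2s < C^2/(3C^2 + 4C + 2).  The values below are arbitrary examples.
import math

def contraction_factor(delta_2s, C):
    return (2.0 * delta_2s / (1.0 - delta_2s)) * (1.0 + 1.0 / C) ** 2

def lemma_2_5_iterations(delta_2s, C, y_norm_sq, e_norm_sq):
    rho = contraction_factor(delta_2s, C)
    return math.ceil(math.log(y_norm_sq / (C ** 2 * e_norm_sq)) / math.log(1.0 / rho))

delta_2s, C = 0.1, 2.0
print(delta_2s < C ** 2 / (3 * C ** 2 + 4 * C + 2))          # True, so the lemma applies
print(contraction_factor(delta_2s, C))                       # 0.5: Psi roughly halves per step
print(lemma_2_5_iterations(delta_2s, C, 1.0, 1e-6))          # 18 iterations for these values
```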
3. Implementation Results

To evaluate the usefulness of GraDeS in practice, it was compared with the LARS-based implementation of Lasso (Efron et al., 2004) (referred to as Lasso/LARS), Orthogonal Matching Pursuit (OMP), Stagewise OMP (StOMP) and Subspace Pursuit. The algorithms were selected to span the spectrum: at one extreme, Lasso/LARS, which provides the best recovery conditions but a poor bound on its runtime, and at the other extreme, near-linear-time algorithms with weaker recovery conditions but very good run times.

Matlab implementations of all the algorithms were used for the evaluation. The SparseLab (Donoho & Others, 2009) package, which contains the implementations of the Lasso/LARS, OMP and StOMP algorithms, was used. In addition, a publicly available optimized Matlab implementation of Subspace Pursuit, provided by one of its authors, was used. The algorithm GraDeS was also implemented in Matlab for a fair comparison. GraDeS(γ = 4/3), for which all our results hold, was evaluated along with GraDeS(γ = 3), which was a bit slower but had better recovery properties. All the experiments were run on a 3 GHz dual-core Pentium system with 3.5 GB memory running the Linux operating system. Care was taken to ensure that no other program was actively consuming CPU time while the experiments were in progress.

First, a random matrix Φ of size m × n with i.i.d. normally distributed entries and a random s-sparse vector x* were generated. The columns of Φ were normalized to zero mean and unit norm. The vector y = Φx* was computed. The matrix Φ, the vector y and the parameter s were then given to each of the algorithms as input, and the output of every algorithm was compared with the correct solution x*.

Figure 1 shows the runtime of all the algorithms as a function of the number of rows m.

Figure 1. Dependence of the running times (in seconds) of the different algorithms on the number of rows m. Here the number of columns is n = 8000 and the sparsity is s = 500.

The algorithms Lasso/LARS and OMP take the maximum time, which increases linearly with m. They are followed by StOMP and Subspace Pursuit, which also show a linear increase in runtime as a function of m. The algorithm GraDeS(4/3) has the smallest runtime, which is a factor 20 better than the runtime of Lasso/LARS and a factor 7 better than that of Subspace Pursuit for m = 8000, n = 8000 and s = 500. GraDeS(3) takes more time than GraDeS(4/3), as expected. It is surprising to observe that, contrary to expectations, the runtime of GraDeS did not increase substantially as m was increased: GraDeS needed fewer iterations, offsetting the additional time needed to compute products with matrices of larger m.

Figure 2 shows the runtime of these algorithms as the number of columns n is varied.

Figure 2. Dependence of the running times (in seconds) of the different algorithms on the number of columns n. Here the number of rows is m = 4000 and the sparsity is s = 500.

Here, all the algorithms scale roughly linearly with n, as expected. In this case, Lasso/LARS did not give correct results for n > 8000 and hence its runtime was omitted from the graph. As expected, the runtime of GraDeS is an order of magnitude smaller than that of OMP and Lasso/LARS.

Figure 3 shows the runtime of the algorithms as a function of the sparsity parameter s.

Figure 3. Dependence of the running times (in seconds) of the different algorithms on the sparsity s. Here the number of rows is m = 5000 and the number of columns is n = 10000.

The increase in the runtime of Lasso/LARS and OMP is super-linear (as opposed to the linear theoretical bound for OMP). Although the theoretical analyses of StOMP and Subspace Pursuit show very little dependence on s, their actual run times do increase significantly as s is increased. In contrast, the run times of GraDeS increase only marginally (due to a small increase in the number of iterations needed for convergence) as s is increased. For s = 600, GraDeS(4/3) is a factor 20 faster than Lasso/LARS and a factor 7 faster than StOMP, the second best algorithm after GraDeS.

Table 2 shows the recovery results for these algorithms.

Table 2. A comparison of the different algorithms (Lasso/LARS, OMP, StOMP, Subspace Pursuit, GraDeS(3), GraDeS(4/3)) for various parameter values (m, n, s): (3000, 10000, 2100), (3000, 10000, 1050), (3000, 8000, 500), (3000, 10000, 600), (6000, 8000, 500), (3000, 10000, 300), (4000, 10000, 500) and (8000, 8000, 500). An entry "Y" indicates that the algorithm could recover a sparse solution, while an empty entry indicates otherwise.

Quite surprisingly, although the theoretical recovery conditions for Lasso are the most general (see Table 1), the LARS-based implementation of Lasso was the first to fail in recovery as s was increased. A careful examination revealed that as s was increased, the output of Lasso/LARS became sensitive to the error tolerance parameters used in the implementation. With careful tuning of these parameters, the recovery of Lasso/LARS could be improved. It could be improved further by replacing some incremental computations with slower non-incremental computations; one such computation was the incremental Cholesky factorization. These changes adversely impacted the running time of the algorithm, increasing its cost per iteration from K + O(s^2) to 2K + O(s^3). Even after these changes, some instances were discovered for which Lasso/LARS did not produce the correct sparse solution (though it computed the optimal solution of the program (4)), but GraDeS(3) found the correct solution. However, instances for which Lasso/LARS gave the correct sparse solution but GraDeS failed were also found.
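The data-generation protocol described in this section is easy to reproduce at a smaller scale. The sketch below is ours, not the paper's Matlab harness; it builds such instances and times the grades() function from the sketch given after Algorithm 1, which is the only non-standard name it assumes, and it uses problem sizes smaller than those in the paper's experiments.

```python
# Scaled-down sketch of the experimental protocol: Gaussian Phi with columns normalised
# to zero mean and unit norm, a random s-sparse x*, y = Phi x*, then time the solver and
# check recovery.  Assumes the grades() sketch given after Algorithm 1 is in scope.
import time
import numpy as np

def make_instance(m, n, s, rng):
    Phi = rng.standard_normal((m, n))
    Phi -= Phi.mean(axis=0)                        # zero-mean columns
    Phi /= np.linalg.norm(Phi, axis=0)             # unit-norm columns
    x_star = np.zeros(n)
    x_star[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
    return Phi, x_star, Phi @ x_star

rng = np.random.default_rng(3)
for m in (1000, 2000, 3000):                       # sweep the number of rows, as in Figure 1
    Phi, x_star, y = make_instance(m, n=4000, s=200, rng=rng)
    t0 = time.perf_counter()
    x_hat = grades(Phi, y, s=200, gamma=4.0 / 3.0)
    print(m, round(time.perf_counter() - t0, 3),
          bool(np.linalg.norm(x_hat - x_star) < 1e-6))
```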
4. Conclusions

In summary, we have presented an efficient algorithm for solving the sparse reconstruction problem provided the isometry constants of the constraint matrix satisfy δ_2s < 1/3. Although the recovery conditions for GraDeS are stronger than those for L1-regularized regression (Lasso), our results indicate that whenever Lasso/LARS finds the correct solution, GraDeS also finds it. Conversely, there are cases where GraDeS (and other algorithms) find the correct solution but Lasso/LARS fails due to numerical issues. In the absence of efficient and numerically stable algorithms, it is not clear whether L1-regularization offers any advantages to practitioners over simpler and faster algorithms such as OMP or GraDeS when the matrix Φ satisfies RIP. A systematic study is needed to explore this. Finally, finding more general conditions than RIP under which the sparse reconstruction problem can be solved efficiently is a very challenging open question.

References

Berinde, R., Indyk, P., & Ruzic, M. (2008). Practical near-optimal sparse recovery in the L1 norm. Allerton Conference on Communication, Control, and Computing. Monticello, IL.

Blumensath, T., & Davies, M. E. (2008). Iterative hard thresholding for compressed sensing. Preprint.

Candès, E., & Wakin, M. (2008). An introduction to compressive sampling. IEEE Signal Processing Magazine, 25, 21-30.

Candès, E. J. (2008). The restricted isometry property and its implications for compressed sensing. Comptes Rendus de l'Academie des Sciences, Paris, 1, 589-592.

Candès, E. J., & Romberg, J. (2004). Practical signal recovery from random projections. SPIN Conference on Wavelet Applications in Signal and Image Processing.

Candès, E. J., Romberg, J. K., & Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52, 489-509.

Candès, E. J., & Tao, T. (2005). Decoding by linear programming. IEEE Transactions on Information Theory, 51, 4203-4215.

Candès, E. J., & Tao, T. (2006). Near-optimal signal recovery from random projections: Universal encoding strategies?
IEEE Transactions on Information Theory, 52, 5406-5425.

Chen, S. S., Donoho, D. L., & Saunders, M. A. (2001). Atomic decomposition by basis pursuit. SIAM Review, 43, 129-159.

Dai, W., & Milenkovic, O. (2008). Subspace pursuit for compressive sensing: Closing the gap between performance and complexity. ArXiv e-prints.

Do, T. T., Gan, L., Nguyen, N., & Tran, T. D. (2008). Sparsity adaptive matching pursuit algorithm for practical compressed sensing. Asilomar Conference on Signals, Systems, and Computers. Pacific Grove, California.

Donoho, D., & Others (2009). SparseLab: Seeking sparse solutions to linear systems of equations. http://sparselab.stanford.edu/.

Donoho, D. L., & Tsaig, Y. (2006). Fast solution of L1 minimization problems when the solution may be sparse. Technical Report, Institute for Computational and Mathematical Engineering, Stanford University.

Donoho, D. L., Tsaig, Y., Drori, I., & Starck, J.-L. (2006). Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit. Preprint.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407-499.

Figueiredo, M. A. T., & Nowak, R. D. (2005). A bound optimization approach to wavelet-based image deconvolution. IEEE International Conference on Image Processing (ICIP) (pp. 782-785).

Golub, G., & Loan, C. V. (1996). Matrix computations, 3rd ed. Johns Hopkins University Press.

Jung, H., Ye, J. C., & Kim, E. Y. (2007). Improved k-t BLAST and k-t SENSE using FOCUSS. Phys. Med. Biol., 52, 3201-3226.

Lustig, M. (2008). Sparse MRI. Ph.D. Thesis, Stanford University.

Ma, S., Yin, W., Zhang, Y., & Chakraborty, A. (2008). An efficient algorithm for compressed MR imaging using total variation and wavelets. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1-8).

Mallat, S. G., & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 3397-3415.

Natarajan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24, 227-234.

Needell, D., & Tropp, J. A. (2008). CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26, 301-321.

Needell, D., & Vershynin, R. (2009). Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit. Foundations of Computational Mathematics, 9, 317-334.

Neylon, T. (2006). Sparse solutions for linear prediction problems. Doctoral dissertation, Courant Institute, New York University.

Ranzato, M., Boureau, Y.-L., & LeCun, Y. (2007). Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems 20 (NIPS), 1185-1192. MIT Press.

Sarvotham, S., Baron, D., & Baraniuk, R. (2006). Sudocodes - fast measurement and reconstruction of sparse signals. IEEE International Symposium on Information Theory (ISIT) (pp. 2804-2808). Seattle, Washington.

Tropp, J. A., & Gilbert, A. C. (2007). Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53, 4655-4666.

Wainwright, M. J., Ravikumar, P., & Lafferty, J. D. (2006). High-dimensional graphical model selection using ℓ1-regularized logistic regression. In Advances in Neural Information Processing Systems 19 (NIPS'06), 1465-1472. Cambridge, MA: MIT Press.

Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15, 262-286.