Submodular Dictionary Selection for Sparse Representation

Andreas Krause (krausea@caltech.edu)
California Institute of Technology, Computer Science Department

Volkan Cevher (volkan.cevher@{epfl,idiap}.ch)
Ecole Polytechnique Federale de Lausanne, STI-IEL-LIONS & Idiap Research Institute

Abstract

We develop an efficient learning framework to construct signal dictionaries for sparse representation by selecting the dictionary columns from multiple candidate bases. By sparse, we mean that only a few dictionary elements, compared to the ambient signal dimension, can exactly represent or well-approximate the signals of interest. We formulate both the selection of the dictionary columns and the sparse representation of signals as a joint combinatorial optimization problem. The proposed combinatorial objective maximizes variance reduction over the set of training signals by constraining the size of the dictionary as well as the number of dictionary columns that can be used to represent each signal. We show that if the available dictionary column vectors are incoherent, our objective function satisfies approximate submodularity. We exploit this property to develop SDS_OMP and SDS_MA, two greedy algorithms with approximation guarantees. We also describe how our learning framework enables dictionary selection for structured sparse representations, e.g., where the sparse coefficients occur in restricted patterns. We evaluate our approach on synthetic signals and natural images for representation and inpainting problems.

1. Introduction

An important problem in machine learning, signal processing and computational neuroscience is to determine a dictionary of basis functions for sparse representation of signals. A signal $y \in \mathbb{R}^d$ has a sparse representation with $y = D\alpha$ in a dictionary $D \in \mathbb{R}^{d \times n}$ when $k \ll d$ coefficients of $\alpha$ can exactly represent or well-approximate $y$. Myriad applications in data analysis and processing, from deconvolution to data mining and from compression to compressive sensing, involve such representations. Surprisingly, there are only two main approaches for determining data-sparsifying dictionaries: dictionary design and dictionary learning.

In dictionary design, researchers assume an abstract functional space that can concisely capture the underlying characteristics of the signals. A classical example is based on Besov spaces and the set of natural images, for which the Besov norm measures spatial smoothness between edges (cf. Choi & Baraniuk (2003) and the references therein). Along with the functional space, a matching dictionary is naturally introduced, e.g., wavelets (W) for Besov spaces, to efficiently calculate the induced norm. Then, the rate distortion of the partial signal reconstructions $y_k^D$ is quantified by keeping the $k$ largest dictionary elements via an $\ell_p$ norm, such as
$$\sigma_p(y, y_k^D) = \|y - y_k^D\|_p = \Big(\sum_{i=1}^d |y_i - y_{k,i}^D|^p\Big)^{1/p};$$
the faster $\sigma_p(y, y_k^D)$ decays with $k$, the better the observations can be compressed. While the designed dictionaries have well-characterized rate distortion and approximation performance on signals in the assumed functional space, they are data-independent and hence their empirical performance on the actual observations can greatly vary: $\sigma_2(y, y_k^W) = O(k^{-0.1})$ (practice) vs. $O(k^{-0.5})$ (theory) for wavelets on natural images (Cevher, 2008).

In dictionary learning, researchers develop algorithms to learn a dictionary for sparse representation directly from data using techniques such as regularization, clustering, and nonparametric Bayesian inference. Regularization-based approaches define an objective function that minimizes the data error, regularized by the $\ell_1$ or the total variation (TV) norms to enforce sparsity under the dictionary representation. The proposed objective function is then jointly optimized over the dictionary entries and the sparse coefficients (Olshausen & Field, 1996; Zhang & Chan, 2009; Mairal et al., 2008). Clustering approaches learn dictionaries by sequentially determining clusters where sparse coefficients overlap on the dictionary and then updating the corresponding dictionary elements based on singular value decomposition (Aharon et al., 2006).
Bayesian approaches use hierarchical probability models to nonparametrically infer the dictionary size and its composition (Zhou et al., 2009). Although dictionary learning approaches have great empirical performance on many data sets in denoising and inpainting of natural images, they lack the theoretical rate distortion characterizations of the dictionary design approaches.

In this paper, we investigate a hybrid approach between dictionary design and learning. We propose a learning framework based on dictionary selection: we build a sparsifying dictionary for a set of observations by selecting the dictionary columns from multiple candidate bases, typically designed for the observations of interest. We constrain the size of the dictionary as well as the number of dictionary columns that can be used to represent each signal with user-defined parameters $n$ and $k$, respectively. We formulate both the selection of basis functions and the sparse reconstruction as a joint combinatorial optimization problem. Our objective function maximizes a variance reduction metric over the set of observations. We then propose SDS_OMP and SDS_MA, two computationally efficient, greedy algorithms for dictionary selection. We show that under certain incoherence assumptions on the candidate vectors, the dictionary selection problem is approximately submodular, and we use this insight to derive theoretical performance guarantees for our algorithms. We also demonstrate that our framework naturally extends to dictionary selection with restrictions on the allowed sparsity patterns in signal representation. As a stylized example, we study a dictionary selection problem where the sparse signal coefficients exhibit block sparsity, e.g., sparse coefficients appear in pre-specified blocks. Lastly, we evaluate the performance of our algorithms on both synthetic and real data.

Our main contributions can be summarized as follows:

1. We introduce the problem of dictionary selection and cast the dictionary learning/design problems in a new, discrete optimization framework.
2. We propose new algorithms and provide their theoretical performance characterizations by exploiting a geometric connection between submodularity and sparsity.
3. We extend our dictionary selection framework to allow structured sparse representations.
4. We evaluate our approach on several real-world sparse representation and image inpainting problems and show that it provides practical insights into existing image coding standards.

2. The dictionary selection problem

In the dictionary selection problem (DiSP), we seek a dictionary $D$ to sparsely represent a given collection of signals $Y = \{y_1, \dots, y_m\} \in \mathbb{R}^{d \times m}$. We compose $D$ using the variance reduction metric, defined below, by selecting a subset of a candidate vector set $V = \{\phi_1, \dots, \phi_N\} \in \mathbb{R}^{d \times N}$.
Without loss of generality, we assume $\|y_i\|_2 \le 1$ and $\|\phi_i\|_2 = 1$ for all $i$. In the sequel, we define $\Phi_A = [\phi_{i_1}, \dots, \phi_{i_Q}]$ as a matrix containing the vectors in $V$ as indexed by $A = \{i_1, \dots, i_Q\}$, where $A \subseteq V$ and $Q = |A|$ is the cardinality of the set $A$. We do not assume any particular ordering of $V$.

DiSP objectives: For a fixed signal $y_s$ and a set of vectors $A$, we define the reconstruction accuracy as
$$L_s(A) = \sigma_2(y_s, y_{s\|A})^2 = \min_w \|y_s - \Phi_A w\|_2^2, \qquad (1)$$
where $y_{s\|A}$ denotes the best approximation of $y_s$ in the span of $\Phi_A$. The problem of optimal $k$-sparse representation with respect to a fixed dictionary $D$ then requires solving the following discrete optimization problem:
$$A_s^{\ast} = \operatorname{argmin}_{A \subseteq D,\, |A| \le k} L_s(A), \qquad (2)$$
where $k$ is the user-defined sparsity constraint on the number of columns in the reconstruction.

In DiSP, we are interested in determining a dictionary $D \subseteq V$ that obtains the best possible reconstruction accuracy not only for a single signal but for all signals in $Y$. Each signal $y_s$ can potentially use different columns $A_s \subseteq D$ for representation; we thus define
$$F_s(D) = L_s(\emptyset) - \min_{A \subseteq D,\, |A| \le k} L_s(A), \qquad (3)$$
where $F_s(D)$ measures the improvement in reconstruction accuracy, also known as variance reduction, for the signal $y_s$ and the dictionary $D$. Moreover, we define the average improvement for all signals as $F(D) = \frac{1}{m}\sum_s F_s(D)$. The optimal solution to the DiSP is then given by
$$D^{\ast} = \operatorname{argmax}_{|D| \le n} F(D), \qquad (4)$$
where $n$ is a user-defined constraint on the number of dictionary columns. For instance, if we are interested in selecting a basis, we have $n = d$.

DiSP challenges: The optimization problem in (4) presents two combinatorial challenges. (C1) Evaluating $F_s(D)$ requires finding the set $A_s$ of $k$ basis functions, out of exponentially many options, for the best reconstruction accuracy of $y_s$. (C2) Even if we could evaluate $F_s$, we would have to search over an exponential number of possible dictionaries to determine $D^{\ast}$ for all signals. Even the special case of $k = n$ is NP-hard (Davis et al., 1997). To circumvent these combinatorial challenges, the existing dictionary learning work relies on continuous relaxations, such as replacing the combinatorial sparsity constraint with the $\ell_1$-norm of the dictionary representation of the signal. However, these approaches result in non-convex objectives, and the performance of such relaxations is typically not well-characterized for dictionary learning.

3. Submodularity in sparse representation

In this section, we first describe a key structure in the DiSP objective function: approximate submodularity. We then relate this structure to a geometric property of the candidate vector set, called incoherence. We use these two concepts to develop efficient algorithms with provable guarantees in the next section.

Approximate submodularity in DiSP: To define this concept, we first note that $F(\emptyset) = 0$ and whenever $D \subseteq D'$ then $F(D) \le F(D')$, i.e., $F$ increases monotonically with $D$. In the sequel, we will show that $F$ is approximately submodular: a set function $F$ is called approximately submodular with constant $\varepsilon$ if, for $D \subseteq D' \subseteq V$ and $v \in V \setminus D'$, it holds that
$$F(D \cup \{v\}) - F(D) \ge F(D' \cup \{v\}) - F(D') - \varepsilon. \qquad (5)$$
In the context of DiSP, the above definition implies that adding a new column $v$ to a larger dictionary $D'$ helps at most $\varepsilon$ more than adding $v$ to a subset $D \subseteq D'$. When $\varepsilon = 0$, the set function is called submodular.
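Before turning to the greedy machinery below, the following minimal sketch (our own toy code, not the authors' implementation; all function names and the toy data are ours) makes the quantities in (1)-(4) concrete: $L_s(A)$ is a least-squares residual, $F_s(D)$ is computed by brute force over the subsets of $D$ of size at most $k$, which is feasible only for tiny problems and is exactly challenge (C1), and $F(D)$ averages over the training signals. The same routines can be used to check the approximate submodularity condition (5) numerically on small instances.

```python
import itertools
import numpy as np

def reconstruction_error(y, Phi, A):
    """L_s(A) = min_w ||y - Phi_A w||_2^2, eq. (1); A is a tuple of column indices."""
    if len(A) == 0:
        return float(y @ y)
    Phi_A = Phi[:, list(A)]
    w, *_ = np.linalg.lstsq(Phi_A, y, rcond=None)
    r = y - Phi_A @ w
    return float(r @ r)

def variance_reduction(y, Phi, D, k):
    """F_s(D) of eq. (3), evaluated by exhaustive search over size-<=k subsets of D."""
    best = min(reconstruction_error(y, Phi, A)
               for q in range(1, k + 1)
               for A in itertools.combinations(D, q))
    return reconstruction_error(y, Phi, ()) - best

def objective(Y, Phi, D, k):
    """F(D): average variance reduction over the training signals (columns of Y)."""
    return float(np.mean([variance_reduction(Y[:, s], Phi, D, k)
                          for s in range(Y.shape[1])]))

# Toy instance: d = 8-dimensional signals, N = 12 unit-norm candidate columns.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((8, 12))
Phi /= np.linalg.norm(Phi, axis=0)          # ||phi_i||_2 = 1
Y = rng.standard_normal((8, 5))
Y /= np.maximum(1.0, np.linalg.norm(Y, axis=0))   # enforce ||y_s||_2 <= 1
print(objective(Y, Phi, D=[0, 2, 5, 7], k=2))
```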
A fundamental result by Nemhauser et al. (1978) proves that for monotonic submodular functions $G$ with $G(\emptyset) = 0$, a simple greedy algorithm that starts with the empty set $D_0 = \emptyset$ and at every iteration $i$ adds a new element via
$$v_i = \operatorname{argmax}_{v \in V \setminus D_{i-1}} G(D_{i-1} \cup \{v\}), \qquad (6)$$
where $D_i = \{v_1, \dots, v_i\}$, obtains a near-optimal solution. That is, for the solution $D_n$ returned by the greedy algorithm, we have the following guarantee:
$$G(D_n) \ge (1 - 1/e) \max_{|D| \le n} G(D). \qquad (7)$$
The solution $D_n$ hence obtains at least a constant fraction of $(1 - 1/e) \approx 63\%$ of the optimal value. Using similar arguments, Krause et al. (2008) show that the same greedy algorithm, when applied to $\varepsilon$-approximately submodular functions, instead inherits the following, slightly weaker, guarantee:
$$F(D_n) \ge (1 - 1/e) \max_{|D| \le n} F(D) - n\varepsilon. \qquad (8)$$
In Section 4, we explain how this greedy algorithm can be adapted to DiSP. But first, we elaborate on how $\varepsilon$ depends on the candidate vector set $V$.

Figure 1. Example geometry in DiSP. (Left) Minimum error decomposition. (Right) Modular decomposition.

Geometry in DiSP (incoherence): The approximate submodularity of $F$ explicitly depends on the maximum incoherency $\mu$ of $V = [\phi_1, \dots, \phi_N]$:
$$\mu = \max_{(i,j),\, i \ne j} |\langle \phi_i, \phi_j \rangle| = \max_{(i,j),\, i \ne j} |\cos \theta_{i,j}|,$$
where $\theta_{i,j}$ is the angle between the vectors $\phi_i$ and $\phi_j$. The following theorem establishes a key relationship between $\varepsilon$ and $\mu$ for DiSP.

Theorem 1. If $V$ has incoherence $\mu$, then the variance reduction objective $F$ in DiSP is $\varepsilon$-approximately submodular with $\varepsilon \le 4k\mu$.

Proof. Let $w_{s,v} = \langle \phi_v, y_s \rangle^2$. When $V$ is an orthonormal basis, the reconstruction accuracy in (1) can be written as
$$L_s(A) = \Big\|y_s - \sum_{q=1}^{Q} \phi_{i_q} \langle y_s, \phi_{i_q} \rangle\Big\|_2^2 = \|y_s\|_2^2 - \sum_{v \in A} w_{s,v}.$$
Hence the function $R_s(A) \triangleq L_s(\emptyset) - L_s(A) = \sum_{v \in A} w_{s,v}$ is additive (modular). It can be seen that then $F_s(D) = \max_{A \subseteq D, |A| \le k} R_s(A)$ is submodular.

Now suppose $V$ is incoherent with constant $\mu$. Let $A \subseteq V$ and $v \in V \setminus A$. Then we claim that $|R_s(A \cup \{v\}) - R_s(A) - w_{s,v}| \le \mu$. Consider the special case where $y_s$ is in the span of $\Phi_A$ and $\phi_v$, and, w.l.o.g., $\|y_s\|_2 = 1$; refer to Fig. 1 for an illustration. The reconstruction accuracy as defined in (1) has a well-known closed form solution: $L_s(A) = \min_w \|y_s - \Phi_A w\|_2^2 = \|y_s - \Phi_A \Phi_A^{\dagger} y_s\|_2^2$, where $\dagger$ denotes the pseudoinverse; the matrix product $P = \Phi_A \Phi_A^{\dagger}$ is simply the projection of the signal $y_s$ onto the subspace spanned by $\Phi_A$. We therefore have $R_s(A) = 1 - \sin^2(\beta)$, $R_s(A \cup \{v\}) = 1$, and $R_s(\{v\}) = 1 - \sin^2(\beta - \theta)$, where $\theta$ and $\beta$ are defined in Fig. 1. We can thus bound $|R_s(A \cup \{v\}) - R_s(A) - w_{s,v}|$ by
$$\max_{\beta} \big|\sin^2(\beta - \theta) + \sin^2(\beta) - 1\big| = |\cos \theta| \max_{\beta} |\cos(\theta - 2\beta)| \le \mu.$$
If $y_s$ is not in the span of $A \cup \{v\}$, we apply the above reasoning to the projection of $y_s$ onto their span.

Define $\hat R_s(A) = \sum_{v \in A} w_{s,v}$. Then, by induction, we have $|R_s(A) - \hat R_s(A)| \le k\mu$ for all $A$ with $|A| \le k$. Note that the function $\hat F_s(D) = \max_{A \subseteq D, |A| \le k} \hat R_s(A)$ is submodular. Let $A_s^{\ast} = \operatorname{argmax}_{A \subseteq D, |A| \le k} R_s(A)$ and $\hat A_s = \operatorname{argmax}_{A \subseteq D, |A| \le k} \hat R_s(A)$. Therefore, it holds that $F_s(D) = R_s(A_s^{\ast}) \le \hat R_s(A_s^{\ast}) + k\mu \le \hat R_s(\hat A_s) + k\mu = \hat F_s(D) + k\mu$. Similarly, $\hat F_s(D) \le F_s(D) + k\mu$. Thus, $|F_s(D) - \hat F_s(D)| \le k\mu$, and hence $|F(D) - \hat F(D)| \le k\mu$ holds for all candidate dictionaries $D$. Therefore, whenever $D \subseteq D'$ and $v \notin D'$, we obtain
$$F(D \cup \{v\}) - F(D) - F(D' \cup \{v\}) + F(D') \ge \hat F(D \cup \{v\}) - \hat F(D) - \hat F(D' \cup \{v\}) + \hat F(D') - 4k\mu \ge -4k\mu,$$
which proves the claim.

When the incoherency $\mu$ is small, the approximation guarantee in (8) is quite useful. There has been a significant body of work establishing the existence and construction of collections $V$ of columns with low coherence $\mu$.
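The $d^{-1/2}$ scaling discussed next is easy to check numerically. The sketch below (our own toy example, with our function names; not from the paper) computes $\mu$ for the union of the identity and an orthonormal DCT-II basis in $\mathbb{R}^{64}$, for which the incoherence is roughly $\sqrt{2/d}$, i.e., on the order of $d^{-1/2}$.

```python
import numpy as np

def incoherence(Phi):
    """mu = max_{i != j} |<phi_i, phi_j>| for unit-norm columns of Phi."""
    G = np.abs(Phi.T @ Phi)
    np.fill_diagonal(G, 0.0)
    return float(G.max())

def dct_basis(d):
    """Orthonormal DCT-II basis of R^d (columns are the basis vectors)."""
    i = np.arange(d)
    C = np.cos(np.pi * (i[None, :] + 0.5) * i[:, None] / d) * np.sqrt(2.0 / d)
    C[0, :] /= np.sqrt(2.0)      # first basis vector gets the 1/sqrt(d) scaling
    return C.T

d = 64
V = np.hstack([np.eye(d), dct_basis(d)])   # union of two orthonormal bases
print(incoherence(V), np.sqrt(2.0 / d))    # both close to sqrt(2/64) ~ 0.177
```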
For example, it is possible to achieve incoherence $\mu = d^{-1/2}$ with the union of $d/2$ orthonormal bases (cf. Theorem 2 of Gribonval & Nielsen (2002)). Unfortunately, when $n = \Omega(d)$ and $\varepsilon = 4k\mu$, the guarantee (8) is vacuous, since the maximum value of $F$ for DiSP is 1. In Section 4, we will show that if, instead of greedily optimizing $F$, we optimize a modular approximation $\hat F_s$ of $F_s$ (as defined below), we can improve the approximation error from $O(nk\mu)$ to $O(k\mu)$.

A modular approximation to DiSP: The key idea behind the proof of Theorem 1 is that for incoherent dictionaries the variance reduction $R_s(A) = L_s(\emptyset) - L_s(A)$ is approximately additive (modular). We exploit this observation by optimizing a new objective $\hat F$ that approximates $F$ by disregarding the non-orthogonality of $V$ in sparse representation. We do this by replacing the weight calculation $w_{s,A} = \Phi_A^{\dagger} y_s$ in $F$ with $\hat w_{s,A} = \Phi_A^T y_s$:
$$\hat F_s(D) = \max_{A \subseteq D,\, |A| \le k} \sum_{v \in A} w_{s,v}, \quad \text{and} \quad \hat F(D) = \frac{1}{m} \sum_s \hat F_s(D), \qquad (9)$$
where $w_{s,v} = \langle \phi_v, y_s \rangle^2$ for each $y_s \in \mathbb{R}^d$ and $v \in V$. We call $\hat F$ a modular approximation of $F$, as it relies on the approximate modularity of the variance reduction $R_s$. Note that, in contrast to (3), $\hat F_s(D)$ in (9) can be exactly evaluated by a greedy algorithm that simply picks the $k$ largest weights $w_{s,v}$. Moreover, the weights must be calculated only once during algorithm execution, thereby significantly increasing its efficiency. The following immediate corollary to Theorem 1 summarizes the essential properties of $\hat F$:

Corollary 1. Suppose $V$ is incoherent with constant $\mu$. Then, for any $D \subseteq V$, we have $|F(D) - \hat F(D)| \le k\mu$. Furthermore, $\hat F$ is monotonic and submodular.

Corollary 1 shows that $\hat F$ is a close approximation of the DiSP set function $F$. We exploit this modular approximation to motivate a new algorithm for DiSP and provide better performance bounds in Section 4.

4. Sparsifying dictionary selection

In this section, we describe two sparsifying dictionary selection (SDS) algorithms with theoretical performance guarantees: SDS_OMP and SDS_MA. Both algorithms make locally greedy choices to handle the combinatorial challenges C1 and C2, defined in Section 2. The algorithms differ only in the way they address C1, which we further describe below. Both algorithms tackle C2 by the same greedy scheme in (6); that is, both algorithms start with the empty set and greedily add dictionary columns to solve DiSP. Interestingly, while SDS_MA has better theoretical guarantees and is much faster than SDS_OMP, Section 6 empirically shows that SDS_OMP often performs better.

SDS_OMP: SDS_OMP employs orthogonal matching pursuit (OMP) (Gilbert & Tropp, 2005) to approximately solve the sparse representation problem in (2) and has the following theoretical guarantee:

Theorem 2. SDS_OMP uses the scheme in (6) to build a dictionary $D_{OMP}$ one column at a time such that
$$F(D_{OMP}) \ge (1 - 1/e) \max_{|D| \le n} F(D) - k(6n + 2 - 1/e)\mu.$$

Before we prove Theorem 2, we state the following result, whose proof directly follows from Theorem 1.

Proposition 1. At each iteration, SDS_OMP approximates $F$ with a value $F_{OMP}$ such that $|F_{OMP}(D) - F(D)| \le k\mu$ over all dictionaries $D$.

Proof [Theorem 2]. From Theorem 1 and Proposition 1, we can see that $F_{OMP}$ is $6k\mu$-approximately submodular. Thus, according to Krause et al. (2008),
$$F_{OMP}(D_{OMP}) \ge (1 - 1/e) \max_{|D| \le n} F_{OMP}(D) - 6kn\mu. \qquad (10)$$
Using Proposition 1, we substitute $F(D_{OMP}) + k\mu \ge F_{OMP}(D_{OMP})$ and $\max_{|D| \le n} F_{OMP}(D) \ge \max_{|D| \le n} F(D) - k\mu$ into (10) to prove the claim.
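The paper does not give pseudocode, so the following is our own minimal sketch of the SDS_OMP idea under the setup above: the outer loop is the greedy scheme (6) over candidate columns, and each objective evaluation approximates the inner problem (2) with orthogonal matching pursuit. All names and the toy data are ours, and the code is written for clarity rather than efficiency.

```python
import numpy as np

def omp_error(y, Phi_D, k):
    """Approximate min_{|A|<=k} ||y - Phi_A w||^2 over the columns of Phi_D via OMP."""
    residual, support = y.copy(), []
    for _ in range(min(k, Phi_D.shape[1])):
        scores = np.abs(Phi_D.T @ residual)
        scores[support] = -np.inf                      # do not reselect atoms
        support.append(int(np.argmax(scores)))
        Phi_A = Phi_D[:, support]
        w, *_ = np.linalg.lstsq(Phi_A, y, rcond=None)  # orthogonal re-projection
        residual = y - Phi_A @ w
    return float(residual @ residual)

def objective_omp(Y, Phi, D, k):
    """OMP-based surrogate F_OMP(D): average variance reduction over the signals."""
    if not D:
        return 0.0
    Phi_D = Phi[:, D]
    return float(np.mean([float(Y[:, s] @ Y[:, s]) - omp_error(Y[:, s], Phi_D, k)
                          for s in range(Y.shape[1])]))

def sds_omp(Y, Phi, n, k):
    """Greedy scheme (6): grow the dictionary one candidate column at a time."""
    D = []
    for _ in range(n):
        gains = [(objective_omp(Y, Phi, D + [v], k), v)
                 for v in range(Phi.shape[1]) if v not in D]
        _, best = max(gains)
        D.append(best)
    return D

# Toy run: select n = 6 columns out of N = 20 candidates for m = 30 signals in R^16.
rng = np.random.default_rng(1)
Phi = rng.standard_normal((16, 20)); Phi /= np.linalg.norm(Phi, axis=0)
Y = Phi[:, rng.choice(20, 4, replace=False)] @ rng.standard_normal((4, 30))
Y /= np.maximum(1.0, np.linalg.norm(Y, axis=0))
print(sds_omp(Y, Phi, n=6, k=3))
```

In this form every outer iteration re-runs OMP for each remaining candidate column and each signal, which is one reason the modular variant described next is attractive.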
SDS_MA: SDS_MA greedily (according to (6)) optimizes the modular approximation (MA) $\hat F$ of the DiSP objective $F$ and has the following guarantee:

Theorem 3. SDS_MA builds a dictionary $D_{MA}$ such that
$$F(D_{MA}) \ge (1 - 1/e) \max_{|D| \le n} F(D) - (2 - 1/e)k\mu. \qquad (11)$$

Corollary 1 and Theorem 2 directly imply Theorem 3. In most realistic settings with high-dimensional signals and incoherent dictionaries, the term $(2 - 1/e)k\mu$ in the approximation guarantee (11) of SDS_MA is negligible.
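Again as a sketch with our own naming (the paper lists no code), SDS_MA computes the weights $w_{s,v} = \langle \phi_v, y_s \rangle^2$ once; the modular surrogate $\hat F(D)$ of (9) is then evaluated for any candidate dictionary by summing, per signal, its $k$ largest weights among the selected columns, and the outer loop is the same greedy scheme (6).

```python
import numpy as np

def modular_objective(W, D, k):
    """hat F(D) of eq. (9): per signal, sum the k largest weights w_{s,v} with v in D."""
    W_D = W[:, D]                                  # shape (m, |D|)
    top_k = np.sort(W_D, axis=1)[:, -k:]           # k largest weights per signal
    return float(np.mean(np.sum(top_k, axis=1)))

def sds_ma(Y, Phi, n, k):
    """Greedy scheme (6) applied to the modular approximation hat F."""
    W = (Phi.T @ Y).T ** 2                         # w_{s,v} = <phi_v, y_s>^2, computed once
    D = []
    for _ in range(n):
        gains = [(modular_objective(W, D + [v], k), v)
                 for v in range(Phi.shape[1]) if v not in D]
        _, best = max(gains)
        D.append(best)
    return D

# Toy run on random unit-norm candidates and signals.
rng = np.random.default_rng(1)
Phi = rng.standard_normal((16, 20)); Phi /= np.linalg.norm(Phi, axis=0)
Y = rng.standard_normal((16, 30)); Y /= np.maximum(1.0, np.linalg.norm(Y, axis=0))
print(sds_ma(Y, Phi, n=6, k=3))
```

The efficiency gain promised in the text is visible here: the inner sparse-approximation problem reduces to a top-$k$ selection over precomputed weights.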
5. Sparsifying dictionary selection for block sparse representation

Structured sparsity: While many man-made and natural signals can be described as sparse in simple terms, their sparse coefficients often have an underlying, problem-dependent order. For instance, modern image compression algorithms, such as JPEG, not only exploit the fact that most of the DCT coefficients of a natural image are small. Rather, they also exploit the fact that the large coefficients have a particular structure characteristic of images containing edges. Coding this structure using an appropriate model enables transform coding algorithms to compress images close to the maximum amount possible, and significantly better than a naive coder that just assigns bits to each large coefficient independently (Mallat, 1999).

We can enforce structured sparsity for sparse coefficients over the learned dictionaries in DiSP, corresponding to a restricted union-of-subspaces (RUS) sparse model, by imposing the constraint that the feasible sparsity patterns are a strict subset of all $k$-dimensional subspaces (Baraniuk et al., 2008). To facilitate such RUS sparse models in DiSP, we must not only determine the constituent dictionary columns, but also their arrangement within the dictionary. While analyzing the RUS model in general is challenging, we describe below a special RUS model of broad interest to explain the general ideas.

Block-sparsity: Block-sparsity is abundant in many applications. In sensor networks, multiple sensors simultaneously observe a sparse signal over a noisy channel. While recovering the sparse signal jointly from the sensors, we can use the fact that the support of the significant coefficients of the signal is common across all the sensors. In DNA microarray applications, specific combinations of genes are also known a priori to cluster over tree structures, called dendrograms. In computational neuroscience problems, decoding of natural images in the primary visual cortex (V1) and the statistical behavior of neurons in the retina exhibit clustered sparse responses.

To address block-sparsity in DiSP, we replace (3) by
$$F_i(D) = \sum_{s \in B_i} L_s(\emptyset) - \min_{A \subseteq D,\, |A| \le k} \sum_{s \in B_i} L_s(A), \qquad (12)$$
where $B_i$ is the $i$-th block of signals (e.g., simultaneous recordings by multiple sensors) that must share the same sparsity pattern. Accordingly, we redefine $F(D) = \sum_i F_i(D)$ as the sum across blocks, rather than individual signals, as Section 6 further elaborates. This change preserves (approximate) submodularity.
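Under the modular approximation of Section 3, one way to realize this block-sparse variant (called SDS_MAB in the experiments; the reduction below is our own interpretation, not spelled out in the paper) is to pool the weights within each block before the top-$k$ selection, so that all signals of a block share one support of size $k$; the greedy outer loop of the previous sketch stays unchanged.

```python
import numpy as np

def block_modular_objective(W, blocks, D, k):
    """Sum of hat F_i(D) over blocks, eq. (12) under the modular approximation:
    each block B_i shares a single support of size k, so its weights are pooled first."""
    total = 0.0
    for B in blocks:                            # blocks: list of lists of signal indices
        pooled = W[np.ix_(B, D)].sum(axis=0)    # summed weights per selected column
        total += float(np.sort(pooled)[-k:].sum())   # best shared k-subset of D
    return total

# Example: 40 signals in 4 blocks of 10, weights for N = 20 candidate columns.
W = np.random.default_rng(3).random((40, 20))
blocks = [list(range(i, i + 10)) for i in range(0, 40, 10)]
print(block_modular_objective(W, blocks, D=[0, 3, 7, 12], k=2))
```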
6. Experiments

Finding a dictionary in a haystack: To understand how the theoretical performance reflects on the actual performance of the proposed algorithms, we first perform experiments on synthetic data. We generate a collection $V_U$ with 400 columns by forming a union of six orthonormal bases with $d = 64$, including the discrete cosine transform (DCT), different wavelet bases (Haar, Daub4, Coiflets), noiselets, and the Gabor frame. This collection $V_U$ is not incoherent; in fact, the various bases contain perfectly coherent columns. As alternatives, we first create a separate collection $V_S$ from $V_U$, where we greedily removed columns based on their incoherence until the remaining collection had incoherence $\mu_S = 0.5$. The resulting collection contains 245 columns. We also create a collection $V_R$ with 150 random columns of $V_U$, which results in $\mu_R = 0.23$.

For each of $V_U$, $V_S$, and $V_R$, we repeatedly (50 trials) pick at random a dictionary $D^{\ast} \subseteq V$ of size $n = 64$ and generate a collection of $m = 100$ random 5-sparse signals with respect to the dictionary $D^{\ast}$. Our goal is to recover the true dictionary $D^{\ast}$ using our SDS algorithms. For each random trial, we run SDS_OMP and SDS_MA to select a dictionary $D$ of size 64. We then look at the overlap $|D \cap D^{\ast}|$ to measure the performance of selecting the "hidden" basis $D^{\ast}$. We also report the fraction of remaining variance after sparse reconstruction.

Figures 2(a), 2(b), and 2(c) compare SDS_OMP and SDS_MA in terms of their variance reduction as a function of the number of selected columns. Interestingly, in all 50 trials, SDS_OMP perfectly reconstructs the hidden basis $D^{\ast}$ when selecting 64 columns for $V_S$ and $V_R$. SDS_MA performs slightly worse than SDS_OMP. Figures 2(e), 2(f), and 2(g) compare the performance in terms of the fraction of incorrectly selected basis functions. Note that, as can be expected, in the case of the perfectly coherent $V_U$, even SDS_OMP does not achieve perfect recovery. However, even with high coherence, $\mu = 0.5$ for $V_S$, SDS_OMP exactly identifies $D^{\ast}$. SDS_MA performs slightly worse but nevertheless correctly identifies a high fraction of $D^{\ast}$. In addition to exact sparse signals, we also generate compressible signals, where the coefficients follow a power law with decay rate 2. These signals can be well-approximated as sparse; however, the residual error in sparse representation creates discrepancies in the measurements, which can be modeled as noise in DiSP. Figures 2(d) and 2(h) repeat the above experiments for $V_S$; both SDS_OMP and SDS_MA perform quite well.

Figure 2. Results of 50 trials: (a-c) Variance reduction achieved by SDS_OMP and SDS_MA on the collections $V_U$, $V_S$, $V_R$ for 5-sparse signals in 64 dimensions. (e-g) Percentage of incorrectly selected columns on the same collections. (d) Variance reduction for compressible signals in 64 dimensions for $V_S$. (h) Corresponding column selection performance. (i) SDS_MA is orders of magnitude faster than SDS_OMP over a broad range of dimensions. (j) As incoherence decreases, the algorithms' effectiveness in variance reduction improves. (k) The variance reduction performance of SDS_MA improves with the number of training samples. (l) Exploiting block-sparse structure in signals leads to improved dictionary selection performance.

Figure 2(i) compares SDS_OMP and SDS_MA in running time. As we increase the dimensionality of the problem, SDS_MA is several orders of magnitude faster than SDS_OMP in our MATLAB implementation. Figure 2(j) illustrates the performance of the algorithms as a function of the incoherence. As predicted by Theorems 2 and 3, lower incoherence $\mu$ leads to improved performance of the algorithms. Lastly, Figure 2(k) compares the residual variance as a function of the training set size (number of signals). Surprisingly, as the number of signals increases, the performance of SDS_MA improves and even exceeds that of SDS_OMP.

We also test the extension of SDS_MA to block-sparse signals as discussed in Section 5. We generate 200 random signals, comprising 10 blocks of 20 signals each, where the signals within a block share a fixed sparsity pattern. We then compare the standard SDS_MA algorithm with the block-sparse variant SDS_MAB described in Section 5 in terms of their basis identification performance (see Figure 2(l)). SDS_MAB drastically outperforms SDS_MA, and even outperforms the SDS_OMP algorithm, which is computationally far more expensive. Hence, exploiting prior knowledge of the problem structure can significantly aid dictionary selection.

A battle of bases on image patches: In this experiment, we try to find the optimal dictionary among an existing set of bases to represent natural images. Since the conventional dictionary learning approaches cannot be applied to this problem, we only present the results of SDS_OMP and SDS_MA. We sample image patches from natural images, and apply our SDS_OMP and SDS_MA algorithms to select dictionaries from the collection $V_U$, as defined above. Figures 3(a) (for SDS_OMP) and 3(b) (for SDS_MA) show the fractions of selected columns allocated to the different bases constituting $V_U$ for 4000 image patches of size 8 x 8. We restrict the maximum number of dictionary coefficients $k$ for sparse representation to 10% ($k = 6$).

Figure 3. Experiments on natural image patches: (a, b) fractions of bases selected by SDS_OMP and SDS_MA with $d = 64$; (c) the corresponding variance reduction on test patches; (d) fractions of bases selected by SDS_MA with $d = 1024$.

We then observe the following surprising results. While wavelets are considered to be an improvement over the DCT basis for compressing natural images (JPEG2000 vs. JPEG), SDS_OMP prefers DCT over wavelets for sparse representation; the cross-validation results show that the learned combination of DCT (global) and Gabor functions (local) is better than wavelets (multiscale) in variance reduction (compression).
In particular, Fig. 3(c) demonstrates the performance of the learned dictionary against the various bases that comprise $V_U$ on a held-out test set of 500 additional image patches. The variance reduction of the dictionary learned by SDS_OMP is 8% lower than the variance reduction achieved by the best basis, which, in this case, is DCT. Moreover, SDS_MA, which trades off representation accuracy with efficient computation, overwhelmingly prefers Gabor functions, which are used to model neuronal coding of natural images. The overall dictionary constituency varies between SDS_OMP and SDS_MA; however, their variance reduction performances are comparable. Finally, Figure 3(d) presents the fractions of selected bases for 32 x 32 patches with $k = 102$, which matches well with the 8 x 8 DiSP problem above.

Dictionary selection from dimensionality reduced data: In this experiment, we focus on a specific image processing problem, inpainting, to motivate a dictionary selection problem from dimensionality reduced data. Suppose that instead of observing $Y$ as assumed in Section 2, we observe $\tilde Y = \{P_1 y_1, \dots, P_m y_m\}$ with $P_i y_i \in \mathbb{R}^{b_i}$, where $P_i \in \mathbb{R}^{b_i \times d}$ are known linear projection matrices. In the inpainting setting, the $P_i$'s are binary matrices which pass or delete pixels. From a theoretical perspective, dictionary selection from dimensionality reduced data is ill-posed. For the purposes of this demonstration, we will assume that the $P_i$'s are information preserving.

As opposed to observing a series of signal vectors, we start with a single image in Fig. 4, albeit missing 50% of its pixels. We break the noisy image into non-overlapping 8 x 8 patches, and train a dictionary for sparse reconstruction of those patches to minimize the average approximation error on the observed pixels. As candidate bases, we use DCT, wavelets (Haar and Daub4), Coiflets (1 and 3), and Gabor. We test our SDS_OMP and SDS_MA algorithms, approaches based on total variation (TV), linear interpolation, nonlocal TV, and the nonparametric Bayesian dictionary learning (based on Indian buffet processes) algorithms (Zhang & Chan, 2009; Mairal et al., 2008; Zhou et al., 2009). The TV and nonlocal TV algorithms use the linear interpolation result as their initial estimate. We set $k = 6$ (10%).

Figure 4 illustrates the inpainting results for each algorithm, sorted in increasing peak signal-to-noise ratio (PSNR). We do not report the reconstruction results using individual candidate bases, since they are significantly worse than the baseline linear interpolation. The test image exhibits significant self-similarities, restricting the degrees of freedom of the sparse coefficients. Hence, for our modular and OMP-based greedy algorithms, we ask the algorithms to select $64 \times 32$ dimensional dictionaries. While the modular algorithm SDS_MA selects the desired dimensions, the OMP-based greedy algorithm SDS_OMP terminates when the dictionary dimensions reach $64 \times 19$. Given the selected dictionaries, we determine the sparse coefficients that best explain the observed pixels in a given patch and reconstruct the full patch using the same coefficients.
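For concreteness, the per-patch reconstruction step just described can be sketched as follows (our own simplified version; patch extraction, the median aggregation over overlapping patches, and the dictionary selection itself are omitted): the sparse coefficients are fit on the observed pixels only, using the masked rows of the selected dictionary, and the full patch is then synthesized from those coefficients.

```python
import numpy as np

def inpaint_patch(patch, mask, Phi_D, k):
    """Fit k-sparse coefficients on observed pixels (mask == True), synthesize all pixels."""
    y_obs, Phi_obs = patch[mask], Phi_D[mask, :]    # keep only observed rows
    support, residual = [], y_obs.copy()
    for _ in range(k):                              # OMP on the masked dictionary
        scores = np.abs(Phi_obs.T @ residual)
        scores[support] = -np.inf
        support.append(int(np.argmax(scores)))
        w, *_ = np.linalg.lstsq(Phi_obs[:, support], y_obs, rcond=None)
        residual = y_obs - Phi_obs[:, support] @ w
    full = Phi_D[:, support] @ w                    # same coefficients, all rows
    return np.where(mask, patch, full)              # keep observed pixels as-is

# Toy 8x8 patch (flattened) with ~50% of pixels missing, over a selected dictionary Phi_D.
rng = np.random.default_rng(2)
Phi_D = rng.standard_normal((64, 32)); Phi_D /= np.linalg.norm(Phi_D, axis=0)
patch = Phi_D[:, :3] @ rng.standard_normal(3)       # a 3-sparse ground-truth patch
mask = rng.random(64) < 0.5
print(np.linalg.norm(inpaint_patch(patch, mask, Phi_D, k=6) - patch))
```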
We repeat this process for all the patches in the image that differ by a single pixel. In our final reconstruction, we take the pixel-wise median of all the reconstructed patches. SDS_OMP performs on par with nonlocal TV while taking a fraction of its computational time. While the Bayesian approach takes significantly more time (a few orders of magnitude slower), it best exploits the self-similarities in the observed image and yields the best reconstruction.

Figure 4. Comparison of inpainting algorithms: Original, Incomplete, TV (27.55dB), Lin. Interp. (27.58dB), SDS_MA (30.32dB), SDS_OMP (31.91dB), Nonlocal TV (32.86dB), Bayesian (35.17dB).

7. Conclusions

Over the last decade, a great deal of research has revolved around recovering, processing, and coding sparse signals. To leverage this experience in new problems, many researchers are now interested in automatically determining data-sparsifying dictionaries for their applications. We discussed two alternatives that focus on this problem: dictionary design and dictionary learning. In this paper, we developed a combinatorial theory for dictionary selection that bridges the gap between the two approaches. We explored new connections between the combinatorial structure of submodularity and the geometric concept of incoherence. We presented two computationally efficient algorithms, SDS_OMP based on the OMP algorithm, and SDS_MA using a modular approximation. By exploiting the approximate submodularity of the DiSP objective, we derived theoretical approximation guarantees for the performance of our algorithms. We also demonstrated the ability of our learning framework to incorporate structured sparse representations in dictionary learning. Compared to the dictionary design approaches, our approach is data adaptive and has better empirical performance on data sets. Compared to the continuous nature of the dictionary learning approaches, our approach is discrete and provides new theoretical insights into the dictionary learning problem. We believe that our results pave a promising direction for further research exploiting combinatorial optimization for sparse representations, in particular submodularity.

Acknowledgements. This research is supported by ONR grants N00014-09-1-1044 and N00014-08-1-1112, NSF grant CNS-0932392, AFOSR FA9550-07-1-0301, ARO W911NF-09-1-0383, DARPA N66001-08-1-2065, ARO MURI W911NF-09-1-0383.

References

Aharon, M., Elad, M., and Bruckstein, A. The K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representation. IEEE Trans. on Signal Processing, 54(11):4311-4322, 2006.

Baraniuk, R. G., Cevher, V., Duarte, M. F., and Hegde, C. Model-based compressive sensing. IEEE Trans. on Inform. Theory, 56(11):1982-2001, 2010.

Cevher, V. Learning with compressible priors. In NIPS, Vancouver, B.C., Canada, 2008.

Choi, H. and Baraniuk, R. G. Wavelet statistical models and Besov spaces. Lecture Notes in Statistics, 2003.

Davis, G., Mallat, S., and Avellaneda, M. Greedy adaptive approximation. Constructive Approximation, 13:57-98, 1997.

Gilbert, A. C. and Tropp, J. A. Signal recovery from random measurements via orthogonal matching pursuit. Technical report, University of Michigan, 2005.

Gribonval, R. and Nielsen, M. Sparse decompositions in "incoherent" dictionaries. In ICIP, 2002.

Krause, A., Singh, A., and Guestrin, C. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 9, 2008.
Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online dictionary learning for sparse coding. In ICML, 2008.

Mallat, S. G. A Wavelet Tour of Signal Processing. Academic Press, 1999.

Nemhauser, G., Wolsey, L., and Fisher, M. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14:265-294, 1978.

Olshausen, B. A. and Field, D. J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609, 1996.

Zhang, X. and Chan, T. F. Wavelet inpainting by nonlocal total variation. CAM Report 09-64, 2009.

Zhou, M., Chen, H., Paisley, J., Ren, L., Sapiro, G., and Carin, L. Non-parametric Bayesian dictionary learning for sparse image representations. In NIPS, 2009.