Compressed Sensing and Bayesian Experimental Design

Matthias W. Seeger  seeger@tuebingen.mpg.de
Hannes Nickisch  hn@tuebingen.mpg.de
Max Planck Institute for Biological Cybernetics, Spemannstr. 38, Tübingen, Germany

Abstract

We relate compressed sensing (CS) with Bayesian experimental design and provide a novel efficient approximate method for the latter, based on expectation propagation. In a large comparative study about linearly measuring natural images, we show that the simple standard heuristic of measuring wavelet coefficients top-down systematically outperforms CS methods using random measurements; the sequential projection optimisation approach of (Ji & Carin, 2007) performs even worse. We also show that our own approximate Bayesian method is able to learn measurement filters on full images efficiently which outperform the wavelet heuristic. To our knowledge, ours is the first successful attempt at "learning compressed sensing" for images of realistic size. In contrast to common CS methods, our framework is not restricted to sparse signals, but can readily be applied to other notions of signal complexity or noise models. We give concrete ideas how our method can be scaled up to large signal representations.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

1. Introduction

There has been a lot of recent interest in the area of compressed sensing (CS) (Candès et al., 2006; Donoho, 2006), where it is argued that if signals can be expected to be compressible due to sparseness after some linear transform, then they can be reconstructed from a number of measurements significantly below the Nyquist/Shannon limit, provided the measurement design is not too regular. In this paper, we relate CS to the more general notion of statistical (Bayesian) experimental design. Through this view, characteristics of signals and algorithms, defined in an abstract mathematical way in the CS literature so far, become understandable and workable. The experimental design approach applies to signals of low complexity in general, not only to sparse ones. It has the potential to clearly outperform the randomised designs favoured by theoretical CS arguments in cases where signals are not well described by common CS assumptions. For example, CS has been viewed with some scepticism so far by researchers in computer vision and image statistics (Weiss et al., 2007). While images exhibit transform sparsity to some degree, purely random measurement designs can be suboptimal for them. The reason is that there is more to low-level image statistics than sparsity. Much of this knowledge can be modeled tractably (Simoncelli, 1999) and could therefore be incorporated into a Bayesian experimental design architecture. To our knowledge, the current CS reconstruction schemes are purely estimation-based and lack proper representations of uncertainty (which is what fundamentally drives experimental design), and the theory deals exclusively with signals which are unstructured except for random sparsity.

We present experimental results shedding more light on the relationship between CS and images. Similar to (Weiss et al., 2007), we find that standard approaches to linear image measurement (wavelet coefficients) give significantly better reconstruction results than using the random measurements favoured by CS, even if modern CS reconstruction algorithms are applied. Yet, our experimental evidence is more substantial than theirs.
Beyond that, we show that our efficient approximation to sequential Bayesian design can be used to learn measurements which indeed outperform measuring wavelet coefficients top-down. Our method provides a practically efficient solution to the problem posed in (Weiss et al., 2007), namely how to learn measurement filters automatically from data (using very little concrete knowledge about the signal class) which perform close to or even better than "standard" ones obtained through decades of research and experience. In contrast, the uncertain components analysis algorithm suggested by them requires a large database of image patches to be run, and could hardly be scaled up to the realistic dimensions treated here (their experiments are on 4 × 4 image patches, while ours run efficiently on 64 × 64 images).

An approximate Bayesian approach to compressed sensing has been presented in (Ji & Carin, 2007), making use of sparse Bayesian learning (SBL) (Tipping, 2001). Our method is based on a different, more general inference approximation, expectation propagation (Minka, 2001), and outperforms theirs very significantly, for prediction based on the same design and, even more so, for sequential design optimisation, as we show in comparative experiments below. Moreover, strongly underdetermined problems (many more variables than observations) are dealt with more efficiently in our framework. In addition, our framework is generalised easily to non-Gaussian observation likelihoods, skew prior terms, and generalised linear models (Gerwinn et al., 2008), and our methodology, our comparisons, as well as our discussion here have a broader scope. Our method is an extension of the scheme in (Seeger et al., 2007). However, the applications to images considered here are orders of magnitude larger than theirs, and several novel ideas are proposed here in order to increase computational efficiency substantially. While much work has been done in statistics on experimental design for the classical Gaussian-linear model, Gaussian priors are entirely inappropriate for images (reconstruction under the Gaussian-linear model is simply the method of least squares, often referred to as "linear reconstruction"; much of the improved performance through CS is due to the use of non-linear sparse reconstruction techniques), and designs optimized for them are suboptimal (see also (Seeger et al., 2007)). We are not aware of existing methods for the model used here which scale comparably to ours, with the exception of (Ji & Carin, 2007). A different approach to optimising the measurement design is given in (Elad, 2007), where X is designed a priori with the aim of making its rows maximally decoherent. In our setup, X is designed sequentially, using Bayesian information criteria.

The structure of the paper is as follows: the experimental design view on CS is detailed in Section 2. Our framework for approximate inference is described in Section 3, where we also show how to apply it to large problems, especially in sequential experimental design. Our approach is validated through a series of experiments, comparing it to (Ji & Carin, 2007) and common CS methods on artificial data (Section 4.1), and analysing the suitability of CS and Bayesian experimental design on natural images (Section 4.2).

2. Compressed Sensing and Experimental Design

Compressed sensing (CS) (Candès et al., 2006; Donoho, 2006) can be motivated as follows.
Suppose a signal, such as an image or a sound waveform, is measured and then transferred over some channel or stored. Traditionally, the measurement obeys the Nyquist/Shannon theorem, allowing for an exact reconstruction of the (band-limited) signal if there is no measurement noise. However, what follows is usually some form of lossy compression, exploiting redundancies and the non-perceptibility of losses. Given that, can the information needed for a satisfactory reconstruction not be measured below the Nyquist limit (this is called undersampling)? In many key applications today, the measurement itself is the main bottleneck for cost reductions or higher temporal/spatial resolution. Recent theoretical results indicate that undersampling should work well if randomised designs are used, and if the signal reconstruction method specifically takes the "compressibility" into account.

Bayesian experimental design encompasses the CS problem. Here, the "compressibility" of signals is encoded in a prior distribution, under which signals of low complexity in general, or of high (transform) sparsity in particular, have most mass. While undersampling violates the Nyquist theorem, signals can often still be reconstructed if they are sufficiently likely under the prior. But not every way of undersampling will do. Experimental design is concerned with optimising the measurement structure (called the design), so as to obtain the desired information at the lowest possible cost. This is easily explained by considering the model of interest here.

Let u \in R^n be latent variables (the pixels of an image), and let y \in R^m be noisy measurements thereof. The model class of interest is

    P(u \mid y) \propto N(y \mid Xu, \sigma^2 I) \prod_{i=1}^{q} t_i(s_i), \quad s = Bu.    (1)

The likelihood P(y | u) is Gaussian and underdetermined (n > m). The prior is a product of univariate non-Gaussian potentials t_i(s_i) (we do not require that the prior potential is actually a normalisable distribution over u; the models of interest here are of the undirected Markov random field, or "energy-based", type). It is computationally advantageous, yet not essential, that the log t_i be concave (Seeger, 2008), and in this paper we use Laplacian potentials

    t_i(s_i) = \frac{\tau}{2} e^{-\tau |s_i|},    (2)

which are of this sort. If the number of image pixels n is large, it is important for computational efficiency that matrix-vector multiplications (MVMs) with B and B^T (less important: with X, X^T) can be done efficiently, and that B does not have to be stored explicitly.

The unknown signal u (an image for now) should be "compressible", i.e. it should exhibit transform sparsity (in our experiments, we use an extended notion of sparsity; see Section 3.2): after some fixed linear mapping B, such as a wavelet transform, s = Bu has many coefficients s_i close to zero. An image coder would set these to exactly zero, thereby compressing the image. "Expected transform sparsity" is encoded in a sparsity prior, in our case the product of Laplacians (2). As opposed to a Gaussian, a Laplace distribution concentrates more mass close to zero, forcing coefficients to be very small. On the other hand, the Laplacian also has more mass in the tails, which allows for occasional large values. These points are explained further in (Seeger, 2008; Tipping, 2001).

Next, the design is X, the measurement matrix. In our example, each row of X is a linear filter specifying a single image measurement. In this paper, we assume that all rows of X have unit norm (when designing X, it is important to keep its rows on the same scale; otherwise, a measurement can always be improved, at fixed noise level σ², simply by increasing its norm; put differently, we place a prior on X which is uniform over all matrices with rows of unit norm). The problem of experimental design is how to choose X among many candidates of the same cost, so that subsequent measurements allow for the best reconstruction of u. This decision has to be taken without doing real measurements for most candidates. In a Bayesian variant, the posterior distribution P(u | y) encodes all present knowledge. To score a candidate X_* (new rows of X), assume for the moment that the outcome y_* is known. We can measure the decrease in uncertainty from P(u | y) to P(u | y, y_*) by the entropy difference H[P(u | y)] - H[P(u | y, y_*)]. Not knowing y_*, we integrate it out using P(y_* | y) = \int P(y_* | u) P(u | y) du. This expected information score drives the optimisation of the design. It is clear that such scores are fundamentally based on the posterior as a representation of uncertainty, so algorithms which merely estimate good solutions from given data cannot be used directly in order to compute them (it is one thing to learn to predict well, yet a different issue to estimate one's own uncertainty well, and methods employing "premature sparsification" often perform badly with respect to the latter; see Section 4.1). With such methods, either rough rules of thumb have to be followed to obtain a design ("make it random" in CS), or many measurements have to be taken in a trial-and-error fashion.
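To make model (1) with the Laplace potentials (2) concrete, the following is a minimal NumPy sketch of the unnormalised negative log posterior; it is an illustration under our own naming, not the authors' code, uses dense X and B purely for brevity, and borrows the σ² = 0.005, τ = 5 values of the synthetic experiments.

```python
import numpy as np

def neg_log_posterior(u, X, y, B, sigma2, tau):
    """Unnormalised -log P(u|y) for model (1) with Laplace potentials (2):
    0.5/sigma2 * ||y - X u||^2 + tau * ||B u||_1 (additive constants dropped)."""
    r = y - X @ u                     # measurement residual
    s = B @ u                         # transform coefficients s = B u
    return 0.5 * (r @ r) / sigma2 + tau * np.abs(s).sum()

# Toy usage: n = 16 "pixels", m = 4 unit-norm measurement filters, B = I.
rng = np.random.default_rng(0)
n, m, sigma2, tau = 16, 4, 0.005, 5.0
X = rng.standard_normal((m, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # rows of unit norm
u_true = np.zeros(n); u_true[:3] = rng.standard_normal(3)
y = X @ u_true + np.sqrt(sigma2) * rng.standard_normal(m)
print(neg_log_posterior(u_true, X, y, np.eye(n), sigma2, tau))
```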
In Bayesian experimental design, a permanently refined uncertainty representation is used to avoid uninformative data sampling, so often many fewer real measurements are required.

3. Approximate Inference

Bayesian inference is in general not analytically tractable for models of the form (1), and has to be approximated. Moreover, the applications of interest here demand high efficiency in many dimensions (n = 4096 in the natural image experiments here). Importantly, Bayesian experimental design does not require inference just once, but many times in a sequential fashion. We make use of the expectation propagation (EP) method (Minka, 2001), together with a robust and efficient representation for Q(u) \approx P(u | y). Our framework has previously been used in a different context (Seeger, 2008), where details omitted here can be found. As a novelty, we will show here how the framework can be run efficiently for large n, and how sequential design optimisation can be done orders of magnitude faster.

In EP, the posterior P(u | y) is approximated by a Gaussian Q(u) with free (variational) parameters b, π, which are formally introduced by replacing t_i(s_i) by \tilde{t}_i(s_i) = e^{b_i s_i - \pi_i s_i^2 / 2} in (1). The distribution Q(u) is represented by a lower triangular L and γ,

    L L^T = \sigma^{-2} X^T X + B^T \Pi B = \mathrm{Cov}_Q[u]^{-1}, \quad \gamma = L^{-1}(\sigma^{-2} X^T y + B^T b), \quad \Pi = \mathrm{diag}\,\pi,

so that E_Q[u] = L^{-T} γ. The (b_i, π_i) are then updated sequentially by matching the Gaussian moments of the tilted distributions

    \hat{P}_i(u) \propto N(y \mid Xu, \sigma^2 I) \prod_{j \neq i} \tilde{t}_j(s_j)\, \tilde{t}_i(s_i)^{1-\eta}\, t_i(s_i)^{\eta}

with those of the new Q'(u). Here, η \in (0, 1] is a fractional parameter (η = 1 gives standard EP, but choosing η < 1 can increase the robustness of the algorithm on the sparse linear model significantly (Seeger, 2008); we use η = 0.9 in all our experiments). In each local update, we need to compute the non-Gaussian moments of the marginal \hat{P}_i(s_i), and to update the Q(u) representation, which is done by an O(n²) Cholesky update of L.
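As a minimal sketch of this representation (our own helper names, dense matrices, no rank-one updating), the following computes L, γ and E_Q[u] directly; the actual implementation would instead maintain L by Cholesky up- and downdates rather than refactorising after every site update.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def q_representation(X, y, B, b, pi, sigma2):
    """Representation of Q(u): L L^T = sigma^{-2} X^T X + B^T diag(pi) B,
    gamma = L^{-1} (sigma^{-2} X^T y + B^T b), and E_Q[u] = L^{-T} gamma."""
    A = X.T @ X / sigma2 + B.T @ (pi[:, None] * B)      # posterior precision
    L = cholesky(A, lower=True)
    gamma = solve_triangular(L, X.T @ y / sigma2 + B.T @ b, lower=True)
    mean = solve_triangular(L.T, gamma, lower=False)    # E_Q[u]
    return L, gamma, mean

# Tiny demo: n = 8, m = 3, B = I, trivially initialised site parameters.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8)); y = rng.standard_normal(3)
L, gamma, mean = q_representation(X, y, np.eye(8), np.zeros(8), np.ones(8), 0.005)
```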
Note that (Ji & Carin, 2007) employ the variational mean field approximation of (Tipping, 2001), which is specific to sparse linear models (more precisely, all t_i have to be Gaussian scale mixtures, thus even functions), while EP can be applied with little modification to models with skew priors or non-Gaussian likelihoods as well (Gerwinn et al., 2008).

In our applications of sequential design, we need to score the informativeness of new candidates x_* (as rows of X), which we do by the entropy difference (see Section 2). If Q' is the approximate posterior after including x_*, then 2 H[Q'] = log |\mathrm{Cov}_{Q'}[u]| + C, where Q' differs from Q in that (X')^T X' = X^T X + x_* x_*^T, and π' ≠ π. We approximate the entropy difference by assuming that π' = π, whence

    H[Q] - H[Q'] = \frac{1}{2} \log\left(1 + \sigma^{-2} x_*^T \mathrm{Cov}_Q[u]\, x_*\right).

Since ||x_*|| = 1 by assumption, this score is maximised by choosing x_* along the principal (leading) eigendirection of Cov_Q[u] (for large n, storing an n × n matrix in memory becomes prohibitive, and we compute x_* by the Lanczos algorithm). The same score is used by (Ji & Carin, 2007).

3.1. Large-Scale Applications

There are two major issues with trying to apply our method for large sizes n. First, the EP site updates are done in random sweeps over the n sites, because it is not clear which particular site ordering leads to fastest convergence. This problem is severe in our sequential design application to natural images, since there are many small changes to X, y (individual new measurements), after each of which EP convergence has to be regained. We approach it by forward-scoring many site candidates before each EP update, thereby always updating the one which gives the largest posterior change. This is detailed just below. Second, the robust Q representation of (Seeger, 2008) is of size O(n²), and each update costs O(n²). We sketch a different representation of size O(m²) below, which can be used to drive our framework as well. In contrast, (Ji & Carin, 2007) use a heuristic of setting many of the π_i to ∞ early in the iteration, which leads to much worse results than we obtain (see Section 4.1, Section 4.2).

Our selective updating scheme for EP hinges on the fact that we can maintain all site marginals h, ρ, with Q(s_i) = N(h_i, ρ_i), up to date at all times. For a site i, we can quantify the change of Q through an update there by D[Q'(s_i) || Q(s_i)] (Q' being the posterior after the update at i), which can be computed in O(1). Importantly, D[Q'(u) || Q(u)] = D[Q'(s_i) || Q(s_i)] (because Q(u | s_i) = Q'(u | s_i)), so the score precisely measures the global amount of change Q → Q'. We maintain a list of candidate sites, which are scored before each EP update, and the update is done for the winner only. The list is then evolved by replacing the lower half of worst-scoring sites by others drawn at random from {1, ..., q}. Importantly, the marginals h, ρ can be updated along with the representation, at the expense of only one additional backsubstitution with L and one MVM with B. Namely, if π_i' = π_i + Δπ_i and b_i' = b_i + Δb_i, then

    \rho' = \rho - \frac{\Delta\pi_i}{1 + \Delta\pi_i \rho_i} (w \circ w), \quad h' = h + \frac{\Delta b_i - h_i \Delta\pi_i}{1 + \Delta\pi_i \rho_i} w, \quad w := B L^{-T}(L^{-1} B_{i,\cdot}^T).

Here, L^{-1} B_{i,\cdot}^T has to be computed for the L update anyway. This idea is used in the experiments described in Section 4.2.
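The candidate score and the Lanczos-based choice of x_* above can be sketched as follows, assuming only a routine `cov_mvm` that returns Cov_Q[u] v (for instance via two triangular solves with L); the function names are ours, and SciPy's eigsh is used here as an off-the-shelf Lanczos eigensolver.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def candidate_score(x, cov_mvm, sigma2):
    """Entropy-difference score 0.5 * log(1 + sigma^{-2} x^T Cov_Q[u] x)
    for a unit-norm candidate measurement filter x."""
    return 0.5 * np.log1p((x @ cov_mvm(x)) / sigma2)

def best_candidate(cov_mvm, n):
    """Maximiser of the score over unit-norm x: the leading eigendirection
    of Cov_Q[u], obtained by Lanczos iterations (eigsh) using MVMs only."""
    op = LinearOperator((n, n), matvec=cov_mvm, dtype=float)
    _, vecs = eigsh(op, k=1, which='LA')
    return vecs[:, 0]

# Toy check with an explicit (symmetric positive definite) covariance matrix.
rng = np.random.default_rng(1)
n, sigma2 = 50, 0.005
M = rng.standard_normal((n, n)); cov = M @ M.T / n
x_star = best_candidate(lambda v: cov @ v, n)
print(candidate_score(x_star, lambda v: cov @ v, sigma2))
```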
In a less costly representation, we exploit m ≪ n. We require that B = I (more generally, B^T B must be easy to invert; if B is invertible and MVMs with B^{-1} are feasible, we represent Q(s) rather than Q(u)). The Woodbury formula gives

    \mathrm{Cov}_Q[u] = \Pi^{-1} - \Pi^{-1} X^T L^{-T} L^{-1} X \Pi^{-1}, \quad L L^T = \sigma^2 I + X \Pi^{-1} X^T,

so L (different from above) is of size m × m only. An EP update requires O(m²) and two MVMs with X, rather than O(n²) above. While this representation is exact, it is numerically less robust to update than the O(n²) one.

3.2. Image Model. Other Methods

In this section, we provide further details about the concrete model we use in our experiments with natural images. Our prior encourages two different notions of sparsity in an image. First, a multi-scale wavelet transform of u should be sparse, modeling the observation that natural images can be compressed well in a wavelet domain. Second, the finite differences in the horizontal and vertical direction should exhibit sparsity, accounting for the spatial smoothness often found in images (recall what we mean by sparsity from Section 2: most coefficients are forced to be small, by allowing some to be large; occasional large components in the gradient correspond to edges in the image). A frequently used penalty term for the latter is the L1 norm of the image gradient, also known as total variation. Our model is an instance of (1), where all t_i are Laplacian (2). s, and therefore B, decompose into two different parts: B^T = (B^{(sp)T}  B^{(tv)T}). Equivalently, the prior is the product of two potentials.

The transform sparsity potential is a sparsity prior on the wavelet coefficients of u. Note that the Laplace distribution is a sensible candidate to fit wavelet coefficient histograms from natural images (Simoncelli, 1999). Thus, B^{(sp)} \in R^{n×n} is a multi-scale orthonormal wavelet transform, and the potential is exp(-τ_sp ||B^{(sp)} u||_1). The total variation potential is a Laplace prior on the image gradient, i.e. the differences between horizontal and vertical pixel neighbours (this potential on its own is not normalisable as a distribution over u, being invariant against adding a constant to all pixels). B^{(tv)} \in R^{2(n-\sqrt{n})×n} is a sparse structured matrix, mapping the image u to its gradient. Here, we assume that n = 2^{2k} for simplicity. The total variation potential is exp(-τ_tv ||B^{(tv)} u||_1). Therefore, we have q ≈ 3n for the size of s. Also, the potentials come with different scale parameters τ_sp, τ_tv. Importantly, neither B^{(sp)} nor B^{(tv)} has to be stored in memory, and MVM with B or B^T can be done in O(n).
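To illustrate the O(n) MVMs with B for the total variation part (the wavelet part would analogously use a fast orthonormal wavelet transform), here is a sketch of B^{(tv)} and its transpose acting on a √n × √n image without ever forming the matrix; the function names and the row ordering of the difference operator are our own choices.

```python
import numpy as np

def Btv_mvm(u, w):
    """MVM with B^(tv): maps a w-by-w image (flattened, n = w*w) to its
    horizontal and vertical neighbour differences, in O(n) time."""
    img = u.reshape(w, w)
    dh = (img[:, 1:] - img[:, :-1]).ravel()   # horizontal differences
    dv = (img[1:, :] - img[:-1, :]).ravel()   # vertical differences
    return np.concatenate([dh, dv])

def Btv_transpose_mvm(s, w):
    """MVM with B^(tv)^T, the adjoint of the finite-difference operator."""
    n_h = w * (w - 1)
    dh = s[:n_h].reshape(w, w - 1)
    dv = s[n_h:].reshape(w - 1, w)
    out = np.zeros((w, w))
    out[:, 1:] += dh; out[:, :-1] -= dh       # adjoint of horizontal differences
    out[1:, :] += dv; out[:-1, :] -= dv       # adjoint of vertical differences
    return out.ravel()

# Quick adjointness check <B u, s> = <u, B^T s> on a tiny image.
w = 4
u = np.random.randn(w * w); s = np.random.randn(2 * w * (w - 1))
assert np.isclose(Btv_mvm(u, w) @ s, u @ Btv_transpose_mvm(s, w))
```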
We also briefly describe the methods we compare against. Most of them come with a transform sparsity potential only, so that s = B^{(sp)} u. The method of (Ji & Carin, 2007) is called SBL here. In Lp reconstruction, \hat{s} = argmin { ||s||_p : X B^{(sp)T} s = y }, \hat{u} = B^{(sp)T} \hat{s}. For L2 we just solve the normal equations, while for L1 this is a linear program. Note that the latter is used in many CS publications (Candès et al., 2006; Donoho, 2006). A method with transform sparsity and total variation potential, called L1 + TV here, is given by the following quadratic program: \hat{u} = argmin_u (1/2) ||y - Xu||_2^2 + \sigma^2 \tau_sp ||B^{(sp)} u||_1 + \sigma^2 \tau_tv ||B^{(tv)} u||_1 (Candès & Romberg, 2004). We used the following code in our experiments:

SBL: www.ece.duke.edu/shji/BCS.html
L1: www.acm.caltech.edu/l1magic/
L1 + TV: www.stanford.edu/mlustig/

4. Experiments

In this section, we provide experimental results for different instances of our framework, comparing to CS and approximate Bayesian methods on synthetic data (Section 4.1), and on the task of measuring natural images (Section 4.2).

4.1. Artificial Setups

It is customary in the CS literature to test methods on synthetic data, generated following the "truly sparse and otherwise unstructured" assumptions under which asymptotic CS theorems are proven. We do the same here, explicitly using the "(non-)uniform spikes" of (Ji & Carin, 2007), but cover some other heavy-tailed distributions as well. It seems that not many signals of real-world interest are strictly and randomly sparse, so that studies looking at the robustness of CS theoretical claims are highly important. In this section, signals are sparse as such, so that B = I and u = s here.

We compare the methods described in Section 3.2. It is important to stress that all methods compared here (except for L2) are based on exactly the same underlying model (1) with B = I, and differences arise only in the nature of computations (approximate Bayesian versus maximum a-posteriori optimisation) and in whether X is sequentially designed (EP, SBL) or chosen at random (Lp reconstruction; we follow CS theory (Candès et al., 2006; Donoho, 2006) and sample rows of X uniformly of unit norm). Results are shown in Figure 1.

[Figure 1 plots omitted; panels (a) Gaussian, (b) Laplacian, (c) Student's t, (d) Decaying, (e) Uniform sparse, (f) Random sparse; curves L2 (rand), L1 (rand), SBL (rand), SBL* (opt), EP* (opt); axes: number of measurements versus reconstruction error.]

Figure 1. Comparison on 6 random synthetic signals u \in R^512. Shown are L2-reconstruction errors (mean ± std.dev. over 100 runs). All methods start with the same random initial X (m = 40); then "(rand)" add random rows, "(opt)" optimise new rows sequentially. Noise variance σ² = 0.005, prior scale τ = 5. SBL: (Ji & Carin, 2007), Lp: Lp reconstruction, EP: our method. (a-c): i.i.d. zero mean, unit variance Gaussian, Laplacian (Eq. 2), Student's t (3 d.o.f.). (d): n/2 of the u_i = 0, n/4 an exponential decay from 1 towards 0, n/4 the negative of that, randomly permuted. (e-f): 20 u_i ≠ 0 at random; (e) uniform spikes, u_i \in {±1}; (f) non-uniform spikes, u_i = ±(4 + |t|), t ~ N(0, 1), as in (Ji & Carin, 2007). Distributions in (d-f) normalised to unit variance.
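For concreteness, here is a sketch of how the six signal classes of Figure 1 might be generated; the exact decay profile in (d) and the spike scaling in (f) are our guesses from the caption, and only classes (d-f) are renormalised to unit variance as stated there.

```python
import numpy as np

def synthetic_signal(kind, n=512, seed=0):
    """Draw one synthetic test signal u in R^n, loosely following the six
    signal classes of Figure 1."""
    rng = np.random.default_rng(seed)
    if kind == 'gaussian':
        u = rng.standard_normal(n)
    elif kind == 'laplacian':
        u = rng.laplace(scale=1.0 / np.sqrt(2.0), size=n)   # unit variance
    elif kind == 'student_t':
        u = rng.standard_t(df=3, size=n) / np.sqrt(3.0)     # unit variance
    elif kind == 'decaying':                                 # class (d)
        decay = np.exp(-np.linspace(0.0, 5.0, n // 4))       # from 1 towards 0
        u = np.concatenate([np.zeros(n // 2), decay, -decay])
        rng.shuffle(u)
    elif kind == 'uniform_spikes':                           # class (e)
        u = np.zeros(n)
        idx = rng.choice(n, size=20, replace=False)
        u[idx] = rng.choice([-1.0, 1.0], size=20)
    elif kind == 'nonuniform_spikes':                        # class (f)
        u = np.zeros(n)
        idx = rng.choice(n, size=20, replace=False)
        u[idx] = rng.choice([-1.0, 1.0], size=20) * (4.0 + np.abs(rng.standard_normal(20)))
    else:
        raise ValueError(kind)
    if kind in ('decaying', 'uniform_spikes', 'nonuniform_spikes'):
        u = u / u.std()                                      # normalise (d-f)
    return u

u = synthetic_signal('nonuniform_spikes')
```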
The "sparsity" (or super-Gaussianity) of the signal distributions increases from (1a) to (1e-f). For Gaussian signals (1a), L2 reconstruction based on random measurements is optimal. While all CS methods and SBL (random and designed) lead to large errors, EP with design matches the L2 results, thus showing robust behaviour. For Laplacian and Student's t signals (1b-c), designed EP outperforms L2 reconstruction significantly, while even the CS L1 method still does worse than simple least squares. SBL performs poorly in all three cases with signals not truly sparse, thus is not robust against rather modest violations of the strict CS assumptions. Its non-robustness is also witnessed by large variations across trials. On the other hand, L2 performs badly on truly sparse signals. In all cases (1d-f), EP with design significantly outperforms all other methods, including designed SBL, with special benefits at rather small numbers of measurements. SBL does better now with truly sparse signals, and is able to outperform L1.

From the superior performance of EP with design on all signal classes, we conclude that experimental design can sequentially find measurements that are significantly better than random ones, even if signals are truly sparse. Moreover, the superior performance is robust against large deviations away from the underlying model, more so even than classical L1 or L2 estimation. The poor performance of SBL (Ji & Carin, 2007) seems to come from their desire for "premature sparsification". During their iterations, many π_i are clamped to +∞ early for efficiency reasons. This does not hurt mean predictions from current observations much, but affects their covariance approximation drastically: most directions not supported by the data right now are somewhat ruled out for further measurements, since posterior variance along them (which should be large!) is shrunk in their method. In contrast, in our EP method, none of the π_i become very large with modest m, and our covariance approximation seems good enough to successfully drive experimental design. Without premature sparsification, our scheme is still efficient, since the most relevant site updates are found actively, and the need to eliminate variables does not arise.

4.2. Natural Images

In this section, we are concerned with finding linear filters which allow for good reconstruction of natural images from noisy measurements thereof. Since natural images exhibit sparsity in wavelet or Fourier domains, CS theory seems to suggest that random measurements should be well suited for this purpose, and there have been considerable efforts to develop hardware which can perform such random measurements cost-efficiently (Duarte et al., 2008). On the other hand, much is known about low-level natural image statistics, and powerful linear measurement transforms have emerged there, such as multi-scale wavelet transforms, based on which natural image reconstruction should be substantially better than for random measurements (Weiss et al., 2007). The sparsity of images in a wavelet domain is highly structured; there is a clear ordering among the coefficients from coarse to fine scales: natural images typically have much more energy in the coarse-scale coefficients, and coefficients with very small values are predominantly found in the fine scales. In our experiments, we employ a simple heuristic for linearly measuring images, called the wavelet heuristic in the sequel: every measurement computes a single wavelet coefficient, and the sequential ordering of the measurements is deterministic top-down, from coarse to fine scales (this ordering follows the recursive definition of such transforms, namely downsampling by a factor of two (coarse), horizontal differences, vertical differences, and diagonal corrections at each stage; our ordering is coarse, horizontal, vertical, diagonal, descending just as the transform does). This ordering is a pragmatic strategy: if mainly the coarse-scale coefficients are far from zero, they should be measured first (note that another problem with common CS assumptions applied to images is that the typical scale of coefficients along a coarse-to-fine ordering follows a smooth power law; it does not exhibit the abrupt drop from "significantly above noise level" to "exactly zero" often required by CS theory). Do state-of-the-art CS reconstruction algorithms, based on random linear image measurements, perform better than simple L2 reconstruction based on the wavelet heuristic? And how does Bayesian sequential design perform on this task, if the model described in Section 3.2 is used? Note that no prior knowledge about typical ordering or dependence among wavelet coefficients is encoded in this model either.
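A sketch of this measurement ordering, using an orthonormal Haar transform for concreteness (the paper does not commit to a particular wavelet family, and the subband naming below is our own convention):

```python
import numpy as np

def haar_level(img):
    """One level of an orthonormal 2-D Haar analysis: coarse approximation
    plus the three detail subbands of that scale."""
    lo = (img[0::2, :] + img[1::2, :]) / np.sqrt(2.0)   # rows low-pass
    hi = (img[0::2, :] - img[1::2, :]) / np.sqrt(2.0)   # rows high-pass
    ll = (lo[:, 0::2] + lo[:, 1::2]) / np.sqrt(2.0)     # coarse approximation
    lh = (lo[:, 0::2] - lo[:, 1::2]) / np.sqrt(2.0)     # "horizontal" detail
    hl = (hi[:, 0::2] + hi[:, 1::2]) / np.sqrt(2.0)     # "vertical" detail
    hh = (hi[:, 0::2] - hi[:, 1::2]) / np.sqrt(2.0)     # "diagonal" detail
    return ll, lh, hl, hh

def wavelet_heuristic_coefficients(img, levels=3):
    """All wavelet coefficients of img, ordered top-down: the coarsest
    approximation first, then details from coarse to fine scales, with a
    horizontal / vertical / diagonal ordering within each scale."""
    details, ll = [], img
    for _ in range(levels):
        ll, lh, hl, hh = haar_level(ll)
        details.append((lh, hl, hh))                    # finest appended first
    ordered = [ll.ravel()]                              # coarsest scale first
    for lh, hl, hh in reversed(details):                # coarse-to-fine details
        ordered += [lh.ravel(), hl.ravel(), hh.ravel()]
    return np.concatenate(ordered)

# The wavelet heuristic with m measurements keeps the first m coefficients.
coeffs = wavelet_heuristic_coefficients(np.random.rand(64, 64))
y_heur = coeffs[:100]
```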
Results of our study are given in Figure 2. In fact, we started our exploration with what is shown in (2a), where 100 initial filters are drawn at random (except for L2 (heur)). Intrigued by the fact that the wavelet heuristic method L2 (heur) outperformed all CS variants significantly, we tried to give them a headstart, supplying m = 100, 200, 400 wavelet heuristic measurements initially (2b-d). However, the systematic under-performance of methods which have sparsity regularizers built in, yet do random rather than wavelet measurements, remains consistently present.

From these results we conclude, much as (Weiss et al., 2007) argued on theoretical grounds, that if natural images are to be measured successively by unit-norm, but otherwise unconstrained linear filters, then drawing these filters at random leads to significantly worse reconstructions than using standard wavelet coefficient filters top-down. While CS theorems are mathematically intriguing, and while there certainly are important applications that benefit from these results (the theoretical CS setting is more extreme than what is really required here, in that there is no prior knowledge about where the non-zeros will lie; we speculate that more suitable applications could lie in steganography, spam or intrusion detection, where a signal has to be detected which has been hidden by an adversary), linear image measurement is probably not among them.

On the other hand, the wavelet heuristic method is significantly outperformed by our EP method, where X is designed sequentially. In (2a), EP quickly recovers from the suboptimal initial random X. Moreover, even when started from the same point as the wavelet heuristic (2b-d), the designed measurements lead to improvements over the heuristic immediately. EP (heur) does EP reconstruction, but based on the same measurements as L2 (heur). While it slightly outperforms L2 reconstruction, the significant difference is due to the choice of the measurements. Our method therefore provides an efficient solution to the problem posed in (Weiss et al., 2007), namely how to learn measurements automatically from data, starting from little concrete domain knowledge.

On the particular problem of measuring images linearly, our findings should be put into perspective by noting that the L2 wavelet heuristic is vastly faster to compute (EP sequential design is still very efficient: a typical run on one image took 53 min on a 64-bit 2.33 GHz AMD machine, for n = 4096 and q = 12160 sites, with 16785 initial EP updates, then 308 increments of X by 3 rows each, and on average only 8.8 site updates needed to regain EP convergence, up to 85 updates after some increments). Moreover, X is optimised sequentially, particular to the image u (but without knowing the underlying u), while the wavelet heuristic filters are always the same. Finally, the final X is dense and unstructured. However, our method can be used in the same way to address applications where strong structural constraints on allowable X are present, and where wavelet (or purely random) measurements are not an option.

In this setting, SBL (Ji & Carin, 2007) performs much worse than all other methods tried, whether using random, wavelet or designed measurements. Results for SBL in cases (b-d) were even worse and are not included, to facilitate comparison among the others. This is most probably an extreme instance of the problem noted in Section 4.1: premature sparsification, in light of not strictly sparse signals, leads to poor results even with random X. Their covariance estimates seem too bad to steer sequential design in a useful direction (in cases (b-d), the top wavelet coefficients are measured initially, so their method confidently starts with a highly over-sparse solution and fails; note that, as opposed to EP, we restarted the SBL code after each new measurement, so that poor current solutions are not inherited when new data is obtained). Finally, the deterioration of L1, when adding random to initial wavelet measurements, is somewhat puzzling, especially since it does not happen for L1 + TV. These additional measurements provide novel information about the true u, so a valid inference method should rather improve.
[Figure 2 plots omitted; panels (a) 100 random initial measurements, (b) 100, (c) 200, (d) 400 initial wavelet measurements; curves SBL (rand), SBL* (opt), L1 (rand), L1+TV (rand), L2 (heur), EP (heur), EP* (opt); axes: number of measurements versus reconstruction error.]

Figure 2. Experiments for measuring natural images (64 × 64 = 4096 pixels). Shown are L2-reconstruction errors averaged over 25 grayscale images typically used in computer vision research (from decsai.ugr.es/cvg/dbimagenes/) (± 1/4 std.dev. for "*"). Noise level σ² = 0.005. SBL: (Ji & Carin, 2007), Lp: Lp reconstruction, L1 + TV: Lasso with TV/wavelet penalties, EP: our method. True σ² supplied, parameters chosen optimally for each method individually: τ_sp = τ_tv = 0.075 (L1 + TV), τ_sp = 0.075, τ_tv = 0.5 (EP). New rows of X: random unit norm (rand), actively designed (opt), according to the wavelet heuristic (heur). (a): start with m = 100, X random unit norm. (b-d): start with m = 100, 200, 400, X according to the wavelet heuristic.

5. Discussion

We have shown how to address the compressive sensing problem with Bayesian experimental design, where designs are optimised to rapidly decrease uncertainty and do not have to be chosen at random. In a large study about linearly measuring natural images, we show that CS reconstruction methods based on randomly drawn filters are outperformed significantly by standard least squares reconstruction measuring coarse-scale wavelet coefficients. Our findings suggest that the applicability of CS results (with their insistence on strict and unstructured signal sparsity) to natural image applications should be reconsidered. We also show that our Bayesian sequential design method, starting from a model with little domain knowledge built in, is able to find filters with significantly better reconstruction properties than top-down wavelet coefficients. Our findings indicate that efficient Bayesian experimental design techniques are highly promising for CS applications of different kinds just as well.

Why do random measurement filters enjoy good properties in CS theory, but are not useful in the case of natural images?
We think that this seeming contradiction really comes from an erroneous "extrapolation" of what CS theorems really mean. Any structure apart from a randomly distributed sparsity pattern is ignored there. Also, they are minimax results, in that the reconstruction error for the worst sparsity pattern is bounded. But undersampled image reconstruction is not a worst-case problem, and much is known about the sparsity structure of natural images. It may be that L1 or L1 + TV are minimax methods (for known B), but that does not imply much about their typical performance. We suspect that our doubts about CS with random measurements extend beyond natural images to other signals of common interest in normal non-adversarial situations, since interest in a signal class implies that statistical knowledge about it beyond random sparsity has been obtained.

Our experience with the method of (Ji & Carin, 2007), which we compare against in our study, raises another, more speculative yet interesting point. Several methods very frequently used in machine learning today can loosely be summarised as trying to detect very sparse solutions early on, mainly with the aim of high computational efficiency. For example, SBL (Tipping, 2001) is much more aggressive in this respect than our EP method here. Early sparsification does not seem to hurt mean prediction performance much, and thus is embraced for efficiency. However, our experiences here indicate that it is the covariance (or uncertainty) estimates that can be badly hurt by such sparsity-by-elimination processes, and that in contexts such as experimental design, where covariances are more important than predictive means, their application should probably be avoided. The challenge is then to develop methods that run efficiently without eliminating many variables early on, and our selective site updating method for EP is a step in that direction.

References

Candès, E., & Romberg, J. (2004). Practical signal recovery from random projections. Proceedings of SPIE.

Candès, E., Romberg, J., & Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theo., 52, 489–509.

Donoho, D. (2006). Compressed sensing. IEEE Trans. Inf. Theo., 52, 1289–1306.

Duarte, M., Davenport, M., Takhar, D., Laska, J., Sun, T., Kelly, K., & Baraniuk, R. (2008). Single pixel imaging via compressive sampling. To appear in IEEE Signal Processing Magazine.

Elad, M. (2007). Optimized projections for compressed sensing. IEEE Transactions on Signal Processing.

Gerwinn, S., Macke, J., Seeger, M., & Bethge, M. (2008). Bayesian inference for spiking neuron models with a sparsity prior. Advances in NIPS 20.

Ji, S., & Carin, L. (2007). Bayesian compressive sensing and projection optimization. Proceedings of ICML 24.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. Uncertainty in AI 17.

Seeger, M. (2008). Bayesian inference and optimal design in the sparse linear model. To appear in Journal of Machine Learning Research.

Seeger, M., Steinke, F., & Tsuda, K. (2007). Bayesian inference and optimal design in the sparse linear model. AI and Statistics 11.

Simoncelli, E. (1999). Modeling the joint statistics of images in the Wavelet domain. Proceedings 44th SPIE (pp. 188–195).

Tipping, M. (2001). Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1, 211–244.

Weiss, Y., Chang, H., & Freeman, W. (2007). Learning compressed sensing.
Snowbird Learning Workshop, Allerton, CA.