Multi-Task Compressive Sensing with Dirichlet Process Priors

Yuting Qi (yuting@ee.duke.edu), Dehong Liu (liudh@ee.duke.edu)
Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA

David Dunson (dunson@stat.duke.edu)
Department of Statistical Science, Duke University, Durham, NC 27708 USA

Lawrence Carin (lcarin@ee.duke.edu)
Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

Compressive sensing (CS) is an emerging field that, under appropriate conditions, can significantly reduce the number of measurements required for a given signal. In many applications, one is interested in multiple signals that may be measured in multiple CS-type measurements, where here each signal corresponds to a sensing "task". In this paper we propose a novel multi-task compressive sensing framework based on a Bayesian formalism, where a Dirichlet process (DP) prior is employed, yielding a principled means of simultaneously inferring the appropriate sharing mechanisms as well as CS inversion for each task. A variational Bayesian (VB) inference algorithm is employed to estimate the full posterior on the model parameters.

1. Introduction

Over the last two decades researchers have considered sparse signal representations in terms of orthonormal basis functions (e.g., the wavelet transform). For example, consider an m-dimensional real-valued signal u and assume an m × m orthonormal basis matrix Ψ; we may then express u = Ψθ, where θ is an m-dimensional column vector of weighting coefficients. For most natural signals there exists an orthonormal basis Ψ such that θ is sparse. Consider now an approximation to u, û = Ψθ̂, where θ̂ approximates θ by retaining the largest N coefficients and setting the remaining m − N coefficients to zero; due to the aforementioned sparseness properties, ||u − û||_2 is typically very small even for N ≪ m. Conventional techniques require one to measure the m-dimensional signal u but finally discard m − N coefficients (Charilaos, 1999). This sample-then-compress framework is often wasteful, since the signal acquisition is potentially expensive and only a small amount of data N is eventually required for the accurate approximation û. One may therefore consider the following fundamental question: Is it possible to directly measure the informative part of the signal? Recent research in the field of compressive sensing shows that this is indeed possible (Candès, 2006) (Donoho, 2006).

Exploiting the same sparseness properties of u employed in transform coding (u = Ψθ with θ sparse), in compressive sensing one measures v = Φθ, where v is an n-dimensional vector with n < m and Φ is the n × m sensing matrix. There are several ways in which Φ may be constituted, with the reader referred to (Donoho, 2006) for details. In most cases Φ is represented as Φ = TΨ, where T is an n × m matrix with components constituted randomly (Tsaig & Donoho, 2006); hence, the CS measurements correspond to projections of u with the rows of T:

v = Tu = TΨθ = Φθ,

which is an underdetermined problem. Assuming the signal u is N-sparse in Ψ, implying that the coefficients θ have only N nonzero values (Candès, 2006) (Donoho, 2006), Candès, Romberg and Tao (Candès et al., 2006) show that, with overwhelming probability, θ (and hence u) is recovered via

min ||θ||_{ℓ1}   s.t.   v = Φθ,     (1)
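For concreteness, the ℓ1 recovery problem in (1) can be posed as a linear program by splitting θ into its positive and negative parts. The following is a minimal sketch only, not the Bayesian algorithm developed later in the paper; it assumes an identity basis Ψ (so Φ is simply a Gaussian random matrix) and uses scipy.optimize.linprog, with all sizes chosen for illustration.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, N = 128, 60, 8            # signal length, number of measurements, sparsity

# N-sparse coefficient vector theta (Psi = I in this toy example, so u = theta)
theta = np.zeros(m)
support = rng.choice(m, N, replace=False)
theta[support] = rng.standard_normal(N)

Phi = rng.standard_normal((n, m)) / np.sqrt(n)   # random sensing matrix
v = Phi @ theta                                   # noise-free CS measurements

# min ||theta||_1  s.t.  v = Phi theta, as an LP over [theta_plus; theta_minus]
c = np.ones(2 * m)
A_eq = np.hstack([Phi, -Phi])
res = linprog(c, A_eq=A_eq, b_eq=v, bounds=[(0, None)] * (2 * m), method="highs")
theta_hat = res.x[:m] - res.x[m:]

print("relative error:", np.linalg.norm(theta_hat - theta) / np.linalg.norm(theta))
```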
if the number of CS measurements n > C · N · log m (C is a small constant); if N is small (i.e., if u is highly compressible in the basis Ψ) then n ≪ m. In practice the signal u is not exactly sparse, but a large number of coefficients in the basis Ψ may be discarded with minimal error in reconstructing u; in this practical case the CS framework has also been shown to operate effectively.

The problem in (1) may be solved by linear programming (Chen et al., 1999) and greedy algorithms (Tropp & Gilbert, 2005) (Donoho et al., 2006). A Bayesian compressive sensing (BCS) methodology is proposed in (Ji et al., 2007b), posing the CS inversion problem as a linear-regression problem with a sparseness prior on the regression weights θ. One advantage of BCS is that this framework may be extended to multi-task compressive sensing (Ji et al., 2007a), in which each CS measurement v_i = Φ_i θ_i represents a sensing "task" and the objective is to jointly invert for all {θ_i}_{i=1,M}, through an appropriate sharing of information between the M data collections. In multi-task CS, one may potentially reduce the number of measurements required for each task by exploiting the statistical relationships among the tasks; examples include "Distributed Compressed Sensing" (DCS) (Baron et al., 2005), the empirical Bayesian "Simultaneous Sparse Approximation" strategy in (Wipf & Rao, 2007), and a hierarchical Bayesian model for multi-task CS (Ji et al., 2007a). However, these multi-task algorithms assume all tasks are appropriate for sharing, which may not be true in many practical applications.

In this paper we introduce a Dirichlet process (DP) prior (West et al., 1994) to the hierarchical BCS model, which can simultaneously perform the inversion of the underlying signals and infer the appropriate sharing/clustering structure across the M tasks. As detailed below, an important property of the DP for the work presented here is that it provides a tool for semiparametric clustering (i.e., the number of clusters need not be set in advance). The DP-based hierarchical model is employed to realize the desired property of simultaneous clustering and CS inversion of the M measurements {v_i}_{i=1,M}. A variational Bayes (Blei & Jordan, 2004) inference algorithm is considered, yielding a full posterior over the model parameters θ_i.

2. Multi-Task CS Modeling with DP Priors

2.1. Multi-Task CS Formulation for Global Sharing

Let v_i represent the CS measurements associated with task i, and assume a total of M tasks. The i-th CS measurement may be represented as

v_i = Φ_i θ_i + ε_i,     (2)

where the CS measurements v_i are characterized by an n_i-dimensional real vector, the sensing matrix Φ_i corresponding to task i is of size n_i × m, and θ_i is the set of (sparse) transform coefficients associated with task i. The j-th coefficient of θ_i is denoted θ_{i,j}. The residual error vector ε_i ∈ R^{n_i} is modeled as n_i i.i.d. draws from a zero-mean Gaussian distribution with an unknown precision α_0 (variance 1/α_0); the residual corresponds to the error imposed by setting the small transform coefficients exactly to zero when performing the CS inversion.

We impose a hierarchical sparseness prior on the parameters θ_i, the lower level of which is

p(θ_i | α_i) = ∏_{j=1}^{m} N(θ_{i,j} | 0, α_{i,j}^{-1}),     (3)

where α_{i,j} is the j-th component of the precision vector α_i. To impose sparseness, on a layer above a Gamma hyperprior is employed independently on the precisions α_{i,j}. The likelihood function for the parameters θ_i and α_0, given the CS measurements v_i, may be expressed as

p(v_i | θ_i, α_0) = (α_0 / 2π)^{n_i/2} exp( −(α_0/2) ||v_i − Φ_i θ_i||^2 ).     (4)

Concerning the aforementioned hyperprior, for the multi-task CS model proposed in (Ji et al., 2007a), the parameters satisfy α_i = α for i = 1, ..., M, with α ∼ ∏_{j=1}^{m} Ga(c, d). In this framework the CS data from all M tasks are used to jointly infer the hyper-parameters (global processing). However, the assumption in such a setting is that it is appropriate to employ all of the M tasks jointly to infer the hyper-parameters. One may envision problems for which the M tasks may be clustered into several sets of tasks (with the union of these sets constituting the M tasks), and data sharing may only be appropriate within each cluster. Through use of the Dirichlet process (DP) (Escobar & West, 1995) employed as the prior over α_i, we simultaneously cluster the multi-task CS data, and within each cluster the CS inversion is performed jointly. Consequently, we need no longer assume that all CS data from the M tasks are appropriate for sharing.
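As a concrete illustration of the global-sharing formulation in (2)-(4), the sketch below generates M tasks whose coefficients share a common sparseness pattern and forms the corresponding CS measurements; all sizes and the shared-support construction are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
M, m, n_i = 5, 256, 90          # tasks, coefficient length, measurements per task
shared_support = rng.choice(m, 30, replace=False)   # non-zero locations shared by all tasks

tasks = []
for i in range(M):
    theta_i = np.zeros(m)
    theta_i[shared_support] = rng.standard_normal(30)      # task-specific amplitudes
    Phi_i = rng.standard_normal((n_i, m)) / np.sqrt(n_i)    # task-specific sensing matrix
    alpha0 = 1e4                                            # noise precision (variance 1/alpha0)
    eps_i = rng.standard_normal(n_i) / np.sqrt(alpha0)
    v_i = Phi_i @ theta_i + eps_i                           # eq. (2)
    tasks.append((Phi_i, v_i, theta_i))

def log_likelihood(v, Phi, theta, alpha0):
    """Gaussian log-likelihood of eq. (4)."""
    n = len(v)
    r = v - Phi @ theta
    return 0.5 * n * np.log(alpha0 / (2 * np.pi)) - 0.5 * alpha0 * (r @ r)
```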
2.2. Dirichlet Process for Clustered Sharing

The Dirichlet process, denoted DP(α, G_0), is a measure on measures, and is parameterized by a positive scaling parameter α and the base distribution G_0. Assume we have {α_i}_{i=1,M} and that each α_i is drawn i.i.d. from G, where G is itself a random measure drawn from a Dirichlet process:

α_i | G ∼ G,  i = 1, ..., M,     (5)
G ∼ DP(α, G_0),

where G_0 is a non-atomic base measure. Sethuraman (Sethuraman, 1994) provides an explicit characterization of G in terms of a stick-breaking construction. Consider two infinite collections of independent random variables γ_k and Λ_k, k = 1, 2, ..., where the γ_k are drawn i.i.d. from a Beta distribution, denoted Beta(1, α), and the Λ_k are drawn i.i.d. from the base distribution G_0. The stick-breaking representation of G is then defined as

G = Σ_{k=1}^{∞} w_k δ_{Λ_k},     (6)
w_k = γ_k ∏_{l=1}^{k−1} (1 − γ_l),     (7)

where γ_k | α ∼ Beta(1, α) and Λ_k | G_0 ∼ G_0. This representation makes explicit that the random measure G is discrete with probability one, and that the support of G consists of an infinite set of atoms Λ_k drawn independently from G_0. The mixing weight w_k for atom Λ_k is given by successively breaking a unit-length "stick" into an infinite number of pieces, with 0 ≤ w_k ≤ 1 and Σ_{k=1}^{∞} w_k = 1.

2.3. Multi-Task CS with DP Priors

We employ a DP prior, in its stick-breaking representation, for α_i in the model in (3); that is, α_i | G ∼ G with G = Σ_{k=1}^{∞} w_k δ_{Λ_k}. The base distribution G_0 corresponds to the sparseness-promoting representation discussed in Sec. 2.1. To facilitate posterior computation we introduce an indicator variable z_i, with z_i = k indicating α_i = Λ_k. The DP multi-task CS model is therefore expressed as

v_i | θ_i, α_0 ∼ N(Φ_i θ_i, α_0^{-1} I),
θ_{i,j} | z_i, {Λ_k}_{k=1,K} ∼ N(0, Λ_{z_i,j}^{-1}),
z_i | {w_k}_{k=1,K} ∼ Multinomial({w_k}_{k=1,K}),
w_k = γ_k ∏_{l=1}^{k−1} (1 − γ_l),
γ_k | α ∼ Beta(1, α),
α | e, f ∼ Ga(e, f),
Λ_k | c, d ∼ ∏_{j=1}^{m} Ga(c, d),
α_0 | a, b ∼ Ga(a, b),     (8)

where i = 1, ..., M, j = 1, ..., m, k = 1, ..., K, 1 ≤ K ≤ ∞, and Λ_{k,j} denotes the j-th element of Λ_k. For convenience, we denote the model in (8) as DP-MT CS. In practice K is chosen as a relatively large integer (e.g., K = M if M is relatively large), which yields a negligible difference compared to the true DP (Ishwaran & James, 2001) while making the computation practical.

The choice of G_0 here is consistent with the sparseness-promoting hierarchical prior discussed in Sec. 2.1. Consider task i and assume α_i takes the value Λ_k; the prior distribution over θ_i is then

p(θ_i | c, d) = ∏_{j=1}^{m} ∫ N(θ_{i,j} | 0, Λ_{k,j}^{-1}) Ga(Λ_{k,j} | c, d) dΛ_{k,j}.     (9)

Equation (9) is a type of automatic relevance determination (ARD) prior, which enforces sparsity over θ_i (Tipping, 2001). We usually set c and d very close to zero (e.g., 10^{-4}) to impose a broad prior over Λ_k; this allows the posteriors on many of the elements of Λ_k to concentrate at very large values, so that the posteriors on the associated elements of θ_i concentrate at zero, and the sparseness of θ_i is thereby achieved (MacKay, 1994) (Neal, 1996).
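The truncated stick-breaking construction in (6)-(8) is straightforward to sample. The following sketch draws the weights w_k, the atoms Λ_k from the Gamma base measure, and the task indicators z_i; the truncation level and hyper-parameter values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
K, m = 20, 256                 # truncation level and coefficient dimension (illustrative)
alpha, c, d = 1.0, 1e-4, 1e-4  # DP concentration and Gamma base-measure parameters

# Stick-breaking weights, eqs. (6)-(7): w_k = gamma_k * prod_{l<k} (1 - gamma_l)
gamma = rng.beta(1.0, alpha, size=K)
w = gamma * np.concatenate(([1.0], np.cumprod(1.0 - gamma)[:-1]))

# Atoms Lambda_k drawn from the base measure G0 = prod_j Ga(c, d)
Lambda = rng.gamma(shape=c, scale=1.0 / d, size=(K, m))

# Cluster indicators z_i and task precisions alpha_i = Lambda_{z_i}, as in (8)
M = 10
z = rng.choice(K, size=M, p=w / w.sum())   # renormalize since the stick is truncated
alpha_tasks = Lambda[z]                     # each row is the precision vector of one task
```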
Since these posteriors have "heavy tails" compared to a Gaussian distribution, they allow for more robust shrinkage and borrowing of information. Similarly, the hyper-parameters a, b, e, and f are all set to small values to impose non-informative priors over α_0 and α, respectively.

3. Variational Bayesian Inference

One may perform inference via MCMC (Gilks et al., 1996); however, this requires vast computational resources, and MCMC convergence is often difficult to diagnose (Gilks et al., 1996). Variational Bayes inference is therefore introduced as a relatively efficient method for approximating the posterior. From Bayes' rule, we have

p(H | V, Θ) = p(V | H) p(H | Θ) / ∫ p(V | H) p(H | Θ) dH,     (10)

where V = {v_i}_{i=1,M} are the CS measurements from the M CS tasks, H = {α_0, α, γ, {z_i}_{i=1,M}, {θ_i}_{i=1,M}, {Λ_k}_{k=1,K}} are the hidden variables (with γ = {γ_k}_{k=1,K}), and Θ = {a, b, c, d, e, f} are the known hyper-parameters. The integration in the denominator of (10), called the marginal likelihood or "evidence" (Beal, 2003), is generally intractable to compute analytically. Instead of directly estimating p(H | V, Θ), variational methods seek a distribution q(H) that approximates the true posterior distribution p(H | V, Θ). Consider the log marginal likelihood

log p(V | Θ) = F(q(H)) + D_KL( q(H) || p(H | V, Θ) ),     (11)

where

F(q(H)) = ∫ q(H) log [ p(V | H, Θ) p(H | Θ) / q(H) ] dH,     (12)

and D_KL( q(H) || p(H | V, Θ) ) is the KL divergence between q(H) and p(H | V, Θ). The approximation of p(H | V, Θ) by q(H) is achieved by maximizing F(q(H)), which forms a strict lower bound on log p(V | Θ). In this way the estimation of q(H) may be made computationally tractable. In particular, for computational convenience, q(H) is expressed in a factorized form, with the same functional form as the priors p(H | Θ). For the model in (8), we assume

q(H) = q(α_0) q(α) q(γ) ∏_{i=1}^{M} q(z_i) ∏_{i=1}^{M} q(θ_i) ∏_{k=1}^{K} q(Λ_k),     (13)

where q(α_0) ∼ Ga(ã, b̃), q(α) ∼ Ga(ẽ, f̃), q(γ) ∼ ∏_{k=1}^{K−1} Beta(γ̃_{1k}, γ̃_{2k}), q(z_i) ∼ Multinomial(φ_i) with φ_i = {φ_{i,k}}_{k=1,K}, q(θ_i) ∼ N(μ_i, Σ_i), and q(Λ_k) ∼ ∏_{j=1}^{m} Ga(Λ_{k,j} | c̃_{k,j}, d̃_{k,j}). By substituting (13) and (8) into (12), the lower bound F(q) is readily obtained. The optimization of the lower bound F(q) is realized by taking functional derivatives with respect to each of the q(·) distributions while fixing the other q distributions, and setting ∂F(q)/∂q(·) = 0 to find the distribution q(·) that increases F (Beal, 2003). The update equations for the variational posteriors are summarized in the Appendix. The convergence of the algorithm is monitored by the increase of the lower bound F.
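The decomposition in (11)-(12) can be checked numerically on a toy conjugate model where everything is available in closed form. The sketch below is an illustration only, not part of the DP-MT model: it uses a single Gaussian observation with a Gaussian prior and verifies that F(q) + D_KL(q || posterior) equals the log evidence for an arbitrary Gaussian q.

```python
import numpy as np

# Toy model: v | theta ~ N(theta, 1), theta ~ N(0, 1), a single observation v.
v = 1.3
post_mean, post_var = v / 2.0, 0.5                                  # exact posterior N(v/2, 1/2)
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - v**2 / (2 * 2.0)    # v ~ N(0, 2)

# An arbitrary Gaussian q(theta) = N(mq, sq2)
mq, sq2 = 0.2, 0.3

# F(q) = E_q[log p(v|theta)] + E_q[log p(theta)] - E_q[log q(theta)]   (eq. 12)
E_loglik   = -0.5 * np.log(2 * np.pi) - 0.5 * ((v - mq) ** 2 + sq2)
E_logprior = -0.5 * np.log(2 * np.pi) - 0.5 * (mq ** 2 + sq2)
entropy_q  = 0.5 * np.log(2 * np.pi * np.e * sq2)
F = E_loglik + E_logprior + entropy_q

# KL(q || posterior) between two univariate Gaussians
kl = 0.5 * (sq2 / post_var + (mq - post_mean) ** 2 / post_var - 1.0
            + np.log(post_var / sq2))

print(F + kl, log_evidence)   # the two numbers agree, illustrating eq. (11)
```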
One practical issue of variational Bayesian inference is that the VB algorithm converges to a local maximum of the lower bound on the marginal log-likelihood, since the true posterior is usually multi-modal. Averaging over multiple runs of the algorithm from different starting points may therefore mitigate this issue and yield better performance.

4. Experimental Results

4.1. Synthetic data

In the first set of examples we consider synthesized data to examine the sharing mechanisms associated with the DP-MT CS inversion. In the first example we generate data with 10 underlying clusters. Figure 1 shows ten "templates", each corresponding to a 256-length signal with 30 non-zero components (the values of those non-zero components are drawn randomly from N(0, 1)). The non-zero locations are chosen randomly for each template such that the correlation between these sparse templates is zero. For each template, five sparse signals (each with 256 samples) are generated by randomly selecting three non-zero elements of the associated template and setting those coefficients to zero, while three zero-amplitude points in the template are randomly set to be non-zero (each of these three non-zero values again drawn from N(0, 1)); a code sketch of this generation procedure is given below. In this manner the sparseness properties of the five signals generated from a given template are highly related, while the ten clusters of sparse signals have distinct sparseness properties. For each sparse signal a set of CS random projections is performed, with the components of each projection vector drawn randomly from N(0, 1) (Donoho, 2006). The reconstruction error is defined as ||û − u||_2 / ||u||_2, where û is the recovered signal and u is the original one.

Figure 1. Ten template signals for the 10-cluster case.

Figure 2 shows the reconstruction errors of the CS inversion by DP-MT CS, as well as by the global-sharing MT CS discussed in Sec. 2.1 (denoted MT* CS for simplicity), as a function of the number of CS measurements. Both CS algorithms are based on the VB DP-MT algorithm described in Sec. 3; however, for MT*, we set φ_{i,1} = 1 and φ_{i,k} = 0 for k > 1 for all tasks, and fix the values of φ_{i,k} in each iteration without update. The experiment was run 100 times (with 100 different random generations of the random projections as well as of the initial memberships), and the error bars in Figure 2 represent the standard deviation about the mean. From Figure 2 the advantage of the DP-based formulation is evident. In Figure 2 we also present histograms of the number of distinct clusters inferred by DP-MT CS. It is clear from Figure 2 that the algorithm tends to infer about 10 clusters, but there is some variation, with the variation in the number of clusters increasing as the number of CS measurements decreases.

Figure 2. Multi-task CS inversion error (%) for DP-MT and MT* CS for the ten-cluster case. (a) Reconstruction errors, (b) histograms of the number of clusters yielded by DP-MT.

To further examine the impact of the number of underlying clusters, we now consider examples in which the data are generated with 5, 3, 2 and 1 underlying clusters.
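A minimal sketch of the synthetic-data generation just described (ten templates, five signals per template, three non-zero locations swapped per signal), together with the reconstruction-error metric used throughout this section; the specific random draws are of course not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n_nonzero, n_templates, per_template = 256, 30, 10, 5

def make_template():
    t = np.zeros(m)
    idx = rng.choice(m, n_nonzero, replace=False)
    t[idx] = rng.standard_normal(n_nonzero)
    return t

def perturb(template, n_swap=3):
    """Zero out n_swap non-zero entries and activate n_swap formerly-zero entries."""
    s = template.copy()
    on = np.flatnonzero(s)
    off = np.flatnonzero(s == 0)
    s[rng.choice(on, n_swap, replace=False)] = 0.0
    s[rng.choice(off, n_swap, replace=False)] = rng.standard_normal(n_swap)
    return s

templates = [make_template() for _ in range(n_templates)]
signals = [perturb(t) for t in templates for _ in range(per_template)]   # 50 tasks in total

def recon_error(u_hat, u):
    """Reconstruction error ||u_hat - u||_2 / ||u||_2 used throughout Sec. 4."""
    return np.linalg.norm(u_hat - u) / np.linalg.norm(u)
```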
For each of the templates, five sparse signals are generated randomly, in the manner discussed above for the ten-cluster case. Figures 3-6 show results in the form considered in Figure 2, for the cases of 5, 3, 2 and 1 underlying clusters in the data generation. One notes the following phenomenon: as the number of underlying clusters diminishes, the difference between the DP-MT and MT* CS algorithms diminishes, with almost identical performance witnessed for the cases of three and two clusters; this phenomenon is particularly evident as the number of CS measurements increases. As an aside, we also note that the DP-based inference of the number of underlying clusters adapts well to the underlying data generation.

Figure 3. Multi-task CS inversion error (%) for DP-MT and MT* CS for the five-cluster case. (a) Reconstruction errors, (b) histograms of the number of clusters yielded by DP-MT.

Figure 4. Multi-task CS inversion error (%) for DP-MT and MT* CS for the three-cluster case. (a) Reconstruction errors, (b) histograms of the number of clusters yielded by DP-MT.

We now provide an explanation for the relationship between the DP-MT and MT* CS algorithms. Two sparse signals like those in Figure 1 have distinct non-zero coefficients, and therefore one would typically infer that they have dissimilar sparseness properties. However, they share many zero-amplitude coefficients. If we consider M sparse signals, and all of the M signals share the same large set of zero-amplitude coefficients, then they are appropriate for sharing even if the associated (small number of) non-zero coefficients are entirely distinct. For the 10-cluster case, because of the large number of clusters, the templates do not cumulatively share the same set of zero-amplitude coefficients; in this case global sharing for CS inversion is inappropriate, and the same is true for the 5-cluster case. However, for the 3- and 2-cluster cases, the templates share a significant number of zero-amplitude coefficients, and therefore global sharing is appropriate. This underscores that global sharing across M tasks is appropriate when there is substantial sharing of zero-amplitude coefficients, even when all of the non-zero-amplitude coefficients are distinct; the short computation below makes this overlap concrete. However, one typically does not know a priori whether global sharing is appropriate (as it was not in Figures 2 and 3), and the DP-based formulation therefore offers generally high-quality results both when global sharing is appropriate and when it is not.
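To make the zero-support argument concrete, the following sketch (using the same illustrative template construction as above) counts how many coefficient locations are zero in every template as the number of clusters grows; the specific counts are only indicative, but the trend mirrors the discussion above.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n_nonzero = 256, 30

def random_support():
    return set(rng.choice(m, n_nonzero, replace=False).tolist())

for n_clusters in (1, 2, 3, 5, 10):
    supports = [random_support() for _ in range(n_clusters)]
    nonzero_union = set().union(*supports)
    shared_zeros = m - len(nonzero_union)          # locations that are zero in *all* templates
    print(n_clusters, "clusters ->", shared_zeros, "commonly zero coefficients")
```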
We consider the sharing mechanisms manifested in two examples from the three-cluster case considered in Figure 4. The truncation level K can either be set to a large number or be estimated, in principle, by increasing the number of sticks included until the log-marginal likelihood (the lower bound) in the VB algorithm starts to decrease. In this example we set the number of sticks in the DP formulation to K = 8, which corresponds to the largest value of the lower bound on the log-marginal likelihood, and we show the stick (cluster) with which each of the 15 tasks was grouped at the end of the inference process. These two examples were selected because they both yielded roughly the same average CS inversion accuracy across the 15 CS inversions (0.40% and 0.38% error), yet the two runs yield distinct clusterings. This example emphasizes that, because the underlying signals are very sparse and have significant overlap in their sets of zero-amplitude coefficients, the particular clustering manifested by the DP formulation is not particularly important for the final CS-inversion quality.

Figure 5. Multi-task CS inversion error (%) for DP-MT and MT* CS for the two-cluster case. (a) Reconstruction errors, (b) histograms of the number of clusters yielded by DP-MT.

Figure 6. Multi-task CS inversion error (%) for DP-MT and MT* CS for the one-cluster case. (a) Reconstruction errors, (b) histograms of the number of clusters yielded by DP-MT.

Figure 7. Two example runs of the DP-MT CS clustering for the three-cluster case (100 CS measurements). The grey scale denotes the probability that a given task is associated with a particular cluster. (a) Reconstruction error of 0.40%, (b) reconstruction error of 0.38%.

4.2. Real images

In the following examples, applied to imagery, we perform comparisons between DP-MT, MT*, and also single-task Bayesian CS (ST), in which the CS inversion is performed independently on each of the tasks. ST CS is realized with the same algorithm as DP-MT and MT*, but with φ_{i,1} = 1 and φ_{i,k} = 0 for k > 1, and considering only one CS task at a time (M = 1).

We conduct two examples on CS reconstruction of typical imagery from "natural" scenes. All the images in these examples are of size 256 × 256 and are highly compressible in a wavelet basis. We choose the "Daubechies 8" wavelet as our orthonormal basis, and the sensing matrix is constructed in the same manner as in Sec. 4.1. In this experiment we adopt a hybrid CS scheme, in which CS is used to measure only the fine-scale wavelet coefficients, while all coarse-scale coefficients are retained (no compression at the coarse scale) (Tsaig & Donoho, 2006). We also assume all the wavelet coefficients at the finest scale are zero, and only consider (estimate) the other 4096 coefficients. In both examples, the coarsest scale is j0 = 3 and the finest scale is j1 = 6.
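A minimal 1-D sketch of the hybrid CS scheme just described: coarse-scale wavelet coefficients are kept exactly while the fine-scale coefficients are measured through random projections. It uses the PyWavelets package; the decomposition depth, toy signal, and sizes are illustrative assumptions, and the paper's experiments operate on 2-D images rather than this 1-D signal.

```python
import numpy as np
import pywt

rng = np.random.default_rng(5)
u = np.cumsum(rng.standard_normal(1024))           # a toy, roughly smooth 1-D signal

# Daubechies-8 wavelet decomposition: coeffs = [cA_L, cD_L, ..., cD_1]
coeffs = pywt.wavedec(u, "db8", level=5)
coarse = coeffs[0]                                  # kept exactly (no compression)
fine = np.concatenate(coeffs[1:])                   # fine-scale coefficients, measured via CS

n = len(fine) // 3                                  # compressive measurements of the fine scales
T = rng.standard_normal((n, len(fine))) / np.sqrt(n)
v = T @ fine                                        # CS measurements of the fine-scale coefficients

print("coarse kept:", coarse.size, "| fine coeffs:", fine.size, "| CS measurements:", v.size)
```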
We use the mean of the posterior over θ to perform the image reconstruction. The reconstruction error is defined as ||û − u||_2 / ||u||_2, where û is the reconstructed image and u is the original one.

In the first example, we choose 12 images from three different scenes. To reconstruct an image, we perform an inverse wavelet transform on the CS-estimated coefficients. In Figure 8(a) we show the images reconstructed from all 4096 measurements using linear reconstruction (θ̂ = Φ^T v), which represents the best possible performance. Figures 8(b)-(d) show the images reconstructed by the DP-MT, MT*, and ST algorithms, respectively, with n = 1764 CS measurements for each task (1700 measurements at the fine scales and 64 at the coarse scale). The reconstruction errors for these four methods are compared in Table 1. We observe that the DP-MT algorithm reduces the reconstruction error relative to the ST method, indicating that the multi-task CS inversion shares information among tasks and therefore requires fewer measurements than single-task learning to achieve the same performance. In addition to the CS inversion, DP-MT also yields a task clustering, inferred simultaneously with the CS inversion; while this clustering is not the final product of interest, it is informative, with results shown in Figure 9. Note that the algorithm infers three clusters, each corresponding to a particular class of imagery. By contrast, the MT* algorithm imposes complete sharing among all tasks, and the results in Table 1 indicate that this undermines performance.

Figure 8. CS reconstructions for the 12-task example: (a) linear, (b) DP-MT, (c) MT*, (d) ST.

Figure 9. Sharing mechanism for the 12 tasks in Figure 8 yielded by DP-MT CS.

Table 1. Reconstruction error (%) for the example in Figure 8.

            DP-MT    MT*      ST    Linear
  Task 1     8.79   10.19   10.28    6.66
  Task 2     7.89    9.14   10.37    6.20
  Task 3     9.69   11.49   12.81    7.08
  Task 4     8.04    9.18   10.28    6.14
  Task 5    14.33   16.94   18.37   12.41
  Task 6    13.22   15.59   16.18   11.70
  Task 7    15.18   17.46   18.65   12.43
  Task 8    14.54   16.50   17.67   11.99
  Task 9    15.51   18.62   20.77   13.83
  Task 10   16.71   19.82   22.24   14.41
  Task 11   16.11   19.34   21.19   14.10
  Task 12   15.19   18.03   19.59   13.53

In the second example we consider 11 images from three scenes. The images reconstructed by the linear reconstruction, DP-MT, MT* and ST algorithms are shown in Figure 10; the reconstruction errors for all four methods are listed in Table 2. As expected, the multi-task CS inversion algorithms yield smaller reconstruction errors than the single-task algorithm. The clustering result is shown in Figure 11, in which images 1-4 and 9-11 are clustered together by DP-MT. However, recall the simple example considered in Figure 7: the DP-based algorithm seeks to share the underlying sparseness of the images, even though the images themselves may appear distinct. In fact, the results in Figure 11 motivated the simple example considered in Figure 7.

5. Conclusions

Hierarchical Dirichlet process (DP) priors are considered for the imposition of sparseness on the transform coefficients in the context of inverting multiple CS measurements.
An independent zero-mean Gaussian prior is placed on each transform coefficient of each CS task, and the task-dependent precision parameters are assumed drawn from a distribution G, where G is drawn from a Dirichlet process (DP); the base distribution of the DP is a product of Gamma distributions. The DP framework imposes the belief that many of the tasks may share underlying sparseness properties, and the objective is to cluster the CS measurements, where each cluster constitutes a particular form of sparseness. The DP formulation is non-parametric, in the sense that the number of clusters is not set a priori but is inferred from the data. A computationally efficient variational Bayesian inference has been considered on all model parameters. For all examples considered, the DP-MT CS inversion performed at least as well as ST CS inversion and CS inversion based on global sharing; when global sharing was inappropriate, the DP-based inversion was significantly better. In future research, we may consider correlation between spatially and spectrally adjacent transform coefficients and remove the assumption of exchangeability employed within the DP, which in practice may not hold.

Figure 10. CS reconstructions for the 11-task example: (a) linear, (b) DP-MT, (c) MT*, (d) ST.

Table 2. Reconstruction error (%) for the example in Figure 10.

            DP-MT    MT*      ST    Linear
  Task 1     6.50    7.79    8.31    4.78
  Task 2     6.41    7.76    8.23    4.77
  Task 3     6.89    8.12    8.81    5.01
  Task 4     6.86    8.32    9.23    5.15
  Task 5    15.81   18.17   19.79   15.39
  Task 6    15.09   17.70   19.74   14.49
  Task 7    15.91   18.66   20.36   15.18
  Task 8    14.74   17.13   18.88   14.06
  Task 9     7.74    8.87    8.77    6.10
  Task 10    8.05    9.16    9.58    5.72
  Task 11    8.50    9.97    9.62    6.72

Figure 11. Sharing mechanism for the 11 tasks in Figure 10 yielded by DP-MT CS.

Appendix: Update Equations in the VB DP-MT Algorithm

The updated hyper-parameters for the q(·) distributions in Sec. 3 are:

• ã = a + (1/2) Σ_{i=1}^{M} n_i and b̃ = b + (1/2) Σ_{i=1}^{M} [ tr(Φ_i Σ_i Φ_i^T) + (Φ_i μ_i − v_i)^T (Φ_i μ_i − v_i) ].

• ẽ = e + K − 1 and f̃ = f − Σ_{k=1}^{K−1} [ ψ(γ̃_{2k}) − ψ(γ̃_{1k} + γ̃_{2k}) ], where ψ(x) = d/dx log Γ(x) is the digamma function.

• γ̃_{1k} = 1 + Σ_{i=1}^{M} φ_{i,k} and γ̃_{2k} = ẽ/f̃ + Σ_{i=1}^{M} Σ_{l=k+1}^{K} φ_{i,l}, where φ_{i,k} = q(z_i = k).

• c̃_{k,j} = c + (1/2) Σ_{i=1}^{M} φ_{i,k} and d̃_{k,j} = d + (1/2) Σ_{i=1}^{M} φ_{i,k} ( Σ_{i,jj} + μ_{i,j}^2 ), where Σ_{i,jj} is the j-th diagonal element of Σ_i and μ_{i,j} is the j-th element of the vector μ_i.

• Σ_i = ( Σ_{k=1}^{K} φ_{i,k} Λ̃_k + (ã/b̃) Φ_i^T Φ_i )^{-1} and μ_i = (ã/b̃) Σ_i Φ_i^T v_i, where Λ̃_k = diag(c̃_{k,1}/d̃_{k,1}, ..., c̃_{k,m}/d̃_{k,m}) is an m × m diagonal matrix.

• φ_{i,k} = exp(ζ_{i,k}) / Σ_{l=1}^{K} exp(ζ_{i,l}), where
ζ_{i,k} = Σ_{l=1}^{k−1} [ ψ(γ̃_{2l}) − ψ(γ̃_{1l} + γ̃_{2l}) ] + ψ(γ̃_{1k}) − ψ(γ̃_{1k} + γ̃_{2k}) + (1/2) Σ_{j=1}^{m} [ ψ(c̃_{k,j}) − ln(d̃_{k,j}) ] − (1/2) [ tr(Σ_i Λ̃_k) + μ_i^T Λ̃_k μ_i ].
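A compact sketch of one pass of these coordinate updates, transcribed from the bullets above; it assumes the data are supplied as lists of (Φ_i, v_i), that all variational quantities have been initialized, and that the truncation K and hyper-parameters are given. It is an illustrative transcription rather than the authors' code, and the dictionary keys and function name are hypothetical.

```python
import numpy as np
from scipy.special import digamma

def vb_update_pass(Phis, vs, state, hyper):
    """One coordinate-ascent pass over the variational posteriors of the DP-MT model.

    state: dict with a_t, b_t (q(alpha0)), e_t, f_t (q(alpha)),
           g1, g2 (K-1 Beta parameters of q(gamma)),
           c_t, d_t (K x m Gamma parameters of q(Lambda_k)),
           mu (list of m-vectors), Sig (list of m x m covariances),
           phi (M x K responsibilities, phi[i, k] = q(z_i = k)).
    hyper: dict with keys a, b, c, d, e, f.
    """
    M = len(vs)
    K, m = state["c_t"].shape
    a, b, c, d, e, f = (hyper[k] for k in "abcdef")
    phi, mu, Sig = state["phi"], state["mu"], state["Sig"]

    # q(alpha0): noise precision
    state["a_t"] = a + 0.5 * sum(len(v) for v in vs)
    state["b_t"] = b + 0.5 * sum(
        np.trace(P @ S @ P.T) + np.sum((P @ u - v) ** 2)
        for P, v, u, S in zip(Phis, vs, mu, Sig))
    alpha0_mean = state["a_t"] / state["b_t"]

    # q(Lambda_k): per-cluster precision vectors
    state["c_t"] = c + 0.5 * phi.sum(axis=0)[:, None] * np.ones((K, m))
    second_moments = np.stack([np.diag(S) + u ** 2 for S, u in zip(Sig, mu)])   # M x m
    state["d_t"] = d + 0.5 * phi.T @ second_moments
    Lam_mean = state["c_t"] / state["d_t"]                                       # K x m

    # q(theta_i): Gaussian posterior for each task
    for i, (P, v) in enumerate(zip(Phis, vs)):
        prec = np.diag(phi[i] @ Lam_mean) + alpha0_mean * P.T @ P
        Sig[i] = np.linalg.inv(prec)
        mu[i] = alpha0_mean * Sig[i] @ P.T @ v
    second_moments = np.stack([np.diag(S) + u ** 2 for S, u in zip(Sig, mu)])    # refresh

    # q(gamma) and q(alpha): stick-breaking weights and DP concentration
    state["g1"] = 1.0 + phi[:, :-1].sum(axis=0)
    tail = np.cumsum(phi[:, ::-1], axis=1)[:, ::-1][:, 1:]    # sum_{l>k} phi[i, l]
    state["g2"] = state["e_t"] / state["f_t"] + tail.sum(axis=0)
    state["e_t"] = e + K - 1
    state["f_t"] = f - np.sum(digamma(state["g2"]) - digamma(state["g1"] + state["g2"]))

    # q(z_i): responsibilities via the softmax of zeta_{i,k}
    Elog_stick = digamma(state["g1"]) - digamma(state["g1"] + state["g2"])
    Elog_rest = digamma(state["g2"]) - digamma(state["g1"] + state["g2"])
    Elog_w = np.append(Elog_stick, 0.0) + np.concatenate(([0.0], np.cumsum(Elog_rest)))
    Elog_Lam = digamma(state["c_t"]) - np.log(state["d_t"])                      # K x m
    for i in range(M):
        quad = second_moments[i] @ Lam_mean.T          # tr(Sig_i Lam_k) + mu_i^T Lam_k mu_i
        zeta = Elog_w + 0.5 * Elog_Lam.sum(axis=1) - 0.5 * quad
        zeta -= zeta.max()
        phi[i] = np.exp(zeta) / np.exp(zeta).sum()
    return state
```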
References

Baron, D., Wakin, M. B., Duarte, M. F., Sarvotham, S., & Baraniuk, R. G. (2005). Distributed compressed sensing.

Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. Doctoral dissertation, Gatsby Computational Neuroscience Unit, University College London.

Blei, D., & Jordan, M. (2004). Variational methods for the Dirichlet process. ICML.

Candès, E. (2006). Compressive sensing. Proc. of ICM, 3, 1433-1452.

Candès, E., Romberg, J., & Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Inform. Theory, 52, 489-502.

Charilaos, C. (1999). JPEG2000 tutorial. ICIP.

Chen, S., Donoho, D. L., & Saunders, M. A. (1999). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20, 33-61.

Donoho, D. L. (2006). Compressed sensing. IEEE Trans. Information Theory, 52, 1289-1306.

Donoho, D. L., Tsaig, Y., Drori, I., & Starck, J.-C. (2006). Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit.

Escobar, M. D., & West, M. (1995). Bayesian density estimation and inference using mixtures. JASA, 90, 577-588.

Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Introducing Markov chain Monte Carlo. In Markov Chain Monte Carlo in Practice. London, U.K.: Chapman & Hall.

Ishwaran, H., & James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. JASA, 96, 161-173.

Ji, S., Dunson, D., & Carin, L. (2007a). Multi-task compressive sensing. Submitted to IEEE Trans. on Signal Processing.

Ji, S., Xue, Y., & Carin, L. (2007b). Bayesian compressive sensing. Accepted to IEEE Trans. on Signal Processing.

MacKay, D. J. C. (1994). Bayesian methods for backpropagation networks. In E. Domany, J. L. van Hemmen and K. Schulten (Eds.), Models of Neural Networks III. New York: Springer-Verlag.

Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer.

Sethuraman, J. (1994). A constructive definition of the Dirichlet prior. Statistica Sinica, 2, 639-650.

Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. JMLR, 1, 211-244.

Tropp, J. A., & Gilbert, A. C. (2005). Signal recovery from partial information via orthogonal matching pursuit.

Tsaig, Y., & Donoho, D. L. (2006). Extensions of compressed sensing. Signal Processing, 86, 549-571.

West, M., Müller, P., & Escobar, M. (1994). Hierarchical priors and mixture models with applications in regression and density estimation. In P. R. Freeman and A. F. M. Smith (Eds.), Aspects of Uncertainty, 363-386. John Wiley.

Wipf, D. P., & Rao, B. D. (2007). An empirical Bayesian strategy for solving the simultaneous sparse approximation problem. IEEE Trans. on Signal Processing, 55, 3704-3716.