Sparse Bayesian Nonparametric Regression

François Caron (caronfr@cs.ubc.ca), Arnaud Doucet (arnaud@cs.ubc.ca)
Departments of Computer Science and Statistics, University of British Columbia, Vancouver, Canada

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

One of the most common problems in machine learning and statistics consists of estimating the mean response $X\beta$ from a vector of observations $y$ assuming $y = X\beta + \varepsilon$, where $X$ is known, $\beta$ is a vector of parameters of interest and $\varepsilon$ a vector of stochastic errors. We are particularly interested here in the case where the dimension $K$ of $\beta$ is much higher than the dimension of $y$. We propose some flexible Bayesian models which can yield sparse estimates of $\beta$. We show that as $K \to \infty$ these models are closely related to a class of Lévy processes. Simulations demonstrate that our models significantly outperform a range of popular alternatives.

1. Introduction

Consider the following linear regression model

$y = X\beta + \varepsilon$   (1)

where $y \in \mathbb{R}^L$ is the observation, $\beta = (\beta_1, \ldots, \beta_K)^{\mathrm{T}} \in \mathbb{R}^K$ is the vector of unknown parameters and $X$ is a known $L \times K$ matrix. We assume that $\varepsilon$ follows a zero-mean normal distribution $\mathcal{N}(0, \sigma^2 I_L)$, where $I_L$ is the identity matrix of dimension $L$. We do not impose any restriction on $L$ and $K$, but we are particularly interested in the case where $K \gg L$. This scenario is very common in many application domains. In such cases, we are interested in obtaining a sparse estimate of $\beta$; that is, an estimate $\widehat{\beta} = (\widehat{\beta}_1, \ldots, \widehat{\beta}_K)^{\mathrm{T}}$ such that only a subset of the components $\widehat{\beta}_k$ differ from zero. This might be for the sake of variable selection (Tibshirani, 1996; Figueiredo, 2003; Griffin & Brown, 2007) or to decompose a signal over an overcomplete basis (Lewicki & Sejnowski, 2000; Chen et al., 2001).

Numerous models and algorithms have been proposed in the machine learning and statistics literature to address this problem, including Bayesian stochastic search methods based on the 'spike and slab' prior (West, 2003), the Lasso (Tibshirani, 1996), projection pursuit and the Relevance Vector Machine (RVM) (Tipping, 2001). We follow here a Bayesian approach where we set a prior distribution on $\beta$, and we primarily focus on the case where $\widehat{\beta}$ is the resulting Maximum a Posteriori (MAP) estimate or, equivalently, the Penalized Maximum Likelihood (PML) estimate. Such MAP/PML approaches have been discussed many times in the literature and include the Lasso (the corresponding prior being the Laplace distribution) (Tibshirani, 1996; Lewicki & Sejnowski, 2000; Girolami, 2001), the normal-Jeffreys (NJ) prior (Figueiredo, 2003) and the normal-exponential gamma prior (Griffin & Brown, 2007). Asymptotic theoretical properties of such PML estimates are discussed in (Fan & Li, 2001).

We propose here a class of prior distributions based on scale mixtures of Gaussians for $\beta$. For a finite $K$, our prior models correspond to normal-gamma (NG) and normal-inverse Gaussian (NIG) models. This class of models includes as limiting cases both the popular Laplace and normal-Jeffreys priors, but is more flexible. As $K \to \infty$, we show that the proposed priors are closely related to the variance gamma and normal-inverse Gaussian processes, which are Lévy processes (Applebaum, 2004). In this respect, our models are somewhat complementary to two recently proposed Bayesian nonparametric models: the Indian buffet process (Ghahramani et al., 2006) and the infinite gamma-Poisson process (Titsias, 2007). Under given conditions, the normal-gamma prior yields sparse MAP estimates $\widehat{\beta}$. The log-posterior distributions associated to these priors are not concave, but we propose an Expectation-Maximization (EM) algorithm to find modes of the posteriors and a Markov Chain Monte Carlo (MCMC) algorithm to sample from them. We demonstrate through simulations that these Bayesian models significantly outperform a range of established procedures on a variety of applications.

The rest of the paper is organized as follows. In Section 2, we propose the NG and NIG models for $\beta$. We establish some properties of these models for $K$ finite and in the asymptotic case where $K \to \infty$. We also relate our models to the Indian buffet process (Ghahramani et al., 2006) and the infinite gamma-Poisson process (Titsias, 2007). In Section 3, we establish conditions under which the MAP/PML estimate enjoys sparsity properties. Section 4 presents an EM algorithm to find modes of the posterior distributions and a Gibbs sampling algorithm to sample from them. We demonstrate the performance of our models and algorithms in Section 5. Finally, we discuss some extensions in Section 6.
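To make the MAP/PML framing concrete, the following minimal sketch (Python; all names and the illustrative values are our own, not the authors' code) evaluates the penalized objective whose minimizer defines $\widehat{\beta}$, here with the Laplace penalty $\lambda|\beta_k|$ corresponding to the Lasso case mentioned above.

```python
import numpy as np

def map_objective(beta, X, y, sigma, pen):
    """Negative log-posterior (up to an additive constant) for model (1)
    under an i.i.d. prior with penalty pen(beta_k) = -log p(beta_k)."""
    residual = y - X @ beta
    return residual @ residual / (2.0 * sigma**2) + np.sum(pen(beta))

# Illustrative setup: K >> L with a sparse true coefficient vector.
rng = np.random.default_rng(0)
L, K, sigma, lam = 50, 200, 1.0, 1.0
X = rng.standard_normal((L, K))
beta_true = np.zeros(K)
beta_true[[0, 1, 4]] = [3.0, 1.5, 2.0]
y = X @ beta_true + sigma * rng.standard_normal(L)

laplace_pen = lambda b: lam * np.abs(b)   # Laplace prior => Lasso penalty
print(map_objective(beta_true, X, y, sigma, laplace_pen))
```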
2. Sparse Bayesian Nonparametric Models

We consider models where the components of $\beta$ are independent and identically distributed,

$p(\beta) = \prod_{k=1}^{K} p(\beta_k)$,

and $p(\beta_k)$ is a scale mixture of Gaussians; that is,

$p(\beta_k) = \int \mathcal{N}(\beta_k; 0, \sigma_k^2)\, p(\sigma_k^2)\, d\sigma_k^2$   (2)

where $\mathcal{N}(x; \mu, \sigma^2)$ denotes the Gaussian distribution of argument $x$, mean $\mu$ and variance $\sigma^2$. We propose two conjugate distributions for $\sigma_k^2$; namely the gamma and the inverse Gaussian distributions. The resulting marginal distribution for $\beta_k$ belongs in both cases to the class of generalized hyperbolic distributions. In the models presented here, the unknown scale parameters are random and integrated out, so that the marginal priors on the regression coefficients are not Gaussian. This differs from the RVM (Tipping, 2001), where these parameters are unknown and estimated through maximum likelihood.

2.1. Normal-Gamma Model

2.1.1. Definition

Consider the following gamma prior distribution

$\sigma_k^2 \sim \mathcal{G}\left(\frac{\alpha}{K}, \frac{\gamma^2}{2}\right)$

whose probability density function (pdf) $\mathcal{G}(\sigma_k^2; \frac{\alpha}{K}, \frac{\gamma^2}{2})$ is given by

$\frac{(\gamma^2/2)^{\alpha/K}}{\Gamma(\alpha/K)}\, (\sigma_k^2)^{\alpha/K - 1} \exp\left(-\frac{\gamma^2}{2}\sigma_k^2\right)$.

Following Eq. (2), the marginal pdf of $\beta_k$ is given for $\beta_k \neq 0$ by

$p(\beta_k) = \frac{\gamma^{\alpha/K + 1/2}}{\sqrt{\pi}\, 2^{\alpha/K - 1/2}\, \Gamma(\alpha/K)}\, |\beta_k|^{\alpha/K - 1/2}\, \mathcal{K}_{\alpha/K - 1/2}(\gamma |\beta_k|)$   (3)

where $\mathcal{K}_{\nu}(\cdot)$ is the modified Bessel function of the second kind. We have

$\lim_{\beta_k \to 0} p(\beta_k) = \frac{\gamma\, \Gamma(\alpha/K - 1/2)}{2\sqrt{\pi}\, \Gamma(\alpha/K)}$ if $\alpha/K > \frac{1}{2}$, and $\infty$ otherwise,

and the tails of this distribution decrease as $|\beta_k|^{\alpha/K - 1} \exp(-\gamma |\beta_k|)$; see Figure 1(a). The parameters $\alpha$ and $\gamma$ respectively control the shape and scale of the distribution. When $\alpha \to 0$, there is a high discrepancy between the values of $\sigma_k^2$, while when $\alpha \to \infty$, most of the values are equal.

Figure 1. Probability density functions of the NG (a) and NIG (b) priors for different values of the parameters.

This class of priors includes many standard priors. Indeed, Eq. (3) reduces to the Laplace prior when $\alpha/K = 1$, and we obtain the NJ prior when $\alpha/K \to 0$ and $\gamma \to 0$. In Figure 2 some realizations of the process are given for $\alpha = 1, 5, 100$ and $\gamma^2/2 = \alpha$.

2.1.2. Properties

It follows from Eq. (3) that

$\mathbb{E}[|\beta_k|] = \frac{2\, \Gamma(\alpha/K + 1/2)}{\sqrt{\pi}\, \gamma\, \Gamma(\alpha/K)}, \qquad \mathbb{E}[\beta_k^2] = \frac{2\alpha}{K\gamma^2}$,

and we obtain

$\lim_{K \to \infty} \mathbb{E}\Big[\sum_{k=1}^{K} |\beta_k|\Big] = \frac{2\alpha}{\gamma}, \qquad \mathbb{E}\Big[\sum_{k=1}^{K} \beta_k^2\Big] = \frac{2\alpha}{\gamma^2}$.

Hence both sums remain bounded in expectation whatever the value of $K$.
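As a quick numerical check of the moment identities above, the sketch below (Python; the parameter choice $\gamma^2/2 = \alpha$ mirrors the setting used for Figure 2, everything else is our own naming) draws from the NG prior and verifies that $\mathbb{E}[\sum_k \beta_k^2] = 2\alpha/\gamma^2$ does not depend on $K$.

```python
import numpy as np

def sample_ng_prior(K, alpha, gamma, rng):
    """Draw beta_1..beta_K from the normal-gamma prior:
    sigma_k^2 ~ Gamma(shape=alpha/K, rate=gamma^2/2), beta_k | sigma_k^2 ~ N(0, sigma_k^2)."""
    sigma2 = rng.gamma(shape=alpha / K, scale=2.0 / gamma**2, size=K)
    return rng.normal(0.0, np.sqrt(sigma2))

rng = np.random.default_rng(1)
alpha, K = 5.0, 100
gamma = np.sqrt(2.0 * alpha)                 # gamma^2/2 = alpha, as in Figure 2
draws = np.array([sample_ng_prior(K, alpha, gamma, rng) for _ in range(20000)])

# Monte Carlo estimate of E[sum_k beta_k^2] versus the exact value 2*alpha/gamma^2,
# which equals 1 here and is the same for any K.
print(np.mean(np.sum(draws**2, axis=1)), 2 * alpha / gamma**2)
```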
Figure 2. Realizations (top) of $(\sigma_k^2)_{k=1,\ldots,K}$ and (bottom) of $(\beta_k)_{k=1,\ldots,K}$ from the NG model for $\alpha = 1, 5, 100$.

Using properties of the gamma distribution, it is possible to relate $\beta$ to a Lévy process known as the variance gamma process as $K \to \infty$. First consider a finite $K$. Let $\sigma_{(1)}^2 \geq \sigma_{(2)}^2 \geq \ldots \geq \sigma_{(K)}^2$ be the order statistics of the sequence $\sigma_1^2, \sigma_2^2, \ldots, \sigma_K^2$, and let $V_1, \ldots, V_K$ be random variables verifying the following (finite) stick-breaking construction

$\pi_k = V_k \prod_{j=1}^{k-1} (1 - V_j)$ with $V_k \sim \mathcal{B}\left(1 + \frac{\alpha}{K}, \alpha - \frac{k\alpha}{K}\right)$   (4)

where $\mathcal{B}$ is the Beta distribution. Finally, if $g \sim \mathcal{G}(\alpha, \gamma^2/2)$, then we can check that the order statistics $(\sigma_{(k)}^2)$ follow the same distribution as the order statistics of $(g\, \pi_k)$.

The characteristic function of $\beta_k$ is given by

$\phi_{\beta_k}(u) = \left(\frac{1}{1 - \mathrm{i}u/\gamma}\right)^{\alpha/K} \left(\frac{1}{1 + \mathrm{i}u/\gamma}\right)^{\alpha/K}$

and therefore $\beta_k \overset{d}{=} w_1 - w_2$ where $w_1 \sim \mathcal{G}(\frac{\alpha}{K}, \gamma)$ and $w_2 \sim \mathcal{G}(\frac{\alpha}{K}, \gamma)$. It follows that $\beta_k$ can be written as the difference of two variables following a gamma distribution.

As $K \to \infty$, the order statistics $(\sigma_{(k)}^2)$ are the ranked jumps of a gamma process with shape parameter $\alpha$ and scale parameter $\gamma^2/2$; see (Tsilevich et al., 2000) for details. In particular, $\sigma_{(k)}^2 = g\, \pi_{(k)}$ for $k = 1, 2, \ldots$, where $g$ and $(\pi_{(k)})$ are independent and respectively distributed according to $\mathcal{G}(\alpha, \gamma^2/2)$ and $\mathcal{PD}(\alpha)$, where $\mathcal{PD}(\alpha)$ is the Poisson-Dirichlet distribution of scale parameter $\alpha$. It is well known that this distribution can be recovered by the following (infinite) stick-breaking construction (Tsilevich et al., 2000): if we set

$\pi_k = V_k \prod_{j=1}^{k-1} (1 - V_j)$ with $V_j \sim \mathcal{B}(1, \alpha)$   (5)

for any $k$, then the order statistics $(\pi_{(k)})$ are distributed from the Poisson-Dirichlet distribution. The coefficients $(\beta_k)$ are thus nothing but the weights (jumps) of the so-called variance gamma process, which is a Brownian motion evaluated at times given by a gamma process (Applebaum, 2004; Madan & Seneta, 1990).

2.2. Normal-Inverse Gaussian Model

2.2.1. Definition

Consider the following inverse Gaussian prior distribution

$\sigma_k^2 \sim \mathcal{IG}\left(\frac{\delta}{K}, \gamma\right)$   (6)

whose pdf $\mathcal{IG}(\sigma_k^2; \frac{\delta}{K}, \gamma)$ is given by (Barndorff-Nielsen, 1997)

$\frac{\delta}{K\sqrt{2\pi}} \exp\left(\frac{\delta\gamma}{K}\right) (\sigma_k^2)^{-3/2} \exp\left(-\frac{1}{2}\Big(\frac{\delta^2}{K^2 \sigma_k^2} + \gamma^2 \sigma_k^2\Big)\right)$.   (7)

Following Eq. (2), the marginal pdf of $\beta_k$ is given by

$p(\beta_k) = \frac{\delta\gamma}{K\pi} \exp\left(\frac{\delta\gamma}{K}\right) \left(\frac{\delta^2}{K^2} + \beta_k^2\right)^{-1/2} \mathcal{K}_1\left(\gamma\sqrt{\frac{\delta^2}{K^2} + \beta_k^2}\right)$   (8)

and the tails of this distribution decrease as $|\beta_k|^{-3/2} \exp(-\gamma|\beta_k|)$. It is displayed in Figure 1(b). The parameters $\delta$ and $\gamma$ respectively control the shape and scale of the distribution. When $\delta \to 0$, there is a high discrepancy between the values of $\sigma_k^2$, while when $\delta \to \infty$, most of the values are equal. Some realizations of the model for different values of $\delta$ are represented in Figure 3.

2.2.2. Properties

The moments are given by

$\mathbb{E}[|\beta_k|] = \frac{2\delta}{K\pi} \exp\left(\frac{\delta\gamma}{K}\right) \mathcal{K}_0\left(\frac{\delta\gamma}{K}\right), \qquad \mathbb{E}[\beta_k^2] = \frac{\delta}{K\gamma}$.

Therefore, as $K \to \infty$, the expectation of the sum of the absolute values is infinite, while the expectation of the sum of the squares is $\delta/\gamma$. We can also establish in this case that the coefficients $(\beta_k)$ tend to the weights (jumps) of a normal-inverse Gaussian process (Barndorff-Nielsen, 1997).
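The analogous check for the NIG prior is sketched below (Python). We rely on the conversion, stated here as our own assumption, that the Barndorff-Nielsen $\mathcal{IG}(\delta/K, \gamma)$ distribution is an inverse Gaussian with mean $\delta/(K\gamma)$ and shape $(\delta/K)^2$, which is the parameterization of NumPy's `wald` sampler.

```python
import numpy as np

def sample_nig_prior(K, delta, gamma, rng):
    """Draw beta_1..beta_K from the normal-inverse Gaussian prior:
    sigma_k^2 ~ IG(delta/K, gamma), i.e. an inverse Gaussian with
    mean delta/(K*gamma) and shape (delta/K)^2 (assumed conversion)."""
    sigma2 = rng.wald(mean=delta / (K * gamma), scale=(delta / K) ** 2, size=K)
    return rng.normal(0.0, np.sqrt(sigma2))

rng = np.random.default_rng(2)
delta, gamma, K = 5.0, 5.0, 100
draws = np.array([sample_nig_prior(K, delta, gamma, rng) for _ in range(20000)])

# E[sum_k beta_k^2] = delta/gamma for every K (here 1.0), as in Section 2.2.2,
# whereas E[sum_k |beta_k|] grows without bound as K increases.
print(np.mean(np.sum(draws**2, axis=1)), delta / gamma)
```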
Figure 3. Realizations (top) of $(\sigma_k^2)_{k=1,\ldots,K}$ and (bottom) of $(\beta_k)_{k=1,\ldots,K}$ from the NIG model for $K = 100$, $\delta = 1, 5, 100$ and $\gamma = \delta$.

Figure 4. Realizations (top) of $(\sigma_k^2)_{k=1,\ldots,K}$ and (bottom) of $(\beta_k^n)_{n=1,\ldots,N,\,k=1,\ldots,K}$ from the normal-gamma model for $K = 100$, $N = 20$, $\alpha = 1, 5, 100$ and $\gamma^2/2 = \alpha$. The lighter the colour, the larger $|\beta_k^n|$.

2.3. Extension

Consider now the case where we have $N$ vectors of observations $\{y_n\}_{n=1}^N$, where $y_n \in \mathbb{R}^L$. We would like to model the fact that, for a given $k$, the random variables $\{\beta_k^n\}_{n=1}^N$ are statistically dependent and exchangeable. We consider the following hierarchical model:

$\sigma_k^2 \sim \mathcal{G}\left(\frac{\alpha}{K}, \frac{\gamma^2}{2}\right)$ or $\sigma_k^2 \sim \mathcal{IG}\left(\frac{\delta}{K}, \gamma\right)$ for $k = 1, \ldots, K$,

$\beta_k^n \sim \mathcal{N}(0, \sigma_k^2)$ for $n = 1, \ldots, N$.

Some realizations of the process for $\alpha = 1, 5, 100$ are represented in Figure 4.

In this respect, this work is complementary to two recently proposed Bayesian nonparametric models: the Indian buffet process (Ghahramani et al., 2006) and the infinite gamma-Poisson process (Titsias, 2007). In these two contributions, prior distributions over infinite matrices with integer-valued entries are defined. These models are constructed as the limits of finite-dimensional models based respectively on the beta-binomial and gamma-Poisson models. They enjoy the following property: while the number of non-zero entries of an (infinite) row is potentially infinite, the expected number of these entries is finite. These models are also closely related to the beta and gamma processes, which are Lévy processes (Applebaum, 2004; Teh et al., 2007; Thibaux & Jordan, 2007). Our models could be interpreted as prior distributions over infinite matrices with real-valued entries. In our case, the number of non-zero entries of an (infinite) row is always infinite, but we can have

$\lim_{K \to \infty} \mathbb{E}\Big[\sum_{k=1}^{K} |\beta_k^n|^{\ell}\Big] < \infty$   (9)

for $\ell = 1$ or $\ell = 2$. Moreover, for some values of the parameters we can also ensure that, for any $x > 0$,

$\lim_{K \to \infty} \Pr\big(\exists\, k : |\beta_k^n| > x\big) > 0$;   (10)

that is, there is still a non-vanishing probability of having coefficients with large values as $K \to \infty$ despite Eq. (9).

The joint distribution is given by $p(\beta_{1:K}^{1:N}) = \prod_{k=1}^{K} p(\beta_k^{1:N})$ where, for the NG model,

$p(\beta_k^{1:N}) \propto u_k^{\alpha/K - N/2}\, \mathcal{K}_{\alpha/K - N/2}(\gamma u_k)$

and, for the NIG model,

$p(\beta_k^{1:N}) \propto q_k^{-(N+1)/2}\, \mathcal{K}_{(N+1)/2}(\gamma q_k)$

where

$u_k = \sqrt{\sum_{n=1}^{N} (\beta_k^n)^2}, \qquad q_k = \sqrt{\frac{\delta^2}{K^2} + u_k^2}$.   (11)

3. Sparsity Properties

Further on, we will also use the following notation for any random variable $u$:

$\mathrm{pen}(u) \doteq -\log(p(u))$

where $\doteq$ denotes equality up to an additive constant independent of $u$.

Table 1. Penalizations $\mathrm{pen}(\beta_k^{1:N})$ and their derivatives $\mathrm{pen}'(\beta_k^{1:N})$ for different prior distributions.
Lasso ($N = 1$):  $\mathrm{pen} = \gamma|\beta_k|$;  $\mathrm{pen}' = \gamma$
NJ:  $\mathrm{pen} = N \log u_k$;  $\mathrm{pen}' = N/u_k$
NG:  $\mathrm{pen} = \left(\frac{N}{2} - \frac{\alpha}{K}\right) \log u_k - \log \mathcal{K}_{\alpha/K - N/2}(\gamma u_k)$;  $\mathrm{pen}' = \gamma\, \frac{\mathcal{K}_{\alpha/K - N/2 - 1}(\gamma u_k)}{\mathcal{K}_{\alpha/K - N/2}(\gamma u_k)}$
NIG:  $\mathrm{pen} = \frac{N+1}{2} \log q_k - \log \mathcal{K}_{(N+1)/2}(\gamma q_k)$;  $\mathrm{pen}' = \frac{(N+1) u_k}{q_k^2} + \gamma \frac{u_k}{q_k} \frac{\mathcal{K}_{(N-1)/2}(\gamma q_k)}{\mathcal{K}_{(N+1)/2}(\gamma q_k)}$
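The NG row of Table 1 can be evaluated directly with a standard Bessel-function routine. The sketch below (Python/SciPy; names ours) computes $\mathrm{pen}$ and $\mathrm{pen}'$ and checks, on a grid, the thresholding condition of (Fan & Li, 2001) discussed below, for the values of $\alpha/K$ used in Figure 6(a).

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind

def ng_pen(u, alpha_K, gamma, N=1):
    """Normal-gamma penalty of Table 1 (up to an additive constant)."""
    nu = alpha_K - N / 2.0
    return (N / 2.0 - alpha_K) * np.log(u) - np.log(kv(nu, gamma * u))

def ng_pen_deriv(u, alpha_K, gamma, N=1):
    """Derivative of the normal-gamma penalty with respect to u_k (Table 1)."""
    nu = alpha_K - N / 2.0
    return gamma * kv(nu - 1.0, gamma * u) / kv(nu, gamma * u)

# Sufficient condition for a thresholding rule (Fan & Li, 2001, see below):
# min_u {u + pen'(u)} strictly positive. It stays bounded away from zero for
# alpha/K <= 1, whereas it collapses towards 0 as u -> 0 when alpha/K = 2.
u = np.linspace(1e-3, 3.0, 2000)
for alpha_K in (0.1, 0.75, 1.0, 2.0):       # legend values of Figure 6(a)
    m = np.min(u + ng_pen_deriv(u, alpha_K, gamma=1.0))
    print(f"alpha/K = {alpha_K}: min of |beta| + pen'(|beta|) = {m:.3f}")
```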
When computing the MAP/PML estimate for $N$ data, we select

$\widehat{\beta}^{1:N} = \arg\min_{\beta^{1:N}} \sum_{n=1}^{N} \frac{\| y_n - X\beta^n \|^2}{2\sigma^2} + \sum_{k=1}^{K} \mathrm{pen}(\beta_k^{1:N})$.   (12)

We give in Table 1 the penalizations $\mathrm{pen}(\beta_k^{1:N})$ and their derivatives for different prior distributions, as functions of $u_k$ and $q_k$ defined in Eq. (11). When $\alpha/K = 1$, the NG prior is equal to the Laplace prior, so its penalization reduces to the $\ell_1$ penalization used in the Lasso and basis pursuit (Tibshirani, 1996; Chen et al., 2001). When $\alpha/K \to 0$ and $\gamma \to 0$, the prior is the NJ prior and the penalization reduces to $\log(|\beta_k|)$, which has been used in (Figueiredo, 2003). We display in Figure 5 the contours of constant value for various prior distributions when $N = 1$ and $K = 2$. For $\alpha/K < 1/2$, the MAP estimate (12) does not exist as the pdf (3) is unbounded. For other values of the parameters, a mode can dominate at zero, whereas we are interested in the data-driven turning point/local minimum (Griffin & Brown, 2007).

Figure 5. Contours of constant value of $\mathrm{pen}(\beta_1) + \mathrm{pen}(\beta_2)$ for the Laplace (a), normal-Jeffreys (b), normal-gamma (c) and normal-inverse Gaussian (d) priors, for $\alpha/K = 0.01, 0.75, 2$ (NG) and $\delta/K = 0.01, 0.25, 2$ (NIG).

Consider now the case where the matrix $X$ is orthogonal, $\sigma = 1$ and $N = 1$. The turning point and/or MAP/PML estimate is obtained by minimizing Eq. (12), which is equivalent to minimizing componentwise

$\frac{1}{2}(z_k - \beta_k)^2 + \mathrm{pen}(\beta_k)$   (13)

where $z = X^{\mathrm{T}} y$. The first derivative of (13) is $\mathrm{sign}(\beta_k)\,\big(|\beta_k| + \mathrm{pen}'(|\beta_k|)\big) - z_k$. As stated in (Fan & Li, 2001, p. 1350), a sufficient condition for the estimate to be a thresholding rule is that the minimum of the function $|\beta_k| + \mathrm{pen}'(|\beta_k|)$ is strictly positive. Plots of the function $|\beta_k| + \mathrm{pen}'(|\beta_k|)$ are given in Figure 6, and the resulting thresholds, corresponding to the argument minimizing (13), are presented in Figure 7. It follows that the normal-gamma prior is a thresholding rule for $\alpha/K \leq 1$ and yields sparse estimates. The normal-inverse Gaussian prior is not a thresholding rule, as the derivative of its penalization is 0 when $\beta_k = 0$ whatever the values of the parameters. However, from Figure 7(d), it is clear that it can yield "almost sparse" estimates; that is, most components are such that $\widehat{\beta}_k \approx 0$.

Figure 6. Plots of $|\beta_k| + \mathrm{pen}'(|\beta_k|)$ for the normal-gamma (a) and normal-inverse Gaussian (b) priors, for $\alpha/K$ (resp. $\delta/K$) $= 0.1, 0.75, 1, 2$.

Figure 7. Thresholds for the different prior distributions: Laplace (a), normal-Jeffreys (b), normal-gamma (c) and normal-inverse Gaussian (d).

4. Algorithms

4.1. EM

The log-posterior in Eq. (12) is not concave, but we can use the EM algorithm to find modes of it. The EM algorithm relies on the introduction of the missing data $\sigma_{1:K}^2 = (\sigma_1^2, \ldots, \sigma_K^2)$. Conditional upon these missing data, the regression model is linear Gaussian and all the EM quantities can be easily computed in closed form; see for example (Figueiredo, 2003; Griffin & Brown, 2007). We have at iteration $i+1$ of the EM

$\beta_{(i+1)}^{1:N} = \arg\max_{\beta^{1:N}} Q(\beta^{1:N}; \beta_{(i)}^{1:N})$

where $Q(\beta^{1:N}; \beta_{(i)}^{1:N})$ is given by

$\int \log\big(p(\beta^{1:N} \mid y_{1:N}, \sigma_{1:K}^2)\big)\, p(\sigma_{1:K}^2 \mid \beta_{(i)}^{1:N}, y_{1:N})\, d\sigma_{1:K}^2$.

After a few calculations, we obtain

$\beta_{(i+1)}^{n} = \big(\sigma^2 V_{(i)} + X^{\mathrm{T}} X\big)^{-1} X^{\mathrm{T}} y_n$

with

$V_{(i)} = \mathrm{diag}\big(m_{1,(i)}, \ldots, m_{K,(i)}\big), \qquad m_{k,(i)} = \frac{\mathrm{pen}'(u_{k,(i)})}{u_{k,(i)}}$

where $u_{k,(i)} = \sqrt{\sum_{n=1}^{N} (\beta_{k,(i)}^n)^2}$ and $\mathrm{pen}'(u_{k,(i)})$ is the derivative of the penalization evaluated at $u_{k,(i)}$ (see Table 1).
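The M-step above amounts to a ridge-type regression with coefficient-specific weights $m_{k,(i)}$. Below is a minimal Python sketch for the NG penalty; the names, the exponentially scaled Bessel function `kve` used for numerical stability, and the least-squares initialization are our own choices, not the authors' code.

```python
import numpy as np
from scipy.special import kve  # exponentially scaled K; the scaling cancels in the ratio

def ng_weight(u, alpha_K, gamma, N):
    """m_k = pen'(u_k)/u_k for the NG prior, i.e. E[1/sigma_k^2 | beta_k^{1:N}]."""
    nu = alpha_K - N / 2.0
    return gamma * kve(nu - 1.0, gamma * u) / (kve(nu, gamma * u) * u)

def em_map_ng(Y, X, sigma, alpha, gamma, n_iter=100, eps=1e-10):
    """EM iterations towards a mode of the posterior under the NG prior.
    Y has shape (N, L), one row per observation vector y_n; returns B of shape (N, K)."""
    N, K = Y.shape[0], X.shape[1]
    B = np.linalg.lstsq(X, Y.T, rcond=None)[0].T   # crude initialization (our choice)
    XtX, XtY = X.T @ X, X.T @ Y.T                  # XtY has shape (K, N)
    for _ in range(n_iter):
        u = np.sqrt(np.sum(B**2, axis=0)) + eps    # u_k = sqrt(sum_n (beta_k^n)^2)
        V = np.diag(ng_weight(u, alpha / K, gamma, N))
        B = np.linalg.solve(sigma**2 * V + XtX, XtY).T
    return B
```

For $N = 1$ this is an iteratively reweighted ridge regression; with the NIG penalty, only the weight function changes (Table 1).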
4.2. MCMC

We can also easily sample from the posterior distribution $p(\beta^{1:N} \mid y_{1:N})$ by sampling from $p(\beta^{1:N}, \sigma_{1:K}^2 \mid y_{1:N})$ using the Gibbs sampler. Indeed, the full conditional distributions $p(\beta^{1:N} \mid \sigma_{1:K}^2, y_{1:N})$ and $p(\sigma_{1:K}^2 \mid \beta^{1:N}, y_{1:N})$ are available in closed form. The distribution $p(\beta^{1:N} \mid \sigma_{1:K}^2, y_{1:N})$ is a multivariate normal, whereas we have $p(\sigma_{1:K}^2 \mid \beta^{1:N}, y_{1:N}) = \prod_{k=1}^{K} p(\sigma_k^2 \mid \beta_k^{1:N})$. For the NG prior, we obtain

$p(\sigma_k^2 \mid \beta_k^{1:N}) = \frac{(\gamma/u_k)^{\alpha/K - N/2}}{2\, \mathcal{K}_{\alpha/K - N/2}(\gamma u_k)}\, (\sigma_k^2)^{\alpha/K - N/2 - 1} \exp\left(-\frac{1}{2}\Big(\frac{u_k^2}{\sigma_k^2} + \gamma^2 \sigma_k^2\Big)\right)$

which is a generalized inverse Gaussian distribution from which we can sample exactly. For the NIG prior, we also obtain a generalized inverse Gaussian distribution.

5. Applications

5.1. Simulated Data

In the following, we provide numerical comparisons between the Laplace (that is, Lasso), RVM, NJ, NG and NIG models. We simulate 100 datasets from (1) with $L = 50$ and $\sigma = 1$. The correlation between $X_{k,i}$ and $X_{k,j}$ is $\rho^{|i-j|}$ with $\rho = 0.5$. We set $\beta = (3\;\; 1.5\;\; 0\;\; 0\;\; 2\;\; 0\;\; 0 \ldots)^{\mathrm{T}} \in \mathbb{R}^K$, where the remaining components of the vector are set to zero. We consider the cases where $K = 20, 60, 100, 200$. Parameters of the Lasso, NG and NIG are estimated by 5-fold cross-validation, as described in (Tibshirani, 1996). The Lasso estimate is obtained with the Matlab implementation of the interior point method downloadable at http://www.stanford.edu/~boyd/l1_ls/. For the other priors, the estimate is obtained via 100 iterations of the EM algorithm.

Box plots of the mean square error (MSE) are reported in Figure 8. These plots show that the estimators based on the NG and NIG priors outperform the classical models in that case. Figure 9 shows box plots of the number of estimated coefficients whose absolute value is below a threshold $T$, for $T = 10^{-10}$ (the precision tuned for the Lasso estimate) and $T = 10^{-3}$, with $K = 200$. The true number of zeros in that case is 197. The NG outperforms the other models in identifying the zeros of the model. On the contrary, as the NIG estimate is not a thresholding rule, the median number of coefficients whose absolute value is below $10^{-10}$ for this model is zero. However, most of the coefficients have a very low absolute value, as the median number of coefficients with absolute value below $10^{-3}$ is equal to the true value 197 (see Figure 9(b)). Moreover, the estimator obtained by thresholding to zero the coefficients whose absolute value is below $10^{-3}$ yields very minor differences in terms of MSE.

Figure 8. Box plots of the MSE associated to the simulated data for $K = 20, 60, 100, 200$ (Lasso, RVM, NJ, NG, NIG).

Figure 9. Box plots of the number of estimated coefficients whose absolute value is below a threshold $T$, for $T = 10^{-10}$ (a) and $T = 10^{-3}$ (b). The dashed line represents the true number of zero coefficients (197).
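For reference, here is a sketch of how one synthetic dataset of the kind used above can be generated (Python). Drawing the rows of $X$ from a zero-mean Gaussian with covariance $\rho^{|i-j|}$ is our reading of the stated correlation structure, and all names are ours.

```python
import numpy as np

def simulate_dataset(K, L=50, rho=0.5, sigma=1.0, rng=None):
    """One synthetic dataset of Section 5.1: correlated design, sparse beta,
    Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng() if rng is None else rng
    idx = np.arange(K)
    cov = rho ** np.abs(np.subtract.outer(idx, idx))   # corr(i, j) = rho^|i-j|
    X = rng.multivariate_normal(np.zeros(K), cov, size=L)
    beta = np.zeros(K)
    beta[[0, 1, 4]] = [3.0, 1.5, 2.0]                  # beta = (3, 1.5, 0, 0, 2, 0, ...)
    y = X @ beta + sigma * rng.standard_normal(L)
    return X, y, beta

X, y, beta_true = simulate_dataset(K=200, rng=np.random.default_rng(3))
```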
5.2. Biscuit NIR Dataset

We consider the biscuits data, which have been studied in (Griffin & Brown, 2007; West, 2003). The matrix $X$ is composed of 300 (centered) NIR reflectance measurements from 70 biscuit dough pieces. The observations $y$ are the percentages of fat, sucrose, flour and water associated to each piece. The objective here is to predict the level of each of the ingredients from the NIR reflectance measurements. The data are divided into a training dataset (39 measurements) and a test dataset (31 measurements). The fitted coefficients for fat and flour, using 5-fold cross-validation, are represented in Figure 10. The estimated spikes are consistent with the results obtained in (West, 2003; Griffin & Brown, 2007). In particular, both models detect a spike at 1726 nm, which lies in a region known for fat absorbance. The predicted observations versus the true observations are given in Figure 11 for the training and test datasets. The test data are well fitted by the estimated coefficients. MSE errors for the test dataset are reported in Table 2. The proposed models show better performance for flour and similar performance for fat.

Figure 10. Coefficients estimated with a normal-gamma (left) and normal-inverse Gaussian (right) prior for the fat (top) and flour (bottom) ingredients, plotted against wavelength (nm).

Table 2. MSE for the biscuits NIR data.
        NJ     RVM    NG     NIG
Flour   9.93   6.48   3.44   1.94
Fat     0.56   0.56   0.55   0.49

Figure 11. Observations versus predicted observations estimated with a normal-gamma (left) and normal-inverse Gaussian (right) prior for the fat (top) and flour (bottom) ingredients, for the training and test data.

6. Discussion

We have presented some flexible priors for linear regression based on the NG and NIG models. The NG prior yields sparse local maxima of the posterior distribution, whereas the NIG prior yields "almost sparse" estimates; that is, most of the coefficients are extremely close to zero. We have shown that asymptotically these models are closely related to the variance gamma process and the normal-inverse Gaussian process. Contrary to the NJ model or the RVM, these models require specifying two hyperparameters. However, using a simple cross-validation procedure, we have demonstrated that these models can perform significantly better than well-established procedures. In particular, the experimental performance of the NIG model is surprisingly good and deserves to be studied further.

The NG prior has been discussed in (Griffin & Brown, 2007). It was discarded because of its spike at zero and the flatness of the penalty for large values, but no simulations were provided. They favour another model which relies on a parabolic cylinder function.¹ The NG prior has nonetheless interesting asymptotic properties in terms of Lévy processes, and we have demonstrated its empirical performance. The NG, NIG and Laplace priors can also be considered as particular cases of generalized hyperbolic distributions. This class of distributions has been used in (Snoussi & Idier, 2006) for blind source separation.

¹ The authors provide a link to a program to compute this function. Unfortunately, it is extremely slow. The resulting algorithm is at least one order of magnitude slower than our algorithms, which rely on Bessel functions.

The extension to (probit) classification is straightforward by adding latent variables corresponding to the regression function plus some normal noise. Computationally, it only requires adding one line in the EM algorithm and one simulation step in the Gibbs sampler.
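One way to realize the probit extension mentioned above is the usual latent-Gaussian data augmentation; the sketch below (Python, names ours) shows the single extra Gibbs step it requires, assuming binary labels $t_l \in \{0, 1\}$ with $\Pr(t_l = 1) = \Phi(x_l^{\mathrm{T}}\beta)$.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_probit_latents(X, beta, t):
    """Extra Gibbs step for the probit extension: draw z_l = x_l' beta + noise,
    truncated to (0, inf) when t_l = 1 and to (-inf, 0] when t_l = 0.
    The sampled z then plays the role of y in the existing Gaussian updates."""
    mu = X @ beta
    lower = np.where(t == 1, -mu, -np.inf)   # truncation bounds in standardized units
    upper = np.where(t == 1, np.inf, -mu)
    return truncnorm.rvs(lower, upper, loc=mu, scale=1.0)
```

One way to read the "one line in the EM algorithm" is to replace $y$ in the E-step by the conditional expectation of these truncated Gaussians.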
References

Applebaum, D. (2004). Lévy processes and stochastic calculus. Cambridge University Press.

Barndorff-Nielsen, O. (1997). Normal inverse Gaussian distributions and stochastic volatility modelling. Scandinavian Journal of Statistics, 24, 1-13.

Chen, S., Donoho, D., & Saunders, M. (2001). Atomic decomposition by basis pursuit. SIAM Review, 43, 129-159.

Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-1360.

Figueiredo, M. (2003). Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1150-1159.

Ghahramani, Z., Griffiths, T., & Sollich, P. (2006). Bayesian nonparametric latent feature models. Proceedings of the Valencia/ISBA World Meeting on Bayesian Statistics.

Girolami, M. (2001). A variational method for learning sparse and overcomplete representations. Neural Computation, 13, 2517-2532.

Griffin, J., & Brown, P. (2007). Bayesian adaptive lasso with non-convex penalization (Technical Report). Dept of Statistics, University of Warwick.

Lewicki, M. S., & Sejnowski, T. (2000). Learning overcomplete representations. Neural Computation, 12, 337-365.

Madan, D., & Seneta, E. (1990). The variance-gamma model for share market returns. Journal of Business, 63, 511-524.

Snoussi, H., & Idier, J. (2006). Bayesian blind separation of generalized hyperbolic processes in noisy and underdeterminate mixtures. IEEE Transactions on Signal Processing, 54, 3257-3269.

Teh, Y., Gorur, D., & Ghahramani, Z. (2007). Stick-breaking construction for the Indian buffet process. International Conference on Artificial Intelligence and Statistics.

Thibaux, R., & Jordan, M. (2007). Hierarchical beta processes and the Indian buffet process. International Conference on Artificial Intelligence and Statistics.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society B, 58, 267-288.

Tipping, M. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211-244.

Titsias, M. (2007). The infinite gamma-Poisson feature model. International Conference on Neural Information Processing Systems.

Tsilevich, N., Vershik, A., & Yor, M. (2000). Distinguished properties of the gamma process, and related topics (Technical Report). Laboratoire de Probabilités et Modèles aléatoires, Paris.

West, M. (2003). Bayesian factor regression models in the "Large p, Small n" paradigm. In J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith and M. West (Eds.), Bayesian Statistics 7, 723-732. Oxford University Press.