A Stick-Breaking Construction of the Beta Process

John Paisley¹ jwp4@ee.duke.edu
Aimee Zaas² aimee.zaas@duke.edu
Christopher W. Woods² woods004@mc.duke.edu
Geoffrey S. Ginsburg² ginsb005@duke.edu
Lawrence Carin¹ lcarin@ee.duke.edu
¹ Department of ECE, ² Duke University Medical Center, Duke University, Durham, NC

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

We present and derive a new stick-breaking construction of the beta process. The construction is closely related to a special case of the stick-breaking construction of the Dirichlet process (Sethuraman, 1994) applied to the beta distribution. We derive an inference procedure that relies on Monte Carlo integration to reduce the number of parameters to be inferred, and present results on synthetic data, the MNIST handwritten digits data set and a time-evolving gene expression data set.

1. Introduction

The Dirichlet process (Ferguson, 1973) is a powerful Bayesian nonparametric prior for mixture models. There are two principal methods for drawing from this infinite-dimensional prior: (i) the Chinese restaurant process (Blackwell & MacQueen, 1973), in which samples are drawn from a marginalized Dirichlet process and implicitly construct the prior; and (ii) the stick-breaking process (Sethuraman, 1994), which is a fully Bayesian construction of the Dirichlet process. Similarly, the beta process (Hjort, 1990) has recently received significant use as a nonparametric prior for latent factor models (Ghahramani et al., 2007; Thibaux & Jordan, 2007). This infinite-dimensional prior can be drawn via marginalization using the Indian buffet process (Griffiths & Ghahramani, 2005), where samples again construct the prior. However, unlike the Dirichlet process, a fully Bayesian stick-breaking construction of the beta process has yet to be derived (though related methods exist, reviewed in Section 2).

To review, a Dirichlet process, $G$, can be constructed according to the following stick-breaking process (Sethuraman, 1994; Ishwaran & James, 2001),
$$G = \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1}(1-V_j)\,\delta_{\theta_i}, \qquad V_i \overset{iid}{\sim} \text{Beta}(1,\alpha), \qquad \theta_i \overset{iid}{\sim} G_0 \qquad (1)$$
This stick-breaking process is so called because proportions, $V_i$, are sequentially broken from the remaining length, $\prod_{j=1}^{i-1}(1-V_j)$, of a unit-length stick. This produces a probability (or weight), $V_i \prod_{j=1}^{i-1}(1-V_j)$, that can be visually represented as one of an infinite number of contiguous sections cut out of a unit-length stick. As $i$ increases, these weights stochastically decrease, since smaller and smaller fractions of the stick remain, and so only a small number of the infinite number of weights have appreciable value. By construction, these weights occur first, which allows for practical implementation of this prior.

The contribution of this paper is the derivation of a stick-breaking construction of the beta process. We use a little-known property of the constructive definition in (Sethuraman, 1994), which is equally applicable to the beta distribution, a two-dimensional Dirichlet distribution. The construction presented here will be seen to result from an infinite collection of these stick-breaking constructions of the beta distribution.

The paper is organized as follows. In Section 2, we review the beta process, the stick-breaking construction of the beta distribution, as well as related work in this area. In Section 3, we present the stick-breaking construction of the beta process and its derivation. We derive an inference procedure for the construction in Section 4 and present experimental results on synthetic, MNIST digits and gene expression data in Section 5.
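To make the recursion in (1) concrete, the following is a minimal Python sketch of a truncated stick-breaking sampler for the Dirichlet process. It is our illustration, not part of the paper; the truncation level and function names are assumptions.

```python
import numpy as np

def dp_stick_breaking(alpha, base_sampler, truncation=1000, rng=None):
    """Truncated sketch of the DP stick-breaking construction in (1).

    Returns atoms theta_i ~ G_0 and weights V_i * prod_{j<i}(1 - V_j),
    with V_i ~ Beta(1, alpha). The truncation is an approximation; the
    construction itself is infinite.
    """
    rng = np.random.default_rng() if rng is None else rng
    V = rng.beta(1.0, alpha, size=truncation)               # stick proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    weights = V * remaining                                  # section cut at step i
    atoms = np.array([base_sampler(rng) for _ in range(truncation)])
    return atoms, weights

# Example: G_0 = N(0, 1), alpha = 2
atoms, weights = dp_stick_breaking(2.0, lambda rng: rng.normal())
print(weights[:5], weights.sum())  # weights stochastically decrease; sum approaches 1
```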
2. The Beta Process

Let $H_0$ be a continuous measure on the space $(\Omega, \mathcal{B})$ and let $H_0(\Omega) = \gamma$. Also, let $\alpha$ be a positive scalar and define the process $H_K$ as follows,
$$H_K = \sum_{k=1}^{K} \pi_k \delta_{\theta_k}, \qquad \pi_k \overset{iid}{\sim} \text{Beta}\left(\frac{\alpha\gamma}{K},\, \alpha\left(1 - \frac{\gamma}{K}\right)\right), \qquad \theta_k \overset{iid}{\sim} \frac{1}{\gamma} H_0 \qquad (2)$$
then as $K \to \infty$, $H_K \to H$ and $H$ is a beta process, which we denote $H \sim \text{BP}(\alpha H_0)$. We avoid a complete measure-theoretic definition, since the stick-breaking construction to be presented is derived in reference to the limit of (2).

That $H$ is a beta process can be shown in the following way: integrating out $\pi^{(K)} = (\pi_1, \dots, \pi_K)^T \in (0,1)^K$, letting $K \to \infty$ and sampling from this marginal distribution produces the two-parameter extension of the Indian buffet process discussed in (Thibaux & Jordan, 2007), which is shown to have the beta process as its underlying de Finetti mixing distribution.

Before deriving the stick-breaking construction of the beta process, we review a property of the beta distribution that will be central to the construction. We also review related work to distinguish the presented construction from other constructions in the literature.

2.1. A Construction of the Beta Distribution

The constructive definition of a Dirichlet prior derived in (Sethuraman, 1994) applies to more than the infinite-dimensional Dirichlet process. In fact, it is applicable to Dirichlet priors of any dimension, of which the beta distribution can be viewed as a special, two-dimensional case.¹ Focusing on this special case, Sethuraman showed that one can sample
$$\pi \sim \text{Beta}(a, b) \qquad (3)$$
according to the following stick-breaking construction,
$$\pi = \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1}(1-V_j)\, I(Y_i = 1), \qquad V_i \overset{iid}{\sim} \text{Beta}(1, a+b), \qquad Y_i \overset{iid}{\sim} \text{Bernoulli}\left(\frac{a}{a+b}\right) \qquad (4)$$
where $I(\cdot)$ denotes the indicator function.

In this construction, weights are drawn according to the standard stick-breaking construction of the DP (Ishwaran & James, 2001), as well as their respective locations, which are independent of the weights and iid among themselves. The major difference is that the set of locations is finite, 0 or 1, which results in more than one term being active in the summation. Space restrictions prohibit an explicit proof of this construction here, but we note that Sethuraman implicitly proves this in the following way: using notation from (Sethuraman, 1994), let the space, $\mathcal{X} = \{0, 1\}$, and the prior measure, $\alpha$, be $\alpha(1) = a$, $\alpha(0) = b$, and therefore $\alpha(\mathcal{X}) = a + b$. Carrying out the proof in (Sethuraman, 1994) for this particular space and measure yields (4). We note that this parametrization is different from that in (2).

¹ We thank Jayaram Sethuraman for his valuable correspondence regarding his constructive definition.
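As an aside (our illustration, not part of the paper), the construction in (4) is easy to check empirically by truncating the infinite sum and comparing moments against direct Beta(a, b) draws.

```python
import numpy as np

def beta_via_stick_breaking(a, b, truncation=500, rng=None):
    """Truncated sketch of Sethuraman's construction (4) of Beta(a, b).

    pi = sum_i V_i * prod_{j<i}(1 - V_j) * I(Y_i = 1), with
    V_i ~ Beta(1, a+b) and Y_i ~ Bernoulli(a / (a+b)).
    """
    rng = np.random.default_rng() if rng is None else rng
    V = rng.beta(1.0, a + b, size=truncation)
    Y = rng.random(truncation) < a / (a + b)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    return np.sum(V * remaining * Y)

rng = np.random.default_rng(0)
draws = np.array([beta_via_stick_breaking(2.0, 3.0, rng=rng) for _ in range(5000)])
print(draws.mean(), draws.var())        # empirical moments
print(2.0 / 5.0, 6.0 / (25 * 6))        # Beta(2,3) mean 0.4 and variance 0.04
```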
2.2. Related Work

There are three related constructions in the machine learning literature, each of which differs significantly from that presented here. The first construction, proposed by (Teh et al., 2007), is presented specifically for the Indian buffet process (IBP) prior. The generative process from which the IBP and this construction are derived replaces the beta distribution in (2) with $\text{Beta}(\frac{\alpha}{K}, 1)$. This small change greatly facilitates this construction, since the parameter 1 in $\text{Beta}(\frac{\alpha}{K}, 1)$ allows for a necessary simplification of the beta distribution. This construction does not extend to the two-parameter generalization of the IBP (Ghahramani et al., 2007), which is equivalent in the infinite limit to the marginalized representation in (2).

A second method for drawing directly from the beta process prior has been presented in (Thibaux & Jordan, 2007), and more recently in (Teh & Görür, 2009) as a special case of a more general power-law representation of the IBP. In this representation, no stick-breaking of the form in (1) takes place; rather, the weight for each location is simply beta-distributed, as opposed to the usual function of multiple beta-distributed random variables. The derivation relies heavily upon connecting the marginalized process to the fully Bayesian representation, which does not factor into the similar derivation for the DP (Sethuraman, 1994). This of course does not detract from the result, which appears to have a simpler inference procedure than that presented here.

A third representation presented in (Teh & Görür, 2009), based on the inverse Lévy method (Wolpert & Ickstadt, 1998), exists in theory only and does not simplify to an analytic stick-breaking form. See (Damien et al., 1996; Lee & Kim, 2004) for two approximate methods for sampling from the beta process.

3. A Stick-Breaking Construction of the Beta Process

We now define and briefly discuss the stick-breaking construction of the beta process, followed by its derivation. Let $\alpha$ and $H_0$ be defined as in (2). If $H$ is constructed according to the following process,
$$H = \sum_{i=1}^{\infty} \sum_{j=1}^{C_i} V_{ij}^{(i)} \prod_{l=1}^{i-1}\left(1 - V_{ij}^{(l)}\right) \delta_{\theta_{ij}}, \qquad C_i \overset{iid}{\sim} \text{Poisson}(\gamma), \qquad V_{ij}^{(l)} \overset{iid}{\sim} \text{Beta}(1, \alpha), \qquad \theta_{ij} \overset{iid}{\sim} \frac{1}{\gamma} H_0 \qquad (5)$$
then $H \sim \text{BP}(\alpha H_0)$. Since the first row of (5) may be unclear at first sight, we expand it for the first few values of $i$ below,
$$H = \sum_{j=1}^{C_1} V_{1j}^{(1)} \delta_{\theta_{1j}} + \sum_{j=1}^{C_2} V_{2j}^{(2)}\left(1 - V_{2j}^{(1)}\right) \delta_{\theta_{2j}} + \sum_{j=1}^{C_3} V_{3j}^{(3)}\left(1 - V_{3j}^{(2)}\right)\left(1 - V_{3j}^{(1)}\right) \delta_{\theta_{3j}} + \cdots \qquad (6)$$
For each value of $i$, which we refer to as a "round," there are $C_i$ atoms, where $C_i$ is itself random and drawn from $\text{Poisson}(\gamma)$. Therefore, every atom is defined by two subscripts, $(i, j)$. The mass associated with each atom in round $i$ is equal to the $i$th break from an atom-specific stick, where the stick-breaking weights follow a $\text{Beta}(1, \alpha)$ stick-breaking process (as in (1)). Superscripts are used to index the $i$ random variables that construct the weight on atom $\theta_{ij}$. Since the number of breaks from the unit-length stick prior to obtaining a weight increases with each level in (6), the weights stochastically decrease as $i$ increases, in a similar manner as in the stick-breaking construction of the Dirichlet process (1).

Since the expectation of the mass on the $k$th atom drawn overall does not simplify to a compact and transparent form, we omit its presentation here. However, we note the following relationship between $\alpha$ and $\gamma$ in the construction. As $\alpha$ decreases, weights decay more rapidly as $i$ increases, since a smaller fraction of each unit-length stick remains prior to obtaining a weight. As $\alpha$ increases, the weights decay more gradually over several rounds. The expected weight on an atom in round $i$ is equal to $\alpha^{i-1}/(1+\alpha)^i$. The number of atoms in each round is controlled by $\gamma$.
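The rounds-based process in (5) translates directly into a truncated sampler. The following Python sketch is our illustration; the function names and the round cap are assumptions, not part of the paper.

```python
import numpy as np

def bp_stick_breaking(alpha, gamma, base_sampler, max_rounds=50, rng=None):
    """Truncated sketch of the beta process construction in (5).

    Round i draws C_i ~ Poisson(gamma) atoms; each atom's weight is the
    i-th break of its own Beta(1, alpha) stick-breaking process.
    """
    rng = np.random.default_rng() if rng is None else rng
    atoms, weights = [], []
    for i in range(1, max_rounds + 1):
        C_i = rng.poisson(gamma)
        for _ in range(C_i):
            V = rng.beta(1.0, alpha, size=i)          # V^(1), ..., V^(i)
            weights.append(V[-1] * np.prod(1.0 - V[:-1]))
            atoms.append(base_sampler(rng))           # theta_ij ~ (1/gamma) H_0
    return np.array(atoms), np.array(weights)

atoms, w = bp_stick_breaking(2.0, 3.0, lambda rng: rng.normal())
print(len(w), w[:5])  # weights in round i average alpha^(i-1) / (1 + alpha)^i
```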
3.1. Derivation of the Construction

Starting with (2), we now show how Sethuraman's constructive definition of the beta distribution can be used to derive that the infinite limit of (2) has (5) as an alternate representation that is equal in distribution. We begin by observing that, according to (4), each $\pi_k$ value can be drawn as follows,
$$\pi_k = \sum_{l=1}^{\infty} \hat{V}_{kl} \prod_{m=1}^{l-1}\left(1 - \hat{V}_{km}\right) I(\hat{Y}_{kl} = 1), \qquad \hat{V}_{kl} \overset{iid}{\sim} \text{Beta}(1, \alpha), \qquad \hat{Y}_{kl} \overset{iid}{\sim} \text{Bernoulli}\left(\frac{\gamma}{K}\right) \qquad (7)$$
where the $\hat{\ }$ marker is introduced because $\hat{V}$ will later be re-indexed as $V$. We also make the observation that, if the sum is instead taken to $K$, and we then let $K \to \infty$, then this truncated representation converges to (7).

This suggests the following procedure for constructing the limit of the vector $\pi^{(K)}$ in (2). We define the matrices $\hat{V} \in (0,1)^{K \times K}$ and $\hat{Y} \in \{0,1\}^{K \times K}$, where
$$\hat{V}_{kl} \overset{iid}{\sim} \text{Beta}(1, \alpha), \qquad \hat{Y}_{kl} \overset{iid}{\sim} \text{Bernoulli}\left(\frac{\gamma}{K}\right) \qquad (8)$$
for $k = 1, \dots, K$ and $l = 1, \dots, K$. The $K$-truncated weight, $\pi_k$, is then constructed "horizontally" by looking at the $k$th row of $\hat{V}$ and $\hat{Y}$, and where we define that the error of the truncation is assigned to $1 - \pi_k$ (i.e., $\hat{Y}_{k,l} := 0$ for the extension $l > K$). It can be seen from the matrix definitions in (8) and the underlying function of these two matrices, defined for each row as a $K$-truncated version of (7), that in the limit as $K \to \infty$, this representation converges to the infinite beta process when viewed vertically, and to a construction of the individual beta-distributed random variables when viewed horizontally, each of which occur simultaneously.

Before using these two matrices to derive (5), we derive a probability that will be used in the infinite limit. For a given column, $i$, of (8), we calculate the probability that, for a particular row, $k$, there is at least one $\hat{Y} = 1$ in the set $\{\hat{Y}_{k,1}, \dots, \hat{Y}_{k,i-1}\}$, in other words, the probability that $\sum_{i'=1}^{i-1} \hat{Y}_{ki'} > 0$. This value is
$$P\left(\sum_{i'=1}^{i-1} \hat{Y}_{ki'} > 0 \,\Big|\, \gamma, K\right) = 1 - \left(1 - \frac{\gamma}{K}\right)^{i-1} \qquad (9)$$
In the limit as $K \to \infty$, this can be shown to converge to zero for all fixed values of $i$.

As with the Dirichlet process, the problem with drawing each $\pi_k$ explicitly in the limit of (2) is that there are an infinite number of them, and any given $\pi_k$ is equal to zero with probability one. With the representation in (8), this problem appears to have doubled, since there are now an infinite number of random variables to sample in two dimensions, rather than one. However, this is only true when viewed horizontally. When viewed vertically, drawing the values of interest becomes manageable.

First, we observe that, in (8), we only care about the set of indices $\{(k,l) : \hat{Y}_{kl} = 1\}$, since these are the locations which indicate that mass is to be added to their respective $\pi_k$ values. Therefore, we seek to bypass the drawing of all indices for which $\hat{Y} = 0$, and directly draw those indices for which $\hat{Y} = 1$.

To do this, we use a property of the binomial distribution. For any column, $i$, of $\hat{Y}$, the number of nonzero locations, $\sum_{k=1}^{K} \hat{Y}_{ki}$, has the $\text{Binomial}(K, \frac{\gamma}{K})$ distribution. Also, it is well known that
$$\text{Poisson}(\gamma) = \lim_{K \to \infty} \text{Binomial}\left(K, \frac{\gamma}{K}\right) \qquad (10)$$
Therefore, in the limit as $K \to \infty$, the sum of each column (as well as row) of $\hat{Y}$ produces a random variable with a $\text{Poisson}(\gamma)$ distribution. This suggests the procedure of first drawing the number of nonzero locations for each column, followed by their corresponding indices.

Returning to (8), given the number of nonzero locations in column $i$, $\sum_{k=1}^{K} \hat{Y}_{ki} \sim \text{Binomial}(K, \frac{\gamma}{K})$, finding the indices of these locations then becomes a process of sampling uniformly from $\{1, \dots, K\}$ without replacement. Moreover, since there is a one-to-one correspondence between these indices and the atoms, $\theta_1, \dots, \theta_K \overset{iid}{\sim} \frac{1}{\gamma} H_0$, which they index, this is equivalent to selecting from the set of atoms, $\{\theta_1, \dots, \theta_K\}$, uniformly without replacement.

A third, more conceptual process, which will aid the derivation, is as follows: sample the $\sum_{k=1}^{K} \hat{Y}_{ki}$ nonzero indices for column $i$ one at a time. After an index, $k'$, is obtained, check $\{\hat{Y}_{k',1}, \dots, \hat{Y}_{k',i-1}\}$ to see whether this index has already been drawn. If it has, add the corresponding mass, $\hat{V}_{k'i} \prod_{l=1}^{i-1}(1 - \hat{V}_{k'l})$, to the tally for $\pi_{k'}$. If it has not, draw a new atom, $\theta \sim \frac{1}{\gamma} H_0$, and associate the mass with this atom.

The derivation concludes by observing the behavior of this last process as $K \to \infty$. We first reiterate that, in the limit as $K \to \infty$, the count of nonzero locations for each column is independent and identically distributed as $\text{Poisson}(\gamma)$. Therefore, for $i = 1, 2, \dots$, we can draw these numbers, $C_i := \sum_{k=1}^{K} \hat{Y}_{ki}$, as
$$C_i \overset{iid}{\sim} \text{Poisson}(\gamma) \qquad (11)$$
We next need to sample index values uniformly from the positive integers, $\mathbb{N}$. However, we recall from (9) that, for all fixed values of $i$, the probability that the drawn index will have previously seen a one is equal to zero. Therefore, using the conceptual process defined above, we can bypass sampling the index value and directly sample the atom which it indexes. Also, we note that the "without replacement" constraint no longer factors.

The final step is simply a matter of re-indexing. Let the function $\sigma_i(j)$ map the input $j \in \{1, \dots, C_i\}$ to the index of the $j$th nonzero element drawn in column $i$, as discussed above. Then the re-indexed random variables are $V_{ij}^{(i)} := \hat{V}_{\sigma_i(j),i}$ and $V_{ij}^{(l)} := \hat{V}_{\sigma_i(j),l}$, where $l < i$. We similarly re-index $\theta_{\sigma_i(j)}$ as $\theta_{ij} := \theta_{\sigma_i(j)}$, letting the double and single subscripts remove ambiguity, and hence no $\hat{\ }$ marker is used. The addition of a subscript/superscript in the two cases above arises from ordering the nonzero locations for each column of (8), i.e., the original index values for the selected rows of each column are being mapped to $1, 2, \dots$ separately for each column in a many-to-one manner. The result of this re-indexing is the process given in (5).
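A small numerical check (ours, not from the paper) of the two facts driving the vertical view: column sums of $\hat{Y}$ are Binomial(K, γ/K), which approaches Poisson(γ) as in (10) and (11), while the collision probability in (9) vanishes for fixed $i$.

```python
import numpy as np

K, gamma, n_trials = 100_000, 3.0, 2000
rng = np.random.default_rng(1)

# Column sums of Y-hat: Binomial(K, gamma/K) draws vs. Poisson(gamma)
col_sums = rng.binomial(K, gamma / K, size=n_trials)
poisson = rng.poisson(gamma, size=n_trials)
print(col_sums.mean(), col_sums.var())   # both ~ gamma = 3.0
print(poisson.mean(), poisson.var())

# Collision probability from (9): 1 - (1 - gamma/K)^(i-1) -> 0 for fixed i
for i in [2, 10, 100]:
    print(i, 1.0 - (1.0 - gamma / K) ** (i - 1))
```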
4. Inference for the Stick-Breaking Construction

For inference, we integrate out all stick-breaking random variables, $V$, using Monte Carlo integration (Gamerman & Lopes, 2006), which significantly reduces the number of random variables to be learned. As a second aid for inference, we introduce the latent round-indicator variable,
$$d_k := 1 + \sum_{i=1}^{\infty} I\left(\sum_{j=1}^{i} C_j < k\right) \qquad (12)$$
The equality $d_k = i$ indicates that the $k$th atom drawn overall occurred in round $i$. Note that, given $\{d_k\}_{k=1}^{\infty}$, we can reconstruct $\{C_i\}_{i=1}^{\infty}$. Given these latent indicators, the generative process is rewritten as,
$$H \mid \{d_k\}_{k=1}^{\infty} = \sum_{k=1}^{\infty} V_{k,d_k} \prod_{j=1}^{d_k - 1}(1 - V_{kj})\, \delta_{\theta_k}, \qquad V_{kj} \overset{iid}{\sim} \text{Beta}(1, \alpha), \qquad \theta_k \overset{iid}{\sim} \frac{1}{\gamma} H_0 \qquad (13)$$
where, for clarity in what follows, we've avoided introducing a third marker (e.g., $\tilde{V}$) after this re-indexing.

Data are generated iid from $H$ via a Bernoulli process and take the form of infinite-dimensional binary vectors, $z_n \in \{0,1\}^{\infty}$, where
$$z_{nk} \sim \text{Bernoulli}\left(V_{k,d_k} \prod_{j=1}^{d_k - 1}(1 - V_{kj})\right) \qquad (14)$$
The sufficient statistics calculated from $\{z_n\}_{n=1}^{N}$ are the counts along each dimension, $k$,
$$m_{1k} = \sum_{n=1}^{N} I(z_{nk} = 1), \qquad m_{0k} = \sum_{n=1}^{N} I(z_{nk} = 0) \qquad (15)$$
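As a sketch (our illustration; names and the round cap are assumptions) of the generative model in (12)-(15): given round counts $C_i$, each atom's round indicator $d_k$ and Bernoulli probability follow directly.

```python
import numpy as np

def generate_Z(alpha, gamma, N, max_rounds=30, rng=None):
    """Sketch of the generative process in (12)-(14).

    Draws C_i ~ Poisson(gamma), assigns round indicators d_k, builds each
    atom's probability V_{k,d_k} * prod_{j<d_k}(1 - V_{kj}), then samples
    a Bernoulli process matrix Z of N binary vectors.
    """
    rng = np.random.default_rng() if rng is None else rng
    C = rng.poisson(gamma, size=max_rounds)            # atoms per round
    d = np.repeat(np.arange(1, max_rounds + 1), C)     # d_k from (12)
    probs = np.empty(len(d))
    for k, dk in enumerate(d):
        V = rng.beta(1.0, alpha, size=dk)              # atom-specific stick
        probs[k] = V[-1] * np.prod(1.0 - V[:-1])       # Bernoulli prob, (14)
    Z = (rng.random((N, len(d))) < probs).astype(int)  # N x K binary matrix
    return Z, d, probs

Z, d, probs = generate_Z(alpha=2.0, gamma=3.0, N=100)
m1 = Z.sum(axis=0)                                     # m_{1k} from (15)
print(Z.shape, m1[:10])
```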
4.1. Inference for $d_k$

With each iteration, we sample the sequence $\{d_k\}_{k=1}^{K}$ without using future values from the previous iteration; the value of $K$ is random and equals the number of nonzero $m_{1k}$. The probability that the $k$th atom was observed in round $i$ is proportional to
$$p\left(d_k = i \mid \{d_l\}_{l=1}^{k-1}, \{z_{nk}\}_{n=1}^{N}, \alpha, \gamma\right) \propto p\left(\{z_{nk}\}_{n=1}^{N} \mid d_k = i, \alpha\right) p\left(d_k = i \mid \{d_l\}_{l=1}^{k-1}, \gamma\right) \qquad (16)$$
Below, we discuss the likelihood and prior terms, followed by an approximation to the posterior.

4.1.1. Likelihood Term

The integral to be solved for integrating out the random variables $\{V_{kj}\}_{j=1}^{i}$ is
$$p\left(\{z_{nk}\}_{n=1}^{N} \mid d_k = i, \alpha\right) = \int_{(0,1)^i} f\left(\{V_{kj}\}_{j=1}^{i}\right)^{m_{1k}} \left\{1 - f\left(\{V_{kj}\}_{j=1}^{i}\right)\right\}^{m_{0k}} p\left(\{V_{kj}\}_{j=1}^{i} \mid \alpha\right) dV \qquad (17)$$
where $f(\cdot)$ is the stick-breaking function used in (14). Though this integral can be analytically solved for integer values of $m_{0k}$ via the binomial expansion, we have found that the resulting sum of terms leads to computational precision issues for even small sample sizes. Therefore, we use Monte Carlo methods to approximate this integral. For $s = 1, \dots, S$ samples, $\{V_{kj}^{(s)}\}_{j=1}^{i}$, drawn iid from $\text{Beta}(1, \alpha)$, we calculate
$$p\left(\{z_{nk}\}_{n=1}^{N} \mid d_k = i, \alpha\right) \approx \frac{1}{S} \sum_{s=1}^{S} f\left(\{V_{kj}^{(s)}\}_{j=1}^{i}\right)^{m_{1k}} \left\{1 - f\left(\{V_{kj}^{(s)}\}_{j=1}^{i}\right)\right\}^{m_{0k}} \qquad (18)$$
This approximation allows for the use of natural logarithms in calculating the posterior, which was not possible with the analytic solution. Also, to reduce computations, we note that at most two random variables need to be drawn to perform the above stick-breaking, one random variable for the proportion and one for the error; this is detailed in the appendix.

4.1.2. Prior Term

The prior for the sequence of indicators $d_1, d_2, \dots$ is the equivalent sequential process for sampling $C_1, C_2, \dots$, where $C_i = \sum_{k=1}^{\infty} I(d_k = i) \sim \text{Poisson}(\gamma)$. Let $\#d_{k-1} = \sum_{j=1}^{k-1} I(d_j = d_{k-1})$ and let $P_\gamma(\cdot)$ denote the Poisson distribution with parameter $\gamma$. Then it can be shown that
$$p(d_k = d_{k-1} \mid \gamma, \#d_{k-1}) = \frac{P_\gamma(C > \#d_{k-1})}{P_\gamma(C \geq \#d_{k-1})} \qquad (19)$$
Also, for $h = 1, 2, \dots$, the probability
$$p(d_k = d_{k-1} + h \mid \gamma, \#d_{k-1}) = \left(1 - \frac{P_\gamma(C > \#d_{k-1})}{P_\gamma(C \geq \#d_{k-1})}\right) P_\gamma(C > 0)\, P_\gamma(C = 0)^{h-1} \qquad (20)$$
Since $d_k$ cannot be less than $d_{k-1}$, these two terms complete the prior.

4.1.3. Posterior of $d_k$

For the posterior, the normalizing constant requires integration over $h = 0, 1, 2, \dots$, which is not possible given the proposed sampling method. We therefore propose incrementing $h$ until the resulting truncated probability of the largest value of $h$ falls below a threshold (e.g., $10^{-6}$). We have found that the probabilities tend to decrease rapidly for $h > 1$.

4.2. Inference for $\gamma$

Given $d_1, d_2, \dots$, the values $C_1, C_2, \dots$ can be reconstructed and a posterior for $\gamma$ can be obtained using a conjugate gamma prior. Since the value of $d_K$ may not be the last in the sequence composing $C_{d_K}$, this value can be "completed" by sampling from the prior, which can additionally serve as proposal factors.

4.3. Inference for $\alpha$

Using (18), we again integrate out all stick-breaking random variables to calculate the posterior of $\alpha$,
$$p\left(\alpha \mid \{z_n\}_{n=1}^{N}, \{d_k\}_{k=1}^{K}\right) \propto p\left(\{z_{nk}\}_{n=1}^{N} \mid \alpha, \{d_k\}_{k=1}^{K}\right) p(\alpha)$$
Since this is not possible for the positive, real-valued $\alpha$, we approximate this posterior by discretizing the space. Specifically, using the value of $\alpha$ from the previous iteration, $\alpha_{prev}$, we perform Monte Carlo integration at the points $\{\alpha_{prev} + t\Delta\}_{t=-T}^{T}$, ensuring that $\alpha_{prev} - T\Delta > 0$. We use an improper, uniform prior for $\alpha$, with the resulting probability therefore being the normalized likelihood over the discrete set of selected points. As with sampling $d_k$, we again extend the limits beyond $\alpha_{prev} \pm T\Delta$, checking that the tails of the resulting probability fall below a threshold.
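A minimal sketch (ours) of the Monte Carlo approximation in (18), which also underlies the $\alpha$ posterior above: log-domain averaging of the stick-breaking likelihood to avoid the precision issues noted for the analytic solution.

```python
import numpy as np

def log_likelihood_mc(i, alpha, m1, m0, S=10_000, rng=None):
    """Monte Carlo estimate of (17) via (18), computed in the log domain.

    f(V) = V_i * prod_{j<i}(1 - V_j) is the stick-breaking function in (14);
    we average f(V)^m1 * (1 - f(V))^m0 over S iid Beta(1, alpha) draws.
    """
    rng = np.random.default_rng() if rng is None else rng
    V = rng.beta(1.0, alpha, size=(S, i))
    f = V[:, -1] * np.prod(1.0 - V[:, :-1], axis=1)    # shape (S,)
    log_terms = m1 * np.log(f) + m0 * np.log1p(-f)
    m = log_terms.max()                                # stable log-mean-exp
    return m + np.log(np.mean(np.exp(log_terms - m)))

# Example: 40 of 100 observations used atom k; compare candidate rounds i
for i in [1, 2, 3, 4]:
    print(i, log_likelihood_mc(i, alpha=2.0, m1=40, m0=60))
```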
4.4. Inference for $p(z_{nk} = 1 \mid \alpha, d_k, Z_{prev})$

In latent factor models (Griffiths & Ghahramani, 2005), the vectors $\{z_n\}_{n=1}^{N}$ are to be learned with the rest of the model parameters. To calculate the posterior of a given binary indicator therefore requires a prior, which we calculate as follows,
$$p(z_{nk} = 1 \mid \alpha, d_k, Z_{prev}) = \int_{(0,1)^{d_k}} p(z_{nk} = 1 \mid V)\, p(V \mid \alpha, d_k, Z_{prev})\, dV = \frac{\int_{(0,1)^{d_k}} p(z_{nk} = 1 \mid V)\, p(Z_{prev} \mid V)\, p(V \mid \alpha, d_k)\, dV}{\int_{(0,1)^{d_k}} p(Z_{prev} \mid V)\, p(V \mid \alpha, d_k)\, dV} \qquad (21)$$
We again perform Monte Carlo integration as in (18), where the numerator increments the count $m_{1k}$ of the denominator by one. For computational speed, we treat the previous latent indicators, $Z_{prev}$, as a block (Ishwaran & James, 2001), allowing this probability to remain fixed when sampling the new matrix, $Z$.

5. Experiments

We present experimental results on three data sets: (i) a synthetic data set; (ii) the MNIST handwritten digits data set (digits 3, 5 and 8); and (iii) a time-evolving gene expression data set.

5.1. Synthetic Data

For the synthetic problem, we investigate the ability of the inference procedure in Section 4 to learn the underlying $\alpha$ and $\gamma$ used in generating $H$. We use the representation in (2) to generate $\pi^{(K)}$ for $K = 100{,}000$. This provides a sample of $\pi^{(K)}$ that approximates the infinite beta process well for smaller values of $\alpha$ and $\gamma$. We then sample $\{z_n\}_{n=1}^{1000}$ from a Bernoulli process and remove all dimensions, $k$, for which $m_{1k} = 0$. Since the weights in (13) are stochastically decreasing as $k$ increases, while the representation in (2) is exchangeable in $k$, we reorder the dimensions of $\{z_n\}_{n=1}^{1000}$ so that $m_{1,1} \geq m_{1,2} \geq \dots$. The binary vectors are treated as observed for this problem.

We present results in Figure 1 for 5,500 trials, where $\alpha_{true} \sim \text{Uniform}(1, 10)$ and $\gamma_{true} \sim \text{Uniform}(1, 10)$. We see that the inferred $\alpha_{out}$ and $\gamma_{out}$ values center on the true $\alpha_{true}$ and $\gamma_{true}$, but increase in variance as these values increase. We believe that this is due in part to the reordering of the dimensions, which are not strictly decreasing in (5), though some reordering is necessary because of the nature of the two priors. We choose to generate data from (2) rather than (5) because it provides some added empirical evidence as to the correctness of the stick-breaking construction.

Figure 1. Synthetic results for learning $\alpha$ and $\gamma$. For each trial of 150 iterations, 10 samples were collected and averaged over the last 50 iterations. The step size $\Delta = 0.1$. (a) Inferred $\alpha_{out}$ vs true $\alpha_{true}$ (b) Inferred $\gamma_{out}$ vs true $\gamma_{true}$ (c) A plane, shown as an image, fit using least squares that shows the $\ell_1$ distance of the inferred $(\alpha_{out}, \gamma_{out})$ to the true $(\alpha_{true}, \gamma_{true})$.

5.2. MNIST Handwritten Digits

We consider the digits 3, 5 and 8, using 1000 observations for each digit and projecting into 50 dimensions using PCA. We model the resulting digits matrix, $X \in \mathbb{R}^{50 \times 3000}$, with a latent factor model (Griffiths & Ghahramani, 2005; Paisley & Carin, 2009),
$$X = \Phi(W \circ Z) + E \qquad (22)$$
where the columns of $Z$ are samples from a Bernoulli process, and the elements of $\Phi$ and $W$ are iid Gaussian. The symbol $\circ$ indicates element-wise multiplication.
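The generative side of (22) is straightforward to sketch. This is our illustration, not the paper's implementation: the per-factor Bernoulli probabilities and the noise variance below are hypothetical placeholders standing in for the beta process and inverse-gamma priors.

```python
import numpy as np

def sample_latent_factor_model(D=50, N=3000, K=100, rng=None):
    """Generative sketch of the latent factor model X = Phi(W o Z) + E in (22).

    Columns of Z come from a Bernoulli process (here: fixed per-factor
    probabilities for illustration); Phi, W are iid Gaussian; o denotes
    element-wise multiplication.
    """
    rng = np.random.default_rng() if rng is None else rng
    Phi = rng.normal(size=(D, K))                  # factor loadings
    W = rng.normal(size=(K, N))                    # factor weights
    pi = rng.beta(1.0, 5.0, size=K)                # stand-in for BP weights
    Z = (rng.random((K, N)) < pi[:, None]).astype(float)
    E = 0.1 * rng.normal(size=(D, N))              # noise, variance assumed
    return Phi @ (W * Z) + E

X = sample_latent_factor_model()
print(X.shape)  # (50, 3000)
```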
We infer all variance parameters using inverse-gamma priors, and integrate out the weights, $w_n$, when sampling $z_n$. Gibbs sampling is performed for all parameters, except for the variance parameters, where we perform variational inference (Bishop, 2006). We have found that the "inflation" of the variance parameters that results from the variational expectation leads to faster mixing for the latent factor model.

Figure 2 displays the inference results for an initialization of $K = 200$. The top-left figure shows the number of factors as a function of 10,000 Gibbs iterations, and the top-right figure shows the histogram of these values after 1000 burn-in iterations. For Monte Carlo integration, we use $S = 100{,}000$ samples from the stick-breaking prior for sampling $d_k$ and $p(z_{nk} = 1 \mid \alpha, d_k, Z_{prev})$, and $S = 10{,}000$ samples for sampling $\alpha$, since learning this parameter requires significantly more overall samples. The average time per iteration was approximately 18 seconds, though this value increases when $K$ increases and vice versa. In the bottom two rows of Figure 2, we show four example factor loadings (columns of $\Phi$), as well as the probability of each being used by a 3, 5 and 8.

Figure 2. Results for MNIST digits 3, 5 and 8. Top left: the number of factors as a function of iteration number. Top right: a histogram of the number of factors after 1000 burn-in iterations. Middle row: several example learned factors. Bottom row: the probability of a digit possessing the factor directly above.

5.3. Time-Evolving Gene Expression Data

We next apply the model discussed in Section 5.2 to data from a viral challenge study (Zaas et al., 2009). In this study, a cohort of 17 healthy volunteers was experimentally infected with the influenza A virus at varying dosages. Blood was taken at intervals between −4 and 120 hours from infection and gene expression values were extracted. Of the 17 patients, 9 ultimately became symptomatic (i.e., became ill), and the goal of the study was to detect this in the gene expression values prior to the initial showing of symptoms. There were a total of 16 time points and 267 gene expression extractions, each including expression values for 12,023 genes. Therefore, the data matrix $X \in \mathbb{R}^{267 \times 12023}$.

In Figure 3, we show results for 4000 iterations; each iteration took an average of 2.18 minutes. The top row shows the number of factors as a function of iteration, with 100 initial factors, and histograms of the overall number of factors and the number of factors per observation. In the remaining rows, we show four discriminative factor loading vectors, with the statistics from the 267 values displayed as a function of time. We note that the expression values begin to increase for the symptomatic patients prior to the onset of symptoms around the 45th hour. We list the top genes for each factor, as determined by the magnitude of values in $W$ for that factor. In addition, the top three genes in terms of the magnitude of the four-dimensional vector comprising these factors are RSAD2, IFI27 and IFI44L; the genes listed here have a significant overlap with those in the literature (Zaas et al., 2009).

Figure 3. Results for time-evolving gene expression data. Top row: (left) number of factors per iteration; (middle) histogram of the total number of factors after 1000 burn-in iterations; (right) histogram of the number of factors used per observation. Rows 2-5: discriminative factors and the names of the most important genes associated with each factor (as determined by weight).

As motivated in (Griffiths & Ghahramani, 2005), the values in $Z$ are an alternative to hard clustering, and in this case are useful for group selection. For example, sparse linear classifiers for the model $y = X\beta + \epsilon$, such as the RVM (Bishop, 2006), are prone to select single correlated genes from $X$ for prediction, setting the others to zero.
In (West, 2003), latent factor models were motivated as a dimensionality reduction step prior to learning the classifier $y = \hat{Z}\beta + \epsilon_2$, where the loading matrix $\hat{Z}$ replaces $X$ and unlabeled data are inferred transductively. In this case, discriminative factors selected by the model represent groups of genes associated with that factor, as indicated by $Z$.

6. Conclusion

We have presented a new stick-breaking construction of the beta process. The derivation relies heavily upon the constructive definition of the beta distribution, a special case of (Sethuraman, 1994), which has been used exclusively in its infinite form in the machine learning community. We presented an inference algorithm that uses Monte Carlo integration to eliminate several random variables. Results were presented on synthetic data, the MNIST handwritten digits 3, 5 and 8, and time-evolving gene expression data.

As a final comment, we note that the limit of the representation in (2) reduces to the original IBP when $\alpha = 1$. Therefore, the stick-breaking process in (5) should be equal in distribution to the process in (Teh et al., 2007) for this parametrization. The proof of this equality is an interesting question for future work.

Acknowledgements

This research was supported by DARPA under the PHD Program, and by the Army Research Office under grant W911NF-08-1-0182.

References

Bishop, C.M. Pattern Recognition and Machine Learning. Springer, New York, 2006.

Blackwell, D. and MacQueen, J.B. Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1:353-355, 1973.

Damien, P., Laud, P.W., and Smith, A.F.M. Implementation of Bayesian nonparametric inference based on beta processes. Scandinavian Journal of Statistics, 23(1):27-36, 1996.

Ferguson, T. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209-230, 1973.

Gamerman, D. and Lopes, H.F. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Second Edition. Chapman & Hall, 2006.

Ghahramani, Z., Griffiths, T.L., and Sollich, P. Bayesian nonparametric latent feature models. Bayesian Statistics, 2007.
Griffiths, T.L. and Ghahramani, Z. Infinite latent feature models and the Indian buffet process. In NIPS, pp. 475-482, 2005.

Hjort, N.L. Nonparametric Bayes estimators based on beta processes in models for life history data. The Annals of Statistics, 18(3):1259-1294, 1990.

Ishwaran, H. and James, L.F. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96:161-173, 2001.

Lee, J. and Kim, Y. A new algorithm to generate beta processes. Computational Statistics & Data Analysis, 47(3):441-453, 2004.

Paisley, J. and Carin, L. Nonparametric factor analysis with beta process priors. In Proc. of the ICML, pp. 777-784, 2009.

Sethuraman, J. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639-650, 1994.

Teh, Y.W. and Görür, D. Indian buffet processes with power-law behavior. In NIPS, 2009.

Teh, Y.W., Görür, D., and Ghahramani, Z. Stick-breaking construction for the Indian buffet process. In AISTATS, 2007.

Thibaux, R. and Jordan, M.I. Hierarchical beta processes and the Indian buffet process. In AISTATS, 2007.

West, M. Bayesian factor regression models in the "large p, small n" paradigm. Bayesian Statistics, 2003.

Wolpert, R.L. and Ickstadt, K. Simulations of Lévy random fields. Practical and Semiparametric Bayesian Statistics, pp. 227-242, 1998.

Zaas, A., Chen, M., Varkey, J., Veldman, T., Hero, A.O., Lucas, J., Huang, Y., Turner, R., Gilbert, A., Lambkin-Williams, R., Oien, N., Nicholson, B., Kingsmore, S., Carin, L., Woods, C., and Ginsburg, G.S. Gene expression signatures diagnose influenza and other symptomatic respiratory viral infections in humans. Cell Host & Microbe, 6:207-217, 2009.

7. Appendix

Following $i - 1$ breaks from a $\text{Beta}(1, \alpha)$ stick-breaking process, the remaining length of the unit-length stick is $\epsilon_i = \prod_{j=1}^{i-1}(1 - V_j)$. Let $S_j := -\ln(1 - V_j)$. Then, since it can be shown that $S_j \sim \text{Exponential}(\alpha)$, and therefore $T_i := \sum_{j=1}^{i-1} S_j \sim \text{Gamma}(i - 1, \alpha)$, the value of $\epsilon_i$ can be calculated using only one random variable,
$$\epsilon_i = e^{-T_i}, \qquad T_i \sim \text{Gamma}(i - 1, \alpha)$$
Therefore, to draw $V_i \prod_{j=1}^{i-1}(1 - V_j)$, one can sample $V_i \sim \text{Beta}(1, \alpha)$ and $\epsilon_i$ as above.
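In code (a sketch of ours), the appendix's identity replaces $i - 1$ beta draws with a single gamma draw. Note that numpy's gamma sampler is parameterized by scale, so rate $\alpha$ becomes scale $1/\alpha$, an implementation detail we flag as our own.

```python
import numpy as np

def stick_weight_two_draws(i, alpha, rng=None):
    """Draw V_i * prod_{j<i}(1 - V_j) with two random variables (appendix).

    epsilon_i = exp(-T_i), with T_i ~ Gamma(i-1, rate=alpha), replaces the
    i-1 draws of 1 - V_j; numpy uses scale = 1/rate for its gamma sampler.
    """
    rng = np.random.default_rng() if rng is None else rng
    V_i = rng.beta(1.0, alpha)
    T_i = rng.gamma(shape=i - 1, scale=1.0 / alpha) if i > 1 else 0.0
    return V_i * np.exp(-T_i)

# Sanity check: the expectation should match alpha^(i-1) / (1 + alpha)^i
# (the expected round-i weight from Section 3).
rng = np.random.default_rng(2)
i, alpha = 4, 2.0
est = np.mean([stick_weight_two_draws(i, alpha, rng) for _ in range(50_000)])
print(est, alpha ** (i - 1) / (1 + alpha) ** i)  # both ~ 0.0988
```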