Nonparametric Factor Analysis with Beta Process Priors

John Paisley (jwp4@ee.duke.edu)    Lawrence Carin (lcarin@ee.duke.edu)
Department of Electrical & Computer Engineering, Duke University, Durham, NC 27708

Abstract

We propose a nonparametric extension to the factor analysis problem using a beta process prior. This beta process factor analysis (BP-FA) model allows a dataset to be decomposed into a linear combination of a sparse set of factors, providing information on the underlying structure of the observations. As with the Dirichlet process, the beta process is a fully Bayesian conjugate prior, which allows for analytical posterior calculation and straightforward inference. We derive a variational Bayes inference algorithm and demonstrate the model on the MNIST digits and HGDP-CEPH cell line panel datasets.

1. Introduction

Latent membership models provide a useful means for discovering underlying structure in a dataset by elucidating the relationships between observed data. For example, in latent class models, observations are assumed to be generated from one of $K$ classes, with mixture models constituting a classic example. When a single class indicator is considered too restrictive, latent feature models can be employed, allowing an observation to possess combinations of up to $K$ latent features. As $K$ is typically unknown, Bayesian nonparametric models seek to remove the need to set this value by defining robust, but sparse, priors on infinite spaces. For example, the Dirichlet process (Ferguson, 1973) allows for nonparametric mixture modeling in the latent class scenario. In the latent feature paradigm, the beta process (Hjort, 1990) has been defined and can be used toward the same objective; when marginalized, it is closely related to the Indian buffet process (Griffiths & Ghahramani, 2005; Thibaux & Jordan, 2007).

An example of a latent feature model is the factor analysis model (West, 2003), where a data matrix is decomposed into the product of two matrices plus noise,

$$X = \Phi Z + E \qquad (1)$$

In this model, the columns of the $D \times K$ matrix of factor loadings, $\Phi$, can be modeled as latent features, and the elements in each of the $N$ columns of $Z$ can be modeled as indicators of the possession of a feature by the corresponding column of $X$ (which can be given an associated weight). It therefore seems natural to seek a nonparametric model for this problem. To this end, several models have been proposed that use the Indian buffet process (IBP) (Knowles & Ghahramani, 2007; Rai & Daumé, 2008; Meeds et al., 2007). However, these models require MCMC inference, which can be slow to converge. In this paper, we propose a beta process factor analysis (BP-FA) model that is fully conjugate and therefore has a fast variational solution; this is an intended contribution of this paper. Starting from first principles, we show how the beta process can be formulated to solve the nonparametric factor analysis problem, as the Dirichlet process has previously been shown to solve the nonparametric mixture modeling problem; we intend for this to be a second contribution of this paper.

The remainder of the paper is organized as follows. In Section 2 we review the beta process in detail. We introduce the BP-FA model in Section 3 and discuss some of its theoretical properties. In Section 4 we derive a variational Bayes inference algorithm for fast inference, exploiting the full conjugacy within the model. Experimental results are presented in Section 5 on synthetic data, and on the MNIST digits and HGDP-CEPH cell line panel (Rosenberg et al., 2002) datasets. We conclude and discuss future work in Section 6.
2. The Beta Process

The beta process, first introduced by Hjort for survival analysis (Hjort, 1990), is an independent increments, or Lévy, process and can be defined as follows.

Definition: Let $\Omega$ be a measurable space and $\mathcal{B}$ its $\sigma$-algebra. Let $H_0$ be a continuous probability measure on $(\Omega, \mathcal{B})$ and $\alpha$ a positive scalar. Then for all disjoint, infinitesimal partitions, $\{B_1, \dots, B_K\}$, of $\Omega$, the beta process is generated as follows,

$$H(B_k) \sim \mathrm{Beta}\big(\alpha H_0(B_k),\ \alpha(1 - H_0(B_k))\big) \qquad (2)$$

with $K \to \infty$ and $H_0(B_k) \to 0$ for $k = 1, \dots, K$. This process is denoted $H \sim \mathrm{BP}(\alpha H_0)$.

Because of the convolution properties of beta random variables, the beta process does not satisfy the Kolmogorov consistency condition, and is therefore defined in the infinite limit (Billingsley, 1995). Hjort extends this definition to include functions $\alpha(B_k)$, which for simplicity is here set to a constant.

Like the Dirichlet process, the beta process can be written in set function form,

$$H(\omega) = \sum_{k=1}^{\infty} \pi_k \,\delta_{\omega_k}(\omega) \qquad (3)$$

with $H(\omega_k) = \pi_k$. Also like the Dirichlet process, means for drawing $H$ are not obvious; we briefly discuss this issue in Section 2.1. In the case of the beta process, $H$ does not serve as a probability mass function on $\Omega$, but rather as part of a new measure on $\Omega$ that parameterizes a Bernoulli process, defined as follows.

Definition: Let the column vector $z_i$ be infinite and binary, with the $k$-th value, $z_{ik}$, generated by

$$z_{ik} \sim \mathrm{Bernoulli}(\pi_k) \qquad (4)$$

The new measure, $X_i(\omega) = \sum_k z_{ik}\,\delta_{\omega_k}(\omega)$, is then drawn from a Bernoulli process, or $X_i \sim \mathrm{BeP}(H)$. By arranging samples of the infinite-dimensional vector $z_i$ in matrix form, $Z = [z_1, \dots, z_N]$, the beta process is seen to be a prior over infinite binary matrices, with each row in the matrix $Z$ corresponding to a location, $\omega_k$.

2.1. The Marginalized Beta Process and the Indian Buffet Process

As previously mentioned, sampling $H$ directly from the infinite beta process is difficult, but a marginalized approach can be derived in the same manner as the corresponding Chinese restaurant process (Aldous, 1985), used for sampling from the Dirichlet process. We briefly review this marginalization, discussing the link to the Indian buffet process (Griffiths & Ghahramani, 2005) as well as other theoretical properties of the beta process that arise as a result.

We first extend the beta process to take two scalar parameters, $a, b$, and partition $\Omega$ into $K$ regions of equal measure, or $H_0(B_k) = 1/K$ for $k = 1, \dots, K$. We can then write the generative process in the form of (3) as

$$H(B) = \sum_{k=1}^{K} \pi_k\,\delta_{B_k}(B), \qquad \pi_k \sim \mathrm{Beta}\big(a/K,\ b(K-1)/K\big) \qquad (5)$$

where $B \in \{B_1, \dots, B_K\}$. Marginalizing the vector $\pi$ and letting $K \to \infty$, the matrix $Z$ can be generated directly from the beta process prior as follows:

1. For an infinite matrix $Z$ initialized to all zeros, set the first $c_1 \sim \mathrm{Po}(a/b)$ rows of $z_1$ to 1. Sample the associated locations, $\omega_i$, $i = 1, \dots, c_1$, independently from $H_0$.

2. For observation $N$, sample $c_N \sim \mathrm{Po}\!\left(\frac{a}{b+N-1}\right)$ and define $C_N \equiv \sum_{i=1}^{N} c_i$. For rows $k = 1, \dots, C_{N-1}$ of $z_N$, sample

$$z_{Nk} \sim \mathrm{Bernoulli}\!\left(\frac{n_{Nk}}{b+N-1}\right) \qquad (6)$$

where $n_{Nk} \equiv \sum_{i=1}^{N-1} z_{ik}$, the number of previous observations with a 1 at location $k$. Set indices $C_{N-1}+1$ to $C_N$ equal to 1 and sample the associated locations independently from $H_0$.

If we define

$$H(\omega) = \sum_{k=1}^{C_N} \frac{n_{Nk}}{b+N-1}\,\delta_{\omega_k}(\omega) \qquad (7)$$

then $H \to \mathrm{BP}(a, b, H_0)$ in the limit as $N \to \infty$, and the exchangeable columns of $Z$ are drawn iid from a beta process. As can be seen, in the case where $b = 1$, the marginalized beta process is equivalent to the Indian buffet process (Thibaux & Jordan, 2007).

This representation can be used to derive some interesting properties of the beta process. We observe that the random variable $C_N$ has a Poisson distribution, $C_N \sim \mathrm{Po}\!\left(\sum_{i=1}^{N} \frac{a}{b+i-1}\right)$, which provides a sense of how the matrix $Z$ grows with sample size. Furthermore, since $\sum_{i=1}^{N} \frac{a}{b+i-1} \to \infty$ as $N \to \infty$, we can deduce that the entire space $\Omega$ will be explored as the number of samples grows to infinity.

Figure 1. Estimation of $\pi$ from 5000 marginal beta process runs of 500 samples each, with various $a, b$ initializations.

We show in Figure 1 the expectation of $\pi$ calculated empirically by drawing from the marginalized beta process. As can be seen, the $a, b$ parameters offer flexibility in both the magnitude and shape of $\pi$ and can be tuned. A small simulation sketch of this marginal sampling scheme is given below.
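To make the two-step marginal sampling scheme above concrete, the following is a minimal Python sketch that draws the binary matrix $Z$ column by column, assuming numpy. The function name and interface are our own illustration, not the authors' code, and the locations $\omega_k \sim H_0$ are omitted since only the binary pattern is generated here.

```python
import numpy as np

def sample_marginal_bp(N, a=1.0, b=1.0, rng=None):
    """Draw an N-column binary matrix Z from the marginalized beta process
    of Section 2.1 (eqs. 6-7).  Illustrative sketch only."""
    rng = np.random.default_rng() if rng is None else rng
    counts = []    # n_k: number of previous observations owning factor k
    columns = []   # each observation's binary vector, grown as factors appear
    for i in range(1, N + 1):
        K = len(counts)
        # existing factors: z_ik ~ Bernoulli(n_k / (b + i - 1)), eq. (6)
        z = [rng.random() < counts[k] / (b + i - 1) for k in range(K)]
        # new factors: c_i ~ Po(a / (b + i - 1)), all set to 1
        c_new = rng.poisson(a / (b + i - 1))
        counts = [counts[k] + z[k] for k in range(K)] + [1] * c_new
        columns.append(z + [True] * c_new)
    K = len(counts)
    Z = np.zeros((K, N), dtype=int)
    for i, z in enumerate(columns):
        Z[:len(z), i] = np.array(z, dtype=int)
    return Z
```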
2.2. Finite Approximation to the Beta Process

As hinted at in (5), a finite approximation to the beta process can be made by simply setting $K$ to a large but finite number. This approximation can be viewed as serving a function similar to the finite Dirichlet distribution in its approximation of the infinite Dirichlet process for mixture modeling. The finite representation is written as

$$H(\omega) = \sum_{k=1}^{K} \pi_k\,\delta_{\omega_k}(\omega), \qquad \pi_k \sim \mathrm{Beta}\big(a/K,\ b(K-1)/K\big), \qquad \omega_k \sim H_0 \qquad (8)$$

with the $K$-dimensional vector $z_i$ drawn from a finite Bernoulli process parameterized by $H$. The full conjugacy of this representation means posterior computation is analytical, which will allow variational inference to be performed on the BP-FA model.

We briefly mention that a stick-breaking construction of the beta process has recently been derived (Paisley & Carin, 2009), allowing for exact Bayesian inference. A construction for the Indian buffet process has also been presented (Teh et al., 2007), though this method does not extend to the more general beta process. We will use the finite approximation presented here in the following sections.

3. Beta Process Factor Analysis

Factor analysis can be viewed as the process of modeling a data matrix, $X \in \mathbb{R}^{D \times N}$, as the multiplication of two matrices, $\Phi \in \mathbb{R}^{D \times K}$ and $Z \in \mathbb{R}^{K \times N}$, plus an error matrix, $E$,

$$X = \Phi Z + E \qquad (9)$$

Often, prior knowledge about the structure of the data is used, for example, the desired sparseness properties of the $\Phi$ or $Z$ matrices (West, 2003; Rai & Daumé, 2008; Knowles & Ghahramani, 2007). The beta process is another such prior that achieves this sparseness, allowing $K$ to tend to infinity while only focusing on a small subset of the columns of $\Phi$ via the sparse matrix $Z$.

In beta process factor analysis (BP-FA), we model the matrices $\Phi$ and $Z$ as $N$ draws from a Bernoulli process parameterized by a beta process, $H$. First, we recall that draws from the BeP-BP approximation can be generated as

$$z_{ik} \sim \mathrm{Bernoulli}(\pi_k), \qquad \pi_k \sim \mathrm{Beta}\big(a/K,\ b(K-1)/K\big), \qquad \phi_k \sim H_0 \qquad (10)$$

for observation $i = 1, \dots, N$ and latent feature (or factor) $k = 1, \dots, K$. In the general definition, $H_0$ was unspecified, as was the use of the latent membership vector, $z_i$. For BP-FA, we let $H_0$ be multivariate normal and the latent factors be indicators of linear combinations of these locations, which can be written in matrix notation as $\Phi z_i$, where $\Phi = [\phi_1, \dots, \phi_K]$. Adding the noise vector, $\epsilon_i$, we obtain observation $x_i$. The beta process can thus be seen as a prior on the parameters, $\{\pi, \Phi\}$, with iid Bernoulli process samples composing the expectation matrix, $E[X] = \Phi Z$, for the factor analysis problem.

As an unweighted linear combination might be too restrictive, we include a weight vector, $w_i$, which results in the following generative process for observation $i = 1, \dots, N$,

$$x_i = \Phi(z_i \circ w_i) + \epsilon_i$$
$$w_i \sim N(0, \sigma_w^2 I), \qquad z_{ik} \sim \mathrm{Bernoulli}(\pi_k), \qquad \pi_k \sim \mathrm{Beta}\big(a/K,\ b(K-1)/K\big)$$
$$\phi_k \sim N(0, \Sigma), \qquad \epsilon_i \sim N(0, \sigma_n^2 I) \qquad (11)$$

for $k = 1, \dots, K$ and all values drawn independently. The symbol $\circ$ represents the Hadamard, or elementwise, multiplication of two vectors. We show a graphical representation of the BP-FA model in Figure 2. A sketch of sampling synthetic data from this generative process is given below.

Figure 2. A graphical representation of the BP-FA model.
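As a concrete illustration of (8) and (11), the following Python sketch draws a synthetic dataset from the finite BP-FA generative model with $H_0 = N(0, I)$. The function name and default parameter values are illustrative assumptions, not the authors' code.

```python
import numpy as np

def generate_bpfa_data(N=250, D=25, K=100, a=1.0, b=1.0,
                       sigma_w=1.0, sigma_n=0.26, rng=None):
    """Draw (X, Phi, Z, W) from the finite BP-FA generative model of (11),
    with H0 = N(0, I).  Defaults are illustrative only."""
    rng = np.random.default_rng() if rng is None else rng
    pi = rng.beta(a / K, b * (K - 1) / K, size=K)            # eq. (8): pi_k
    Phi = rng.normal(0.0, 1.0, size=(D, K))                   # phi_k ~ N(0, I)
    Z = (rng.random((K, N)) < pi[:, None]).astype(float)      # z_ik ~ Bernoulli(pi_k)
    W = rng.normal(0.0, sigma_w, size=(K, N))                 # w_i ~ N(0, sigma_w^2 I)
    E = rng.normal(0.0, sigma_n, size=(D, N))                 # noise
    X = Phi @ (Z * W) + E                                     # eq. (12) in matrix form
    return X, Phi, Z, W
```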
Written in matrix notation, the weighted BP-FA model of (11) is thus a prior on

$$X = \Phi(Z \circ W) + E \qquad (12)$$

Under this prior, the mean and covariance of a given vector, $x$, can be calculated,

$$E[x] = 0, \qquad E[xx^T] = \frac{aK}{a + b(K-1)}\,\sigma_w^2\,\Sigma + \sigma_n^2 I \qquad (13)$$

Letting $K \to \infty$, we see that $E[xx^T] \to \frac{a}{b}\sigma_w^2\Sigma + \sigma_n^2 I$. Therefore, the BP-FA model remains well-defined in the infinite limit. To emphasize, compare this value with $z$ removed, where $E[xx^T] = K\sigma_w^2\Sigma + \sigma_n^2 I$. The coefficient $\frac{a}{b}$ is significant in that it represents the expected number of factors present in an observation as $K \to \infty$. That is, if we define $m_i \equiv \sum_{k=1}^{\infty} z_{ik}$, where $z_i \sim \mathrm{BeP}(H)$ and $H \sim \mathrm{BP}(a, b, H_0)$, then by marginalizing $H$ we find that $E[m_i] = \frac{a}{b}$.

Another important aspect of the BP-FA model is that the vector $\pi$ enforces sparseness on the same subset of factors for all observations. In comparison, consider the model where $z_i$ is removed and sparseness is enforced by sampling the elements of $w_i$ iid from a sparseness-inducing normal-gamma prior. This is equivalent to learning multiple relevance vector machines (Tipping, 2001) with a jointly learned and shared $\Phi$ matrix. A theoretical issue with this model is that the prior does not induce sparseness on the same subset of latent factors. As $K \to \infty$, all factors will be used sparsely with equal probability and, therefore, no factors will be shared. This is conceptually similar to the problem of drawing multiple times from a Dirichlet process prior, where individual draws are sparse, but no two draws are sparse on the same subset of atoms. We note that the hierarchical Dirichlet process has been introduced to resolve this particular issue (Teh et al., 2006).

4. Variational Bayesian Inference

In this section, we derive a variational Bayesian algorithm (Beal, 2003) to perform fast inference for the weighted BP-FA model of (11). This is aided by the conjugacy of the beta to the Bernoulli process, where the posterior for the single-parameter beta process is

$$H\,|\,X_1, \dots, X_N \sim \mathrm{BP}\Big(\alpha H_0 + \sum_{i=1}^{N} X_i\Big) \qquad (14)$$

with $X_i \sim \mathrm{BeP}(H)$ being the $i$-th sample from a Bernoulli process parameterized by $H$. The two-parameter extension has a similar posterior update, though not as compact a written form.

In the following, we define $x_i^{-k} \equiv x_i - \Phi^{-k}(z_i^{-k} \circ w_i^{-k})$, where $\Phi^{-k}$, $z_i^{-k}$ and $w_i^{-k}$ are the matrix/vectors with the $k$-th column/element removed; this is simply the portion of $x_i$ remaining when all but the $k$-th factor is considered. Also, for clarity, we have suppressed certain equation numbers and conditional variables.

4.1. The VB-E Step

Update for $Z$: $p(z_{ik}\,|\,x_i, \Phi, w_i, z_i^{-k}) \propto p(x_i\,|\,z_{ik}, \Phi, w_i, z_i^{-k})\,p(z_{ik}\,|\,\pi)$

The probability that $z_{ik} = 1$ is proportional to

$$\exp\big[\langle \ln \pi_k \rangle\big] \times \exp\Big[-\frac{1}{2\sigma_n^2}\Big(\langle w_{ik}^2\rangle \langle \phi_k^T\phi_k\rangle - 2\langle w_{ik}\rangle \langle \phi_k\rangle^T \langle x_i^{-k}\rangle\Big)\Big]$$

where $\langle\cdot\rangle$ indicates the expectation. The probability that $z_{ik} = 0$ is proportional to $\exp[\langle \ln(1 - \pi_k)\rangle]$. The expectations can be calculated as

$$\langle \ln \pi_k \rangle = \psi\Big(\frac{a}{K} + n_k\Big) - \psi\Big(\frac{a + b(K-1)}{K} + N\Big)$$
$$\langle \ln(1 - \pi_k) \rangle = \psi\Big(\frac{b(K-1)}{K} + N - n_k\Big) - \psi\Big(\frac{a + b(K-1)}{K} + N\Big)$$

where $\psi(\cdot)$ represents the digamma function, and

$$\langle w_{ik}^2 \rangle = \langle w_{ik}\rangle^2 + \Sigma_i^{(k)} \qquad (15)$$
$$\langle \phi_k^T\phi_k \rangle = \langle \phi_k\rangle^T\langle \phi_k\rangle + \mathrm{trace}(\Sigma_k) \qquad (16)$$

where $n_k$ is defined in the update for $\pi$, $\Sigma_k$ in the update for $\Phi$, and $\Sigma_i^{(k)}$ is the $k$-th diagonal element of $\Sigma_i$, defined in the update for $W$.
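As an illustration of the E-step, a minimal Python sketch of the $\langle z_{ik}\rangle$ update follows, assuming numpy/scipy and that the variational moments ($\langle w_{ik}\rangle$, $\langle w_{ik}^2\rangle$, $\langle\phi_k\rangle$, $\langle\phi_k^T\phi_k\rangle$, $n_k$) are maintained as arrays by the surrounding algorithm. The function and variable names are ours, not the authors'.

```python
import numpy as np
from scipy.special import digamma, expit

def vb_e_step_z(X, EZ, EW, EW2, EPhi, EPhiTPhi, n_k, a, b, sigma_n2):
    """One coordinate-ascent sweep of the VB-E update for <z_ik> (Section 4.1).
    EZ, EW, EW2 are K x N arrays of moments, EPhi is D x K, and
    EPhiTPhi[k] = <phi_k^T phi_k>.  Illustrative sketch only."""
    K, N = EZ.shape
    denom = digamma((a + b * (K - 1)) / K + N)
    ln_pi = digamma(a / K + n_k) - denom
    ln_1mpi = digamma(b * (K - 1) / K + N - n_k) - denom
    for i in range(N):
        for k in range(K):
            # residual of x_i with factor k removed: <x_i^{-k}>
            x_res = X[:, i] - EPhi @ (EZ[:, i] * EW[:, i]) \
                    + EPhi[:, k] * EZ[k, i] * EW[k, i]
            quad = EW2[k, i] * EPhiTPhi[k] - 2.0 * EW[k, i] * (EPhi[:, k] @ x_res)
            # <z_ik> is the logistic of the log-odds between the two cases
            EZ[k, i] = expit(ln_pi[k] - ln_1mpi[k] - quad / (2.0 * sigma_n2))
    return EZ
```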
4.2. The VB-M Step

Update for $\pi$: $p(\pi_k\,|\,Z) \propto p(Z\,|\,\pi_k)\,p(\pi_k\,|\,a, b, K)$

The posterior of $\pi_k$ can be shown to be

$$\pi_k \sim \mathrm{Beta}\Big(\frac{a}{K} + n_k,\ \frac{b(K-1)}{K} + N - n_k\Big)$$

where $n_k = \sum_{i=1}^{N}\langle z_{ik}\rangle$ can be calculated from the VB-E step. The priors $a, b$ can be tuned according to the discussion in Section 2.1. We recall that $\sum_{i=1}^{N}\frac{a}{b+i-1}$ is the expected total number of factors, while $a/b$ is the expected number of factors used by a single observation in the limiting case.

Update for $\Phi$: $p(\phi_k\,|\,X, \Phi^{-k}, Z, W) \propto p(X\,|\,\phi_k, \Phi^{-k}, Z, W)\,p(\phi_k\,|\,\Sigma)$

The posterior of $\phi_k$ can be shown to be normal with mean, $\mu_k$, and covariance, $\Sigma_k$, equal to

$$\Sigma_k = \Big(\frac{1}{\sigma_n^2}\sum_{i=1}^{N}\langle z_{ik}\rangle\langle w_{ik}^2\rangle\,I + \Sigma^{-1}\Big)^{-1} \qquad (17)$$
$$\mu_k = \frac{1}{\sigma_n^2}\,\Sigma_k\sum_{i=1}^{N}\langle z_{ik}\rangle\langle w_{ik}\rangle\langle x_i^{-k}\rangle \qquad (18)$$

with $\langle w_{ik}^2\rangle$ given in (15). The prior $\Sigma$ can be set to the empirical covariance of the data, $X$.

Update for $W$: $p(w_i\,|\,x_i, \Phi, z_i) \propto p(x_i\,|\,w_i, \Phi, z_i)\,p(w_i\,|\,\sigma_w^2)$

The posterior of $w_i$ can be shown to be multivariate normal with mean, $\langle w_i\rangle$, and covariance, $\Sigma_i$, equal to

$$\Sigma_i = \Big(\frac{1}{\sigma_n^2}\langle \tilde{\Phi}_i^T\tilde{\Phi}_i\rangle + \frac{1}{\sigma_w^2}I\Big)^{-1} \qquad (19)$$
$$\langle w_i\rangle = \frac{1}{\sigma_n^2}\,\Sigma_i\,\langle \tilde{\Phi}_i\rangle^T x_i \qquad (20)$$

where we define $\tilde{\Phi}_i \equiv \Phi \circ \tilde{Z}_i$ and $\tilde{Z}_i \equiv [z_i, \dots, z_i]^T$, with the $K$-dimensional vector $z_i$ repeated $D$ times. Given that $\tilde{\Phi}_i = \Phi \circ \tilde{Z}_i$, we can then calculate

$$\langle \tilde{\Phi}_i^T\tilde{\Phi}_i\rangle = \big(\langle\Phi\rangle^T\langle\Phi\rangle + A\big) \circ \big(\langle z_i\rangle\langle z_i\rangle^T + B_i\big) \qquad (21)$$

where $A$ and $B_i$ are calculated as follows,

$$A \equiv \mathrm{diag}\big[\mathrm{trace}(\Sigma_1), \dots, \mathrm{trace}(\Sigma_K)\big], \qquad B_i \equiv \mathrm{diag}\big[\langle z_{i1}\rangle(1-\langle z_{i1}\rangle), \dots, \langle z_{iK}\rangle(1-\langle z_{iK}\rangle)\big]$$

A prior, discussed below, can be placed on $\sigma_w^2$, removing the need to set this value.

Update for $\sigma_n^2$: $p(\sigma_n^2\,|\,X, \Phi, Z, W) \propto p(X\,|\,\Phi, Z, W, \sigma_n^2)\,p(\sigma_n^2)$

We can also infer the noise parameter, $\sigma_n^2$, by using an inverse-gamma, $\mathrm{InvGa}(c, d)$, prior. The posterior can be shown to be inverse-gamma with

$$c' = c + \frac{ND}{2}, \qquad d' = d + \frac{1}{2}\sum_{i=1}^{N}\big\langle \|x_i - \Phi(z_i \circ w_i)\|^2 \big\rangle \qquad (22)$$

where the expected squared residual expands in terms of the moments above as

$$\big\langle \|x_i - \Phi(z_i \circ w_i)\|^2 \big\rangle = x_i^T x_i - 2\sum_{k}\langle z_{ik}\rangle\langle w_{ik}\rangle\langle\phi_k\rangle^T x_i + \sum_{k}\langle z_{ik}\rangle\langle w_{ik}^2\rangle\langle\phi_k^T\phi_k\rangle + \sum_{k\neq l}\langle z_{ik}\rangle\langle z_{il}\rangle\big(\langle w_{ik}\rangle\langle w_{il}\rangle + \Sigma_{i,kl}\big)\langle\phi_k\rangle^T\langle\phi_l\rangle$$

with $\Sigma_{i,kl}$ the $(k,l)$ element of $\Sigma_i$. In the previous equations, $\sigma_n^{-2}$ can then be replaced by $\langle\sigma_n^{-2}\rangle = c'/d'$.

Update for $\sigma_w^2$: $p(\sigma_w^2\,|\,W) \propto p(W\,|\,\sigma_w^2)\,p(\sigma_w^2)$

Given a conjugate $\mathrm{InvGa}(e, f)$ prior, the posterior of $\sigma_w^2$ is also inverse-gamma with

$$e' = e + \frac{NK}{2} \qquad (23)$$
$$f' = f + \frac{1}{2}\sum_{i=1}^{N}\big(\langle w_i\rangle^T\langle w_i\rangle + \mathrm{trace}(\Sigma_i)\big) \qquad (24)$$

A code sketch of the $\pi$ and $\phi_k$ updates is given at the end of this section.

4.3. Accelerated VB Inference

As with the Dirichlet process, there is a tradeoff in variational inference for the BP-FA: the larger $K$ is set, the more accurate the model should be, but the slower the inference. We here briefly mention a simple remedy for this problem. Following every iteration, the total factor membership expectations, $\{n_k\}_{k=1}^{K}$, can be used to assess the relevancy of a particular factor. When this number falls below a small threshold (e.g., $10^{-16}$), the factor index can be skipped in following iterations with minimal impact on the convergence of the algorithm. In this way, the algorithm should converge more quickly as the number of iterations increases.

4.4. Prediction for New Observations

Given the outputs, $\{\pi, \Phi\}$, the vectors $z_*$ and $w_*$ can be inferred for a new observation, $x_*$, using a MAP-EM inference algorithm that iterates between $z_*$ and $w_*$. The equations are similar to those detailed above, with inference for $\pi$ and $\Phi$ removed.
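For concreteness, the following Python sketch implements the M-step updates of the Beta posterior for $\pi_k$ and the Gaussian posterior for $\phi_k$ from Section 4.2 (eqs. 17-18), using the same array conventions as the E-step sketch above. The naming and interface are again our own illustration, not the authors' implementation.

```python
import numpy as np

def vb_m_step_pi_phi(X, EZ, EW, EW2, EPhi, SigmaPhi, Sigma_prior_inv,
                     sigma_n2, a, b):
    """VB-M updates for the Beta posterior parameters of pi_k and for the
    factor loadings phi_k, eqs. (17)-(18).  SigmaPhi is a (K, D, D) array of
    posterior covariances updated in place.  Illustrative sketch only."""
    K, N = EZ.shape
    D = X.shape[0]
    n_k = EZ.sum(axis=1)
    alpha_k = a / K + n_k                      # Beta posterior, first parameter
    beta_k = b * (K - 1) / K + N - n_k         # Beta posterior, second parameter
    for k in range(K):
        # eq. (17): posterior covariance of phi_k
        prec = (EZ[k] * EW2[k]).sum() / sigma_n2 * np.eye(D) + Sigma_prior_inv
        SigmaPhi[k] = np.linalg.inv(prec)
        # columns of resid are <x_i^{-k}>, the data with factor k removed
        resid = X - EPhi @ (EZ * EW) + np.outer(EPhi[:, k], EZ[k] * EW[k])
        rhs = (resid * (EZ[k] * EW[k])[None, :]).sum(axis=1) / sigma_n2
        EPhi[:, k] = SigmaPhi[k] @ rhs         # eq. (18): posterior mean mu_k
    return alpha_k, beta_k, EPhi, SigmaPhi
```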
5. Experiments

Factor analysis models are useful in many applications, for example, for dimensionality reduction in gene expression analysis (West, 2003). In this section, we demonstrate the performance of the BP-FA model on synthetic data, and apply it to the MNIST digits and HGDP-CEPH cell line panel (Rosenberg et al., 2002) datasets.

5.1. A Synthetic Example

For our synthetic example, we generated $H$ from the previously discussed approximation to the beta process with $a, b = 1$, $K = 100$ and $\phi_k \sim N(0, I)$ in a $D = 25$ dimensional space. We generated $N = 250$ samples from a Bernoulli process parameterized by $H$ and synthesized $X$ with $W = 1$ and $\sigma_n^2 = 0.0675$. Below, we show results for the model having the highest likelihood selected from five runs, though the results were in general consistent.

In Figure 3, we display the ground truth (top) of $Z$, rearranged for display purposes. We note that only seven factors were actually used, while several observations contain no factors at all and are thus pure noise. We initialized our model to $K = 100$ factors, though as the results show (bottom), only a small subset were ultimately used. The inferred $\sigma_n^2 = 0.0625$ and the elementwise MSE of 0.0186 to the true $\Phi Z$ further indicate good performance. For this example, the BP-FA model was able to accurately uncover the underlying latent structure of the dataset.

Figure 3. Synthetic data: latent factor indicators, $Z$, for the true (top) and inferred (bottom) models.

5.2. MNIST Handwritten Digits Dataset

We trained our BP-FA model on $N = 2500$ odd digits (500 each) from the MNIST digits dataset. Using PCA, we reduced the dimensionality to $D = 350$, which preserved over 99.5% of the total variance within the data. We truncated the BP-FA model to $K = 100$ factors, initialized using the K-means algorithm, and ran five times, selecting the run with the highest likelihood, though again the results were consistent.

In Figure 4 (left), we show the factor sharing across the digits, computed by calculating the expected number of factors shared between two observations and normalizing by the largest value (0.58); larger boxes indicate more sharing. At right, we show for each of the odd digits the most commonly used factor, followed by the second most used factor given the factor to the left. Of particular interest are the digits 3 and 5, which heavily share the same factor, followed by a factor that differentiates the two numbers.

Figure 4. Left: expected factor sharing between digits. Right: (left) most frequently used factor for each digit; (right) most used second factor per digit given the left factor.

In Figure 5 (top), we plot the sorted values of $\pi$ inferred by the algorithm. As can be seen, the algorithm inferred a sparse set of factors, fewer than the 100 initially provided. In Figure 5 (bottom), we show an example of a reconstruction of the number 3 that uses four factors. As can be seen, no single factor can individually approximate the truth as well as their weighted linear combination. We note that the BP-FA model was fast, requiring 35 iterations on average to converge and approximately 30 minutes for each run on a 2.66 GHz processor.

Figure 5. (top) Inferred $\pi$, indicating sparse factor usage. (bottom) An example reconstruction.
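The sharing maps of Figures 4 and 6 are built from the expected number of factors shared by a pair of observations, $\sum_k \langle z_{ik}\rangle\langle z_{jk}\rangle$. The exact class-level aggregation is not spelled out in the text, so the sketch below shows one plausible reading (average over observation pairs within each pair of classes, then normalize by the maximum); the function name is our own.

```python
import numpy as np

def factor_sharing_map(EZ, labels):
    """Class-level factor-sharing map from the K x N matrix of <z_ik>.
    One plausible aggregation; not necessarily the one used for the figures."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    share = EZ.T @ EZ        # (N, N): expected number of shared factors per pair
    M = np.zeros((len(classes), len(classes)))
    for p, cp in enumerate(classes):
        for q, cq in enumerate(classes):
            M[p, q] = share[np.ix_(labels == cp, labels == cq)].mean()
    return M / M.max()       # normalize by the largest value, as in the paper
```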
5.3. HGDP-CEPH Cell Line Panel

The HGDP-CEPH Human Genome Diversity Cell Line Panel (Rosenberg et al., 2002) is a dataset comprising genotypes at $D = 377$ autosomal microsatellite loci sampled from $N = 1056$ individuals in 52 populations across the major geographic regions of the world. It is useful for inferring human evolutionary history and migration. We ran our model on this dataset initialized with $K = 100$ factors, though again, only a subset were significantly used. Figure 6 contains the sharing map, calculated as for the MNIST dataset and normalized by 0.55. We note the slight differentiation between the Middle East and European regions, a previous issue for this dataset (Rosenberg et al., 2002).

Figure 6. Factor sharing across geographic regions.

We also highlight the use of BP-FA in denoising. Figure 8 shows the original HGDP-CEPH data, as well as the $\Phi(Z \circ W)$ reconstruction, projected onto the first 20 principal components of the raw data. The figure shows how the BP-FA model was able to substantially reduce the noise level within the data while still retaining the essential structure. This is also evident in Figure 7, where we plot the variance along these same principal components for the first 150 dimensions. For an apparently noisy dataset such as this, BP-FA can potentially be useful as a preprocessing step in conjunction with other algorithms, in this case, for example, the Structure (Rosenberg et al., 2002) or recently proposed mStruct (Shringarpure & Xing, 2008) algorithms.

Figure 7. Variance of the HGDP-CEPH data along the first 150 principal components of the raw features, for the original and reconstructed data.

Figure 8. HGDP-CEPH features projected onto the first 20 principal components of the raw features for the (top) original and (bottom) reconstructed data. The broad geographic breakdown is given between the images.

6. Conclusion and Future Work

We have proposed a beta process factor analysis (BP-FA) model for performing nonparametric factor analysis with a potentially infinite number of factors. As with the Dirichlet process prior used for mixture modeling, the beta process is a fully Bayesian prior that assures the sharing of a sparse subset of factors among all observations. Taking advantage of conjugacy within the model, a variational Bayes algorithm was developed for fast model inference, requiring an approximation comparable to the finite Dirichlet distribution's approximation of the infinite Dirichlet process. Results were shown on synthetic data, as well as the MNIST handwritten digits and HGDP-CEPH cell line panel datasets.

While several nonparametric factor analysis models have been proposed for applications such as independent components analysis (Knowles & Ghahramani, 2007) and gene expression analysis (Rai & Daumé, 2008; Meeds et al., 2007), these models rely on the Indian buffet process and therefore do not have fast variational solutions; providing such a solution is an intended contribution of this paper. Furthermore, while the formal link has been made between the IBP and the beta process (Thibaux & Jordan, 2007), we believe our further development and application to factor analysis to be novel. In future work, the authors plan to develop a stick-breaking process for drawing directly from the beta process, similar to (Sethuraman, 1994) for drawing from the Dirichlet process, which will remove the need for finite approximations.

References

Aldous, D. (1985). Exchangeability and related topics. École d'été de probabilités de Saint-Flour XIII-1983, 1-198.
Beal, M. (2003). Variational algorithms for approximate Bayesian inference. Doctoral dissertation, Gatsby Computational Neuroscience Unit, University College London.

Billingsley, P. (1995). Probability and measure (3rd edition). Wiley, New York.

Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1, 209-230.

Griffiths, T. L., & Ghahramani, Z. (2005). Infinite latent feature models and the Indian buffet process. Advances in Neural Information Processing Systems (pp. 475-482).

Hjort, N. L. (1990). Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics, 18(3), 1259-1294.

Knowles, D., & Ghahramani, Z. (2007). Infinite sparse factor analysis and infinite independent components analysis. 7th International Conference on Independent Component Analysis and Signal Separation.

Meeds, E., Ghahramani, Z., Neal, R., & Roweis, S. (2007). Modeling dyadic data with binary latent factors. Advances in Neural Information Processing Systems (pp. 977-984).

Paisley, J., & Carin, L. (2009). A stick-breaking construction of the beta process (Technical Report). Duke University, ee.duke.edu/jwp4/StickBP.pdf.

Rai, P., & Daumé, H. (2008). The infinite hierarchical factor regression model. Advances in Neural Information Processing Systems.

Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A., & Feldman, M. W. (2002). Genetic structure of human populations. Science, 298, 2381-2385.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639-650.

Shringarpure, S., & Xing, E. P. (2008). mStruct: A new admixture model for inference of population structure in light of both genetic admixing and allele mutation. Proceedings of the 25th International Conference on Machine Learning (pp. 952-959).

Teh, Y. W., Görür, D., & Ghahramani, Z. (2007). Stick-breaking construction for the Indian buffet process. Proceedings of the International Conference on Artificial Intelligence and Statistics.

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101, 1566-1581.

Thibaux, R., & Jordan, M. I. (2007). Hierarchical beta processes and the Indian buffet process. International Conference on Artificial Intelligence and Statistics.

Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211-244.

West, M. (2003). Bayesian factor regression models in the "large p, small n" paradigm. Bayesian Statistics, 7, 723-732.