--PREPRINT--

Collapsed Variational Inference for HDP

Yee Whye Teh, Gatsby Unit, University College London, ywteh@gatsby.ucl.ac.uk
Kenichi Kurihara, Dept. of Computer Science, Tokyo Institute of Technology, kurihara@mi.cs.titech.ac.jp
Max Welling, ICS, UC Irvine, welling@ics.uci.edu

Abstract

A wide variety of Dirichlet-multinomial `topic' models have found interesting applications in recent years. While Gibbs sampling remains an important method of inference in such models, variational techniques have certain advantages such as easy assessment of convergence, easy optimization without the need to maintain detailed balance, a bound on the marginal likelihood, and side-stepping of issues with topic identifiability. The most accurate variational technique thus far, namely collapsed variational LDA (CV-LDA) [1], did not deal with model selection, nor did it include inference for hyperparameters. We address both issues by generalizing their technique, obtaining the first variational algorithm to deal with the HDP and with hyperparameters of Dirichlet variables. Experiments show a very significant improvement in accuracy relative to CV-LDA.

1 Introduction

Many applications of graphical models have traditionally dealt with discrete state spaces, where each variable is multinomial distributed given its parents [2]. Without strong prior knowledge on the structure of dependencies between variables and their parents, the typical Bayesian prior over parameters has been the Dirichlet distribution. This is because the Dirichlet prior is conjugate to the multinomial, leading to simple and efficient computations for both the posterior over parameters and the marginal likelihood of data. When there are latent or unobserved variables, the variational Bayesian approach to posterior estimation, where the latent variables are assumed independent from the parameters, has proven successful [3].

In recent years there has been a proliferation of graphical models composed of a multitude of multinomial and Dirichlet variables interacting in various inventive ways. The major classes include latent Dirichlet allocation (LDA) [4] and the many other topic models inspired by it, and the hierarchical Dirichlet process (HDP) [5] and the many other nonparametric models based on the Dirichlet process (DP). LDA pioneered the use of Dirichlet-distributed latent variables to represent shades of membership to different clusters or topics, while the HDP pioneered the use of nonparametric models to sidestep the need for model selection.

For these Dirichlet-multinomial models the inference method of choice is typically collapsed Gibbs sampling, due to its simplicity, speed, and good predictive performance on test sets. However, there are drawbacks as well: it is often hard to assess convergence of the Markov chains; it is harder still to accurately estimate the marginal probability of the training data or the predictive probability of test data (if latent variables are associated with the test data); averaging topic-dependent quantities based on samples is not well defined because the topic labels may have switched during sampling; and avoiding local optima through `large MCMC moves' such as split and merge algorithms is tricky to implement because of the need to preserve detailed balance. Thus there seems to be a genuine need to consider alternatives to sampling.

For LDA and its cousins, there are alternatives based on variational Bayesian (VB) approximations [4] and on expectation propagation (EP) [6].
In [7] it was found that EP was not efficient enough for large scale applications, while VB suffered from significant bias resulting in worse predictive performance than Gibbs sampling. [1] addressed these issues by proposing an improved VB approximation for LDA based on the idea of collapsing, that is, integrating out the parameters while assuming that the other latent variables are independent. As for nonparametric models, a number of VB approximations have been proposed for DP mixture models [8, 9], while to our knowledge none has been proposed for the HDP thus far.

In this paper we investigate a new variational approach to inference for this class of Dirichlet-multinomial models. To be concrete we focus our attention on an application of the HDP to topic modelling [5]. Our approach is an extension of the collapsed VB approximation for LDA presented in [1], and represents the first VB approximation to the HDP, which we shall call `collapsed variational HDP', or CV-HDP for short. The advantage of CV-HDP over CV-LDA is that the optimal number of variational components is not finite. This implies, apart from local optima, that we can keep adding components indefinitely while the algorithm will take care of removing unnecessary clusters. Ours is also the first variational algorithm to treat posterior distributions over the hyperparameters of Dirichlet variables, and we show experimentally that this results in very significant improvements in both the variational bound and test-set likelihood. We expect our approach to be generally applicable to a wide variety of models in the same vein as LDA and HDP.

2 A Nonparametric Hierarchical Bayesian Model for LDA

We consider a document model where each document in a corpus is modelled as a mixture over topics, and each topic is a distribution over words in the vocabulary; see Figure 1. Let there be D documents in the corpus, and W words in the vocabulary. For each document d = 1, ..., D, let θ_d be a vector of mixing proportions over topics. For each topic k, let φ_k be a vector of probabilities for words in that topic. Words in each document are drawn as follows: first choose a topic k with probability θ_{dk}, then choose a word w with probability φ_{kw}. Let x_{id} be the i-th word token in document d, and z_{id} its chosen topic. We have

z_{id} | θ_d ∼ Mult(θ_d)        x_{id} | z_{id}, φ_{z_{id}} ∼ Mult(φ_{z_{id}})    (1)

We place Dirichlet priors on the parameters θ_d and φ_k,

θ_d | π ∼ Dir(απ)        φ_k | τ ∼ Dir(βτ)    (2)

where π is the corpus-wide distribution over topics, τ is the corpus-wide distribution over the vocabulary, and α and β are concentration parameters describing how close θ_d and φ_k are to their respective prior means π and τ.

As we do not know the number of topics a priori, and would like a model that can determine this automatically, we consider a nonparametric extension based on the hierarchical Dirichlet process (HDP) formalism of [5]. Specifically, we have a countably infinite number of topics (thus θ_d and π are infinite-dimensional vectors), and we use a stick-breaking representation [10] for π:

π̃_k | γ ∼ Beta(1, γ)        π_k = π̃_k ∏_{l=1}^{k−1} (1 − π̃_l)        for k = 1, 2, ...    (3)

In the usual Dirichlet process notation, we would equivalently have G_d ∼ DP(α, G_0) and G_0 ∼ DP(γ, Dir(βτ)), where G_d = Σ_k θ_{dk} δ_{φ_k} and G_0 = Σ_k π_k δ_{φ_k} are sums of point masses, and Dir(βτ) is the base distribution.
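The generative process in (1)-(3) is easy to simulate, which can help build intuition for the model. The following is a minimal sketch of our own (not part of the paper): it truncates the stick-breaking construction at a finite K purely for illustration and renormalizes the truncated sticks; all names and default settings are our own choices, and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_corpus(D=5, W=20, K=10, n_words=50, alpha=1.0, beta=1.0, gamma=1.0):
    """Sample a toy corpus from the (truncated) HDP-LDA generative process (1)-(3)."""
    # Stick-breaking weights over topics, eq. (3), truncated at K components.
    v = rng.beta(1.0, gamma, size=K)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    pi /= pi.sum()                               # renormalize the truncated sticks
    tau = np.full(W, 1.0 / W)                    # corpus-wide word distribution
    phi = rng.dirichlet(beta * tau, size=K)      # topics: phi_k ~ Dir(beta * tau), eq. (2)
    docs = []
    for d in range(D):
        theta = rng.dirichlet(alpha * pi)        # mixing proportions: theta_d ~ Dir(alpha * pi)
        z = rng.choice(K, size=n_words, p=theta)             # topic assignments, eq. (1)
        x = np.array([rng.choice(W, p=phi[k]) for k in z])   # word tokens, eq. (1)
        docs.append((z, x))
    return pi, phi, docs
```

In the infinite model there is no such truncation; K here only serves to draw finite samples.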
Finally, in addition to the prior over π, we place priors over the other hyperparameters α, β, γ and τ of the model as well:

α ∼ Gamma(a_α, b_α)        β ∼ Gamma(a_β, b_β)        γ ∼ Gamma(a_γ, b_γ)        τ ∼ Dir(a_τ)    (4)

3 Collapsed Variational Bayesian Inference for HDP

There is substantial empirical evidence that marginalizing out the variables θ, φ is helpful for efficient inference. For instance, in [11] it was observed that Gibbs sampling enjoys better mixing, while in [1] it was shown that variational inference is more accurate in this collapsed space. In the following we will build on this experience and propose a variational inference algorithm for the HDP based upon first collapsing out the parameters and then re-expanding the system with auxiliary variables.

Figure 1: Graphical model for CV-HDP.

The algorithm is fully Bayesian in the sense that all parameter posteriors are treated exactly and full posterior distributions are maintained for all hyperparameters. The only assumptions made are independencies among the latent topic variables and hyperparameters, and that there is a finite upper bound on the number of topics used (this upper bound is found automatically). The only inputs required of the modeller are the values of the top-level parameters a_α, b_α, ....

3.1 Replacing parameters with auxiliary variables

In order to obtain efficient variational updates, we shall replace the parameters θ = {θ_d} and φ = {φ_k} with auxiliary variables. Specifically, we first integrate out the parameters; this gives a joint distribution over latent variables z = {z_{id}} and word tokens x = {x_{id}} as follows:

p(z, x | α, β, γ, π, τ) = ∏_{d=1}^D [ Γ(α)/Γ(α + n_{d··}) ∏_{k=1}^K Γ(απ_k + n_{dk·})/Γ(απ_k) ] ∏_{k=1}^K [ Γ(β)/Γ(β + n_{·k·}) ∏_{w=1}^W Γ(βτ_w + n_{·kw})/Γ(βτ_w) ]    (5)

with n_{dkw} = #{i : x_{id} = w, z_{id} = k}, a dot denoting summation over that index, and K denoting an index such that z_{id} ≤ K for all i, d. The ratios of gamma functions in (5) result from the normalization constants of the Dirichlet densities of θ and φ, and prove to be nuisances for updating the hyperparameter posteriors. Thus we introduce four sets of auxiliary variables: η_d and ξ_k taking values in [0, 1], and s_{dk} and t_{kw} taking integral values. This results in a joint probability distribution over an expanded system,

p(z, x, η, ξ, s, t | α, β, γ, π, τ) = ∏_{d=1}^D [ η_d^{α−1}(1−η_d)^{n_{d··}−1} / Γ(n_{d··}) ∏_{k=1}^K S(n_{dk·}, s_{dk}) (απ_k)^{s_{dk}} ] ∏_{k=1}^K [ ξ_k^{β−1}(1−ξ_k)^{n_{·k·}−1} / Γ(n_{·k·}) ∏_{w=1}^W S(n_{·kw}, t_{kw}) (βτ_w)^{t_{kw}} ]    (6)

where S(n, m) denotes the unsigned Stirling numbers of the first kind, and bold-face letters denote sets of the corresponding variables. It can readily be verified that marginalizing out η, ξ, s and t reduces (6) to (5). The main insight is that, conditioned on z and x, the auxiliary variables are independent and have well-known distributions. Specifically, η_d and ξ_k are Beta distributed, while s_{dk} (respectively t_{kw}) is the random number of occupied tables in a Chinese restaurant process with n_{dk·} (respectively n_{·kw}) customers and a strength parameter of απ_k (respectively βτ_w) [12, 5].
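As a concrete illustration of these auxiliary-variable distributions (this sketch is ours, not part of the paper's algorithm), the table-count law can be tabulated exactly for small counts: the number of occupied tables in a Chinese restaurant process with n customers and strength a is a sum of independent Bernoulli indicators with success probabilities a/(a+i), i = 0, ..., n−1, which is equivalent to q(s = m) ∝ S(n, m) a^m. For s_{dk} the strength a plays the role of απ_k and n that of n_{dk·}; analogously for t_{kw} with βτ_w and n_{·kw}.

```python
import numpy as np

def table_count_pmf(n, a):
    """Exact pmf over the number of occupied tables in a CRP with n customers
    and strength a, i.e. p(m) = S(n, m) a^m Gamma(a) / Gamma(a + n).
    Built by convolving the independent Bernoulli(a / (a + i)) indicators of
    'customer i+1 opens a new table'."""
    pmf = np.array([1.0])                        # zero customers -> zero tables
    for i in range(n):
        p_new = a / (a + i)                      # customer i+1 opens a new table
        pmf = np.concatenate((pmf, [0.0])) * (1.0 - p_new) \
            + np.concatenate(([0.0], pmf)) * p_new
    return pmf                                   # pmf[m] = p(number of tables = m)

# For a fixed count n the mean number of tables is a * (psi(a + n) - psi(a));
# it is the average of this quantity over the uncertain count n that is
# approximated in equation (12) below.
```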
Finally, we assume the following form for the variational posterior over the expanded system:

q(z, η, ξ, s, t, α, β, γ, π, τ) = q(α) q(β) q(γ) q(τ) q(π) q(η, ξ, s, t | z) ∏_{d=1}^D ∏_{i=1}^{n_{d··}} q(z_{id})    (7)

where the dependence of the auxiliary variables on z is modelled exactly. Given the above factorization, q(π) further factorizes so that the π̃_k's are independent, as does the posterior over the auxiliary variables. For computational tractability, we also truncated our posterior representation to K topics. Specifically, we assumed that q(z_{id} > K) = 0 for every i and d. A consequence is that the observations have no effect on π̃_k and φ_k for all k > K, and these parameters can be exactly marginalized out. Notice that our approach to truncation is different from that in [8], who implemented a truncation at T by instead fixing the posterior for the stick weight, q(v_T = 1) = 1, and from that in [9], who assumed that the variational posteriors for parameters beyond the truncation level are set at their priors. Our truncation approximation is nested like that in [9], and unlike that in [8]. Our approach is also simpler than that in [9], which requires computing an infinite sum that is intractable in the case of HDPs. We shall treat K as a parameter of the variational approximation, possibly optimized by iteratively splitting or merging topics (though we have not explored this in this paper; see the discussion section). As in [9], we reordered the topic labels such that E[n_{·1·}] > E[n_{·2·}] > ···. An expression for the variational bound on the marginal log-likelihood is given in Appendix A.

3.2 Variational Updates

In this section we shall derive the complete set of variational updates for the system. In the following, E[y] denotes the expectation of y, G[y] = e^{E[log y]} the geometric expectation, and V[y] = E[y^2] − E[y]^2 the variance. Let Ψ(y) = d log Γ(y)/dy be the digamma function. We shall also employ index summation shorthands: a dot (·) sums out that index, while >l sums over indices greater than l.

Hyperparameters. Updates for the hyperparameters are derived using the standard fully factorized variational approach, since they are assumed independent from each other and from the other variables. For completeness we list these here, noting that α, β and γ are gamma distributed in the posterior, the π̃_k's are beta distributed, and τ is Dirichlet distributed:

q(α) ∝ α^{a_α + E[s_{··}] − 1} e^{−α(b_α − Σ_d E[log η_d])}
q(β) ∝ β^{a_β + E[t_{··}] − 1} e^{−β(b_β − Σ_k E[log ξ_k])}
q(γ) ∝ γ^{a_γ + K − 1} e^{−γ(b_γ − Σ_{k=1}^K E[log(1 − π̃_k)])}
q(π̃_k) ∝ π̃_k^{E[s_{·k}]} (1 − π̃_k)^{E[γ] + E[s_{·>k}] − 1}
q(τ) ∝ ∏_{w=1}^W τ_w^{a_τ + E[t_{·w}] − 1}    (8)

In subsequent updates we will need averages and geometric averages of these quantities, which can be extracted using the following identities: if p(x) ∝ x^{a−1} e^{−bx} then E[x] = a/b and G[x] = e^{Ψ(a)}/b; if p(x) ∝ ∏_k x_k^{a_k−1} then G[x_k] = e^{Ψ(a_k)}/e^{Ψ(Σ_k a_k)}. Note also that, due to the independence assumptions in the variational posterior, we have G[απ_k] = G[α]G[π_k], G[βτ_w] = G[β]G[τ_w] and G[π_k] = G[π̃_k] ∏_{l=1}^{k−1} G[1 − π̃_l].
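The Gamma and Dirichlet identities just quoted translate directly into code. The following helpers are our own illustration (SciPy's digamma is assumed); the numbers in the usage line are purely hypothetical stand-ins for the sufficient statistics appearing in (8).

```python
import numpy as np
from scipy.special import digamma

def gamma_moments(a, b):
    """For p(x) proportional to x^(a-1) e^(-b x): E[x] = a/b, G[x] = exp(psi(a))/b."""
    return a / b, np.exp(digamma(a)) / b

def dirichlet_geometric(a):
    """For p(x) proportional to prod_k x_k^(a_k - 1): G[x_k] = exp(psi(a_k) - psi(sum_k a_k))."""
    a = np.asarray(a, dtype=float)
    return np.exp(digamma(a) - digamma(a.sum()))

# Hypothetical numbers: q(alpha) in (8) is Gamma with shape a_alpha + E[s..]
# and rate b_alpha - sum_d E[log eta_d]  (the sum of E[log eta_d] is negative).
E_alpha, G_alpha = gamma_moments(a=4.0 + 120.0, b=2.0 + 35.0)
```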
Auxiliary variables. The variational posteriors for the auxiliary variables depend on z and are conceptually similar to the computation of the variational posterior over parameters in [1]. η_d and ξ_k are beta distributed. If n_{dk·} = 0 then q(s_{dk} = 0) = 1, otherwise q(s_{dk} = m) > 0 only for 1 ≤ m ≤ n_{dk·}; similarly for t_{kw}. The posteriors are:

q(η_d | z) ∝ η_d^{E[α]−1} (1 − η_d)^{n_{d··}−1}
q(ξ_k | z) ∝ ξ_k^{E[β]−1} (1 − ξ_k)^{n_{·k·}−1}
q(s_{dk} = m | z) ∝ S(n_{dk·}, m) (G[απ_k])^m
q(t_{kw} = m | z) ∝ S(n_{·kw}, m) (G[βτ_w])^m    (9)

To obtain the expectations of the auxiliary variables in (8) we will have to average over z as well. For η_d this is simply E[log η_d] = Ψ(E[α]) − Ψ(E[α] + n_{d··}), where n_{d··} is the (fixed) number of words in document d. Unfortunately, for the other auxiliary variables these expectations depend on counts which can take on many values, and a naive computation can be expensive. We derive computationally tractable approximations based upon an improvement to the Gaussian approximation in [1]. As we see in the experiments, these approximations are very accurate. Consider E[log ξ_k]. We have

E[log ξ_k | z] = Ψ(E[β]) − Ψ(E[β] + n_{·k·})    (10)

and we need to average over n_{·k·} as well. [1] tackled a similar problem, with log in place of Ψ, using a Gaussian approximation for n_{·k·} and a second-order Taylor expansion of log. Unfortunately such an approximation failed to work in our case, as the digamma function Ψ(y) diverges much more quickly than log y at y = 0. Our solution is to treat the case n_{·k·} = 0 exactly, and to approximate n_{·k·} with a Gaussian when n_{·k·} > 0. This leads to the following approximation:

E[log ξ_k] ≈ P_+[n_{·k·}] ( Ψ(E[β]) − Ψ(E[β] + E_+[n_{·k·}]) − ½ V_+[n_{·k·}] Ψ′(E[β] + E_+[n_{·k·}]) )    (11)

where P_+ is the "probability of being positive" operator, P_+[y] = q(y > 0), and E_+[y], V_+[y] are the expectation and variance conditional on y > 0. The other two expectations are derived similarly, making use of the fact that s_{dk} and t_{kw} are distributed as the random numbers of occupied tables in Chinese restaurant processes:

E[s_{dk}] ≈ G[απ_k] P_+[n_{dk·}] ( Ψ(G[απ_k] + E_+[n_{dk·}]) − Ψ(G[απ_k]) + ½ V_+[n_{dk·}] Ψ′(G[απ_k] + E_+[n_{dk·}]) )
E[t_{kw}] ≈ G[βτ_w] P_+[n_{·kw}] ( Ψ(G[βτ_w] + E_+[n_{·kw}]) − Ψ(G[βτ_w]) + ½ V_+[n_{·kw}] Ψ′(G[βτ_w] + E_+[n_{·kw}]) )    (12)

As in [1], we can efficiently track the relevant quantities above by noting that each count is a sum of independent Bernoulli variables. Consider n_{dk·} as an example. We keep track of three quantities:

E[n_{dk·}] = Σ_i q(z_{id} = k)        V[n_{dk·}] = Σ_i q(z_{id} = k)(1 − q(z_{id} = k))        Z[n_{dk·}] = Σ_i log(1 − q(z_{id} = k))    (13)

Some simple algebraic manipulations now show that:

P_+[n_{dk·}] = 1 − e^{Z[n_{dk·}]}        E_+[n_{dk·}] = E[n_{dk·}] / P_+[n_{dk·}]        V_+[n_{dk·}] = V[n_{dk·}] / P_+[n_{dk·}] − e^{Z[n_{dk·}]} E_+[n_{dk·}]^2    (14)

Topic assignment variables. [1] showed that if the dependence of a set of variables, say A, on another set of variables, say z, is modelled exactly, then in deriving the updates for z we may equivalently integrate out A. Applying this to our situation with A = {η, ξ, s, t}, we obtain updates similar to those in [1], except that the hyperparameters are replaced by either their expectations or their geometric expectations, depending on which is used in the updates for the corresponding auxiliary variables:

q(z_{id} = k) ∝ ( G[απ_k] + E[n_{dk·}^{¬id}] ) ( G[βτ_{x_{id}}] + E[n_{·kx_{id}}^{¬id}] ) ( E[β] + E[n_{·k·}^{¬id}] )^{−1} × exp( − V[n_{dk·}^{¬id}] / (2(G[απ_k] + E[n_{dk·}^{¬id}])^2) − V[n_{·kx_{id}}^{¬id}] / (2(G[βτ_{x_{id}}] + E[n_{·kx_{id}}^{¬id}])^2) + V[n_{·k·}^{¬id}] / (2(E[β] + E[n_{·k·}^{¬id}])^2) )    (15)

where the superscript ¬id denotes counts with word token i in document d excluded.
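The count statistics in (13)-(14) and the zero-protected second-order approximation (11) are straightforward to implement. The sketch below is our own illustration of this bookkeeping (function names are ours; SciPy's digamma and trigamma are assumed); here q_k would hold q(z_{id} = k) for every token contributing to the count in question (all tokens in the corpus for n_{·k·}, the tokens of document d for n_{dk·}), and the same second-order pattern underlies (12) and the correction terms in (15).

```python
import numpy as np
from scipy.special import digamma, polygamma

def count_stats(q_k):
    """q_k[i] = q(z_i = k) for the Bernoulli indicators making up a count, eq. (13)."""
    E = q_k.sum()
    V = (q_k * (1.0 - q_k)).sum()
    Z = np.log1p(-q_k).sum()                     # sum_i log(1 - q(z_i = k))
    return E, V, Z

def positive_part(E, V, Z):
    """Probability of being positive and conditional moments, eq. (14)."""
    P = 1.0 - np.exp(Z)
    Ep = E / P
    Vp = V / P - np.exp(Z) * Ep**2
    return P, Ep, Vp

def E_log_xi_k(E_beta, q_k):
    """Approximation (11) to E[log xi_k]; the n_{.k.} = 0 case is handled exactly
    because the whole expression is scaled by P_+."""
    E, V, Z = count_stats(q_k)
    P, Ep, Vp = positive_part(E, V, Z)
    return P * (digamma(E_beta) - digamma(E_beta + Ep)
                - 0.5 * Vp * polygamma(1, E_beta + Ep))
```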
4 Experiments

We will report results on the following three datasets: A) KOS (vocabulary size W = 6906, number of documents D = 3430, number of word tokens N = 467,714); B) a subset of the Reuters dataset consisting of news topics with more than 300 documents each (W = 4593, D = 8433, N = 566,298); C) a subset of the 20Newsgroups dataset consisting of the topics `comp.os.ms-windows.misc', `rec.autos', `rec.sport.baseball', `sci.space' and `talk.politics.misc' (W = 8424, D = 4716, N = 437,850).

We have implemented five inference algorithms and compared their performance: 1) variational LDA (V-LDA) [4], 2) collapsed variational LDA (CV-LDA) [1], 3) collapsed variational HDP (CV-HDP, this paper), 4) collapsed Gibbs sampling for LDA (G-LDA) [11] and 5) the `direct assignment Gibbs sampler' for HDP (G-HDP) [5]. For G-HDP we obtained the code from Teh's website (http://www.gatsby.ucl.ac.uk/ywteh/research/software.html). The variables β, τ are not adapted in that code, so we fixed them at β = 100 and τ_w = 1/W for all algorithms (see below for a discussion regarding adapting these in CV-HDP). G-HDP was initialized with either 1 topic (G-HDP_1) or with 100 topics (G-HDP_100).

For CV-HDP we used the following initialization: E[β] = G[β] = 100 and G[τ_w] = 1/W (kept fixed to compare with G-HDP), E[α] = a_α/b_α, G[α] = e^{Ψ(a_α)}/b_α, E[γ] = a_γ/b_γ, G[π_k] = 1/K, and q(z_{id} = k) ∝ 1 + u with u ∼ U[0, 1]. We set the hyperparameters a_α, b_α, a_β, b_β in the range [2, 6], a_γ, b_γ in the range [5, 10], and a_τ in the range [30, 50]/W. (We actually set these values using a fixed but somewhat elaborate scheme, which is the reason they ended up different for each dataset. The scheme simply converts prior expectations about the number of topics and the amount of sharing into hyperparameter values, and the values were never tweaked. Since they always ended up in these compact ranges, and since we do not expect a strong dependence on their values inside these ranges, we omit the details.) The number of topics used in CV-HDP was truncated at 40, 80 and 120 topics, corresponding to the numbers of topics used in the LDA algorithms. Finally, for all LDA algorithms we used α = 0.1 and π_k = 1/K.

Performance was evaluated by comparing i) the in-sample (train) variational bound on the log-likelihood for all three variational methods, and ii) the out-of-sample (test) log-likelihood for all five methods. All inference algorithms were run on 90% of the words in each document, while test-set performance was evaluated on the remaining 10% of the words. Test-set log-likelihood was computed as follows for the variational methods:

p(x^{test}) = ∏_{ij} Σ_k θ̄_{jk} φ̄_{k x^{test}_{ij}}        θ̄_{jk} = (απ_k + E_q[n_{jk·}]) / (α + E_q[n_{j··}])        φ̄_{kw} = (βτ_w + E_q[n_{·kw}]) / (β + E_q[n_{·k·}])    (16)

Note that we used estimated mean values of θ_{jk} and φ_{kw} [13]. For CV-HDP we replaced all hyperparameters by their expectations. For the Gibbs sampling algorithms, given S samples from the posterior, we used:

p(x^{test}) = ∏_{ij} (1/S) Σ_{s=1}^S Σ_k θ̂^s_{jk} φ̂^s_{k x^{test}_{ij}}        θ̂^s_{jk} = (απ_k + n^s_{jk·}) / (α + n^s_{j··})        φ̂^s_{kw} = (βτ_w + n^s_{·kw}) / (β + n^s_{·k·})    (17)

with απ_k replaced by α^s π^s_k for G-HDP. We used all samples obtained by the Gibbs sampling algorithms after an initial burn-in period; each point in the predictive-probability plots below is obtained from the set of samples collected thus far.
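For concreteness, the evaluation in (16) can be sketched as follows. This is our own helper with an assumed data layout (a list of held-out word-id arrays, plus matrices of expected counts from the training 90%); it returns the per-token test log-likelihood of the kind reported in the figures.

```python
import numpy as np

def test_log_likelihood(test_docs, Eq_njk, Eq_nkw, alpha, pi, beta, tau):
    """Per-token held-out log-likelihood using the mean parameters of eq. (16).
    test_docs: list of integer arrays of held-out word ids, one per document j.
    Eq_njk:    D x K matrix of E_q[n_{jk.}] from the training portion.
    Eq_nkw:    K x W matrix of E_q[n_{.kw}] from the training portion."""
    theta_bar = (alpha * pi[None, :] + Eq_njk) / (alpha + Eq_njk.sum(1, keepdims=True))
    phi_bar = (beta * tau[None, :] + Eq_nkw) / (beta + Eq_nkw.sum(1, keepdims=True))
    total, n_tokens = 0.0, 0
    for j, words in enumerate(test_docs):
        # p(x_ij) = sum_k theta_bar[j, k] * phi_bar[k, x_ij]
        total += np.log(theta_bar[j] @ phi_bar[:, words]).sum()
        n_tokens += len(words)
    return total / n_tokens
```

For CV-HDP, the hyperparameters alpha, beta, pi and tau passed in would be replaced by their posterior expectations, as described above.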
The results, shown in Figure 2, display a remarkable improvement in accuracy of CV-HDP over CV-LDA, both in terms of the bound on the training log-likelihood and in terms of the test-set log-likelihood. This is caused by the fact that CV-HDP is learning the variational distributions over the hyperparameters. We note that we have not trained β or τ for any of these methods. In fact, initial results for CV-HDP with these adapted show no additional improvement in test-set log-likelihood, and in some cases even a deterioration of the results. A second observation is that convergence of all variational methods is faster than that of the sampling methods. Thirdly, we see significant local-optima effects in our simulations. For example, G-HDP_100 achieves the best results, better than G-HDP_1, indicating that pruning topics is a better way to escape local optima in these models than adding topics, and leads to better posterior modes.

In further experiments we have also found that the variational methods benefit from better initializations due to local optima. In Figure 3 we show results when the variational methods were initialized at the last state obtained by G-HDP_100. We see that the variational methods were indeed able to find significantly better local optima in the vicinity of the one found by G-HDP_100, and that CV-HDP is still consistently better than the other variational methods.

5 Discussion

In this paper we have explored collapsed variational inference for the HDP. As discussed in the introduction, we believe there are advantages to variational approximations that are not available to sampling methods. CV-HDP presents an improvement over CV-LDA [1] in two ways: 1) by inferring posterior distributions over the higher-level variables in the model we have shown a significant improvement in both test-set likelihood and the variational bound; 2) by taking the infinite topic limit in the generative model and truncating the variational posterior in a way that guarantees `nesting', we know that the variational bound on the marginal log-likelihood will reach its highest value (ignoring local-optima issues) as K → ∞. This fact should facilitate the search over the number of topics, e.g. by splitting and merging topics, an aspect that we have not yet fully explored and from which we expect to gain significantly given the observed local-optima issues.

We expect the proliferation of topic-style Dirichlet-multinomial models and their many exciting applications to continue. For some applications variational approximations may prove to be the most convenient tool for inference. We are convinced that the methods presented here are applicable to this general class of models, and we hope to provide general-purpose software to support inference in these models in the future.

Figure 2: Left column: KOS, middle column: Reuters, right column: 20Newsgroups. Top row: log p(x^{test}) as a function of K; middle row: log p(x^{test}) as a function of the number of steps (defined as the number of iterations multiplied by K); bottom row: variational bounds as a function of K. Log probabilities are on a per-word basis. Shown are averages and standard errors obtained by repeating the experiments 10 times with random restarts. Curves shown: G-HDP_100, G-HDP_1, G-LDA, CV-HDP, CV-LDA, V-LDA. The distributions over the number of topics found by G-HDP_1 are: KOS: K = 113.2 ± 11.4, Reuters: K = 60.4 ± 6.4, 20News: K = 83.5 ± 5.0. For G-HDP_100 we have: KOS: K = 168.3 ± 3.9, Reuters: K = 122.2 ± 5.0, 20News: K = 128.1 ± 6.6.

Figure 3: G-HDP_100-initialized variational methods (K = 130), compared against variational methods initialized in the usual manner with K = 130 as well. Results were averaged over 10 repeats. (Left panel: variational bound; right panel: log p(test)/N, both as functions of the number of steps. Curves shown: G-HDP_100; Gibbs-initialized CV-HDP, CV-LDA and V-LDA; and randomly initialized CV-HDP, CV-LDA and V-LDA.)
A Variational lower bound

The variational lower bound on the log-likelihood is:

E[log p(z, x | α, β, π, τ) − log q(z)] − KL[q(α)‖p(α)] − KL[q(β)‖p(β)] − Σ_{k=1}^K KL[q(π̃_k)‖p(π̃_k)] − KL[q(τ)‖p(τ)]    (18)

= Σ_d log Γ(E[α]) / Γ(E[α] + n_{d··})
  + Σ_{dk} F[ log Γ(G[α]G[π_k] + n_{dk·}) / Γ(G[α]G[π_k]) ]
  + Σ_k F[ log Γ(E[β]) / Γ(E[β] + n_{·k·}) ]
  + Σ_{kw} F[ log Γ(G[β]G[τ_w] + n_{·kw}) / Γ(G[β]G[τ_w]) ]
  − Σ_{dk} Σ_{i=1}^{n_d} q(z_{id} = k) log q(z_{id} = k)
  − log [ (b_α − Σ_d E[log η_d])^{a_α + E[s_{··}]} Γ(a_α) / ( b_α^{a_α} Γ(a_α + E[s_{··}]) G[α]^{E[s_{··}]} e^{E[α] Σ_d E[log η_d]} ) ]
  − log [ (b_β − Σ_k E[log ξ_k])^{a_β + E[t_{··}]} Γ(a_β) / ( b_β^{a_β} Γ(a_β + E[t_{··}]) G[β]^{E[t_{··}]} e^{E[β] Σ_k E[log ξ_k]} ) ]
  − Σ_k log [ Γ(1 + E[γ] + E[s_{·k}] + E[s_{·>k}]) / ( Γ(1 + E[s_{·k}]) Γ(E[γ] + E[s_{·>k}]) G[π̃_k]^{E[s_{·k}]} G[1 − π̃_k]^{E[s_{·>k}]} ) ]
  − log [ Γ(Wa_τ + E[t_{··}]) / Γ(Wa_τ) ∏_w Γ(a_τ) / ( Γ(a_τ + E[t_{·w}]) G[τ_w]^{E[t_{·w}]} ) ]

where F[f(n)] = P_+[n] ( f(E_+[n]) + ½ V_+[n] f″(E_+[n]) ) is the improved second-order approximation.

References

[1] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, volume 19, 2007.
[2] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer-Verlag, 1999.
[3] M. J. Beal and Z. Ghahramani. Variational Bayesian learning of directed graphical models with hidden variables. Bayesian Analysis, 1(4), 2006.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[6] T. P. Minka and J. Lafferty. Expectation propagation for the generative aspect model. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, volume 18, 2002.
[7] W. Buntine and A. Jakulin. Applying discrete PCA in data analysis. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, volume 20, 2004.
[8] D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–144, 2006.
[9] K. Kurihara, M. Welling, and N. Vlassis. Accelerated variational DP mixture models. In Advances in Neural Information Processing Systems, volume 19, 2007.
[10] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.
[11] T. L. Griffiths and M. Steyvers. A probabilistic approach to semantic representation. In Proceedings of the 24th Annual Conference of the Cognitive Science Society, 2002.
[12] C. E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2(6):1152–1174, 1974.
[13] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
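Supplementary note (ours, not from the paper): the F operator defined in Appendix A can be checked numerically against brute-force enumeration of a count built from Bernoulli indicators. The sketch below does this for a log-gamma-ratio term of the kind appearing in the bound; the concentration c and the responsibilities q are purely hypothetical numbers, and SciPy's gammaln/polygamma are assumed.

```python
import numpy as np
from scipy.special import gammaln, polygamma

def F_operator(f, f2, q):
    """Improved second-order approximation F[f(n)] from Appendix A, where n is a
    sum of independent Bernoulli(q_i) variables and f2 is the second derivative of f."""
    E, V, Z = q.sum(), (q * (1 - q)).sum(), np.log1p(-q).sum()
    P = 1.0 - np.exp(Z)                          # P_+[n]
    Ep = E / P                                   # E_+[n]
    Vp = V / P - np.exp(Z) * Ep**2               # V_+[n]
    return P * (f(Ep) + 0.5 * Vp * f2(Ep))

def exact_E_f(f, q):
    """Exact E[f(n)] (with f(0) = 0), by convolving the Bernoulli pmfs."""
    pmf = np.array([1.0])
    for qi in q:
        pmf = np.concatenate((pmf, [0.0])) * (1 - qi) + np.concatenate(([0.0], pmf)) * qi
    return sum(pmf[m] * f(m) for m in range(1, len(pmf)))

# Example term from the bound: f(n) = log Gamma(c + n) - log Gamma(c), so f'' = trigamma(c + n).
c = 0.7                                          # hypothetical concentration, e.g. G[alpha]G[pi_k]
f = lambda n: gammaln(c + n) - gammaln(c)
f2 = lambda n: polygamma(1, c + n)
q = np.array([0.3, 0.1, 0.6, 0.05])              # hypothetical q(z_id = k) values
print(F_operator(f, f2, q), exact_E_f(f, q))     # the two values should be close
```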