Hybrid Variational/Gibbs Collapsed Inference in Topic Models

Max Welling, Dept. of Computer Science, University of California Irvine, Irvine, CA, USA. welling@ics.uci.edu (On sabbatical at Radboud University Nijmegen, Department of Biophysics, The Netherlands.)
Yee Whye Teh, Gatsby Computational Neuroscience Unit, University College London, London, UK. ywteh@gatsby.ucl.ac.uk
Bert Kappen, Dept. of Biophysics, Radboud University Nijmegen, Nijmegen, The Netherlands. b.kappen@science.ru.nl

Abstract

Variational Bayesian inference and (collapsed) Gibbs sampling are the two important classes of inference algorithms for Bayesian networks. Both have their advantages and disadvantages: collapsed Gibbs sampling is unbiased but is inefficient for large count values and requires averaging over many samples to reduce variance; variational Bayesian inference, on the other hand, is efficient and accurate for large count values but suffers from bias for small counts. We propose a hybrid algorithm that combines the best of both worlds: it samples very small counts and applies variational updates to large counts. This hybridization is shown to significantly improve test-set perplexity relative to variational inference at no computational cost.

1 Introduction

Bayesian networks (BNs) represent an important modeling tool in the field of artificial intelligence and machine learning (Heckerman, 1999). In particular, the subclass of BNs known as "topic models" is receiving increasing attention due to its success in modeling text as a bag-of-words and images as a bag-of-features (Blei et al., 2003). Unlike most applications of Bayesian networks, we are interested in "Bayesian Bayesian networks", where we also treat the conditional probability tables (CPTs) as random variables (RVs). The key computational challenge for these models is inference, namely estimating the posterior distribution over both parameters and hidden variables, and ultimately estimating predictive probabilities and the marginal log-likelihood (or evidence).

It has been argued theoretically (Casella & Robert, 1996) and observed empirically in topic models (Griffiths & Steyvers, 2002; Buntine, 2002) that Gibbs sampling in a collapsed state space, where the CPTs have been marginalized out, leads to efficient inference. We expect this to hold more generally for discrete Bayesian networks. Variational Bayesian (VB) approximations have also been applied to topic models (Blei et al., 2003), but predictive probability results have consistently been inferior to collapsed Gibbs sampling (CGS). More recently, variational approximations have been extended to operate in the same collapsed state space as CGS (Teh et al., 2006; Teh et al., 2008). These collapsed variational Bayesian (CVB) inference algorithms improve upon VB but still lag behind CGS.

In this paper we propose a hybrid inference scheme that combines CGS with VB approximations. The idea is to split all data-cases into two sets: those that will be treated variationally and those that will be treated through sampling. The two approximations interact in a consistent manner in the sense that both the VB and CGS updates are derived from a single objective function. The advantage of the VB updates is that they scale better computationally. We show empirically that the hybrid algorithms achieve almost the same accuracy as CGS, because the variational updates are only applied where they are expected to work well. The algorithm can be seen to trade off bias against variance in a flexible and tunable manner.
2 Topic Models as Bayesian Networks

We will assume that all visible and hidden variables are discrete. However, we expect the results to hold more generally for models in the exponential family. We first develop the theory for latent Dirichlet allocation (LDA) (Blei et al., 2003) and later generalize the results to Bayesian networks. To facilitate the transition from LDA to BNs, we treat LDA in a slightly unconventional way by using a single index i that runs over all words in all documents (in contrast to the double index ij that is conventional for LDA); see Fig. 1. LDA is then equivalent to the Bayesian network d_i -> z_i -> x_i, where the nodes x_i = w (word-type) and d_i = j (document label) are observed and the topic variable z_i = k is hidden. In the following we will also use the indicator variables X_{iw} = I[x_i = w], D_{ij} = I[d_i = j] and Z_{ik} = I[z_i = k].

Figure 1: LDA depicted as a standard Bayesian network. The plate of size N runs over all word tokens; for each token both the word-type x and the document label d are observed, while the topic variable z is hidden.

The joint distribution is given as the product of the probability of each variable given its parents: p(x_i = w | z_i = k) = \phi_{wk}, p(z_i = k | d_i = j) = \theta_{kj} and p(d_i = j) = \gamma_j. The last term decouples from the rest of the model because d_i is always observed. For each of these CPTs we place a Dirichlet prior, \phi_{:,k} \sim D(\beta) and \theta_{:,j} \sim D(\alpha_:). Note that we have chosen a scalar strength parameter (i.e. a symmetric Dirichlet prior) for \phi, but vector-valued strength parameters \alpha_: = \{\alpha_k\} for \theta. Next we integrate out the CPTs. Since the Dirichlet priors are conjugate to the discrete conditional probabilities of the Bayesian network, this can be done analytically, resulting in the following expression,

p(\{z_i, x_i\} | \{d_i\}) \propto \prod_k \frac{\prod_w \Gamma(N_{wk} + \beta)}{\Gamma(N_k + W\beta)} \prod_j \prod_k \Gamma(N_{kj} + \alpha_k)    (1)

where W is the total number of distinct word types (the vocabulary size), and the counts are defined as follows:

N_{wkj} = \sum_i X_{iw} Z_{ik} D_{ij},    (2)

with N_{wk} = \sum_j N_{wkj}, N_{kj} = \sum_w N_{wkj} and N_k = \sum_{wj} N_{wkj}.

3 Standard Variational Bayes

In the variational Bayesian framework we approximate the posterior distribution with a factorized one,

p(\{z_i\}, \theta, \phi | \{x_i, d_i\}) \approx \prod_k Q(\phi_{:,k} | \bar{\beta}_{:,k}) \prod_j Q(\theta_{:,j} | \bar{\alpha}_{:,j}) \prod_i Q(z_i).    (3)

Solving for Q(\phi_{:,k}) and Q(\theta_{:,j}), we find that they are Dirichlet as well, with parameters

\bar{\beta}_{wk} = \beta + \bar{N}_{wk},  \qquad  \bar{\alpha}_{kj} = \alpha_k + \bar{N}_{kj}    (4)

where \bar{N}_{wk} = \sum_i Q_{ik} X_{iw}, \bar{N}_{kj} = \sum_i Q_{ik} D_{ij} and Q_{ik} = Q(z_i = k). After plugging these updates back into the update for Q(z_i) we find

Q_{ik} \propto \frac{\exp(\Psi(\bar{N}_{x_i k} + \beta))}{\exp(\Psi(\bar{N}_k + W\beta))} \exp(\Psi(\bar{N}_{k d_i} + \alpha_k))    (5)

where \Psi(x) = \partial_x \log \Gamma(x) is the derivative of the log-gamma function (the digamma function).

For future reference we now derive the same result in a different manner, to which we shall refer as "standard variational Bayes" (SVB) in the following. We first marginalize out the CPTs and consider a variational lower bound on the evidence with a variational distribution Q(\{z_i\}) (without assuming factorization). Lemma 1 in the appendix shows that the log-probability \log p(\{z_i, x_i\} | \{d_i\}) is a convex function of the counts \{N_{wkj}\}. This allows us to move the average over Q(\{z_i\}) inside the expression for the log-probability, at the cost of introducing a further lower bound on the evidence. The resulting expression is precisely the logarithm of Eqn. 1 with the counts N_{wkj} replaced by average counts \bar{N}_{wkj},

\bar{N}_{wkj} = E[N_{wkj}]_Q = \hat{N}_{wj} Q_{k|wj}    (6)

in terms of which we define

\bar{N}_{wk} = E[N_{wk}]_Q = \sum_j \hat{N}_{wj} Q_{k|wj}    (7)
\bar{N}_{kj} = E[N_{kj}]_Q = \sum_w \hat{N}_{wj} Q_{k|wj}    (8)
\bar{N}_{k} = E[N_{k}]_Q = \sum_{wj} \hat{N}_{wj} Q_{k|wj}    (9)

where \hat{N}_{wj} are the observed word-document counts. Note that the \bar{N}_{wkj} are subject to the observational constraints \sum_k \bar{N}_{wkj} = \hat{N}_{wj} given by the training corpus. To understand the definition of Q_{k|wj}, note that Q(z_i) is the same for all data-cases i with the same values x_i = w, d_i = j, so without loss of generality we can use a single set of parameters Q_{k|wj} = Q(z_i = k) for all data-cases i that share the same observed labels w, j.
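As a concrete illustration (not part of the original paper), the following Python sketch shows how the expected counts of Eqns. 6-9 can be assembled from the observed word-document counts \hat{N}_{wj} and the responsibilities Q_{k|wj}; the array names and shapes are hypothetical choices for this sketch.

import numpy as np

# N_hat[w, j] = observed count of word-type w in document j   (N^hat_wj)
# Q[w, j, k]  = variational responsibility Q_{k|wj}; sums to 1 over k.
def expected_counts(N_hat, Q):
    """Return the average counts of Eqns. 6-9 under Q({z_i})."""
    N_bar_wkj = N_hat[:, :, None] * Q        # Eq. 6: N_bar_wkj = N_hat_wj * Q_{k|wj}
    N_bar_wk = N_bar_wkj.sum(axis=1)         # Eq. 7: sum over documents j -> (W, K)
    N_bar_kj = N_bar_wkj.sum(axis=0).T       # Eq. 8: sum over word types w -> (K, J)
    N_bar_k = N_bar_wk.sum(axis=0)           # Eq. 9: sum over w and j -> (K,)
    return N_bar_wk, N_bar_kj, N_bar_k

Because the counts only depend on the unique (w, j) pairs, this bookkeeping already reflects the O(MK) scaling of SVB discussed below.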
The final step is to variationally bound this expression once more, using again the fact that the sum of log-gamma factors is a convex function,

F(\bar{N}_{wkj}) \geq F(\bar{N}^0_{wkj}) + \sum_{wkj} F'(\bar{N}^0_{wkj}) (\bar{N}_{wkj} - \bar{N}^0_{wkj})    (10)

where \bar{N}^0_{wkj} denotes the current value, which is held fixed. Recalling definition 6 and taking derivatives of the resulting bound (which also contains the entropy H(Q)) with respect to Q_{k|wj} gives the following update,

Q_{k|wj} \propto \frac{\exp(\Psi(\bar{N}_{wk} + \beta))}{\exp(\Psi(\bar{N}_{k} + W\beta))} \exp(\Psi(\bar{N}_{kj} + \alpha_k))    (11)

The factorization follows directly from the update and is a result of Jensen's inequality. We can alternatively arrive at Eqn. 11 without assuming the second bound of Eqn. 10 by assuming that Q(\{z_i\}) factorizes and setting its derivatives to zero. However, this does not guarantee convergence, since Q_{k|wj} then also appears on the right-hand side of Eqn. 11. Note that Eqn. 11 is equivalent to, but looks subtly different from, Eqn. 5 in that it avoids updating separate variational distributions for data-cases i with the same labels w, j. As a result it scales with the number of unique word-document pairs in the training corpus rather than with the total number of word tokens.

4 Collapsed Gibbs Sampling

An alternative to variational inference is collapsed Gibbs sampling, where one samples each z_i in turn given the values of the remaining assignments z_{\neg i}. The conditional probabilities are easily calculated from Eqn. 1,

p(z_i = k | z_{\neg i}, x, d) \propto \frac{N^{\neg i}_{wk} + \beta}{N^{\neg i}_{k} + W\beta} (N^{\neg i}_{kj} + \alpha_k)    (12)

where the superscript \neg i denotes that data-case i has been removed from the counts and we have assumed that x_i = w and d_i = j.

Given samples at equilibrium, one can obtain unbiased estimates of quantities of interest. The trade-off is that one needs to average over many samples to reduce the effects of sampling noise. Computationally, CGS scales as O(NK) in time and O(N) in space, where N is the total number of words in the corpus and K is the number of topics. This is in contrast to the SVB updates of the previous section, for which both time and space scale as O(MK), with M the number of unique word-document pairs. In the following section we derive a principled hybridization of SVB and CGS that can be viewed as a tunable trade-off between bias, variance and computational efficiency.
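For concreteness, here is a minimal sketch (not from the paper) of one collapsed Gibbs sweep implementing Eqn. 12; the function and array names are hypothetical, alpha is a length-K vector and beta a scalar, and the count arrays are assumed consistent with the current assignments z.

import numpy as np

def cgs_sweep(z, words, docs, N_wk, N_kj, N_k, alpha, beta, rng):
    """One collapsed Gibbs sweep over all tokens (Eq. 12).
    z[i]: current topic of token i; words[i], docs[i]: its word-type and document.
    N_wk (W,K), N_kj (K,J), N_k (K,): count arrays, updated in place.
    """
    W = N_wk.shape[0]
    K = N_k.shape[0]
    for i in range(len(z)):
        w, j, k_old = words[i], docs[i], z[i]
        # remove token i from the counts (the "not-i" counts)
        N_wk[w, k_old] -= 1; N_kj[k_old, j] -= 1; N_k[k_old] -= 1
        # conditional p(z_i = k | z_{not i}, x, d), up to normalization
        p = (N_wk[w] + beta) / (N_k + W * beta) * (N_kj[:, j] + alpha)
        k_new = rng.choice(K, p=p / p.sum())
        # add the token back with its new topic
        N_wk[w, k_new] += 1; N_kj[k_new, j] += 1; N_k[k_new] += 1
        z[i] = k_new
    return z

A typical caller would pass rng = np.random.default_rng() and run many such sweeps, averaging predictive quantities over post-burn-in samples.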
5 The Hybrid SVB/CGS Algorithm

The high-level justification for a hybrid algorithm is the intuition that the evidence only depends on the count arrays N_{wk}, N_{kj} and N_k, which are sums of assignment variables, N_{wkj} = \sum_i X_{iw} Z_{ik} D_{ij}, in which only Z is random. The Rao-Blackwellised estimate of the predictive distribution is also a function of these counts,

p(x = w | \{x_i, z_i, d_i\}, d = j) = \sum_k \frac{N_{wk} + \beta}{N_{k} + W\beta} \cdot \frac{N_{kj} + \alpha_k}{N_{j} + \sum_k \alpha_k}    (13)

The central limit theorem tells us that sums of random variables tend to concentrate and behave like a normal distribution under fairly general conditions. Moreover, the variance/covariance of the predictive distribution is expected to scale as 1/n, where n is the number of data-cases that contribute to the sum. We therefore expect variational approximations to work well for large counts. These insights make it natural to split the dataset into two subsets: a subset S^VB to which we apply the VB approximation, and its complement S^CG to which we apply collapsed Gibbs sampling. In practice we have chosen these sets to be

S^VB = \{ i \,|\, \hat{N}_{x_i, d_i} > r \},  \qquad  S^CG = \{ i \,|\, \hat{N}_{x_i, d_i} \leq r \}    (14)

In the experiments below we chose r = 1. Although we do not expect central limit tendencies to apply to counts smaller than about 10, we chose this extreme setting to convey an important conclusion, namely that the counts with value \hat{N}_{wj} = 1 already explain much of the difference between the VB and CGS algorithms.

We assume the following factorization for the variational posterior, Q = Q^VB Q^CG. Moreover, in the derivation below it will follow that Q^VB becomes factorized as well, i.e. Q^VB = \prod_i Q^VB_i. The evidence of the collapsed distribution under these assumptions reads

E = H(Q^VB) + H(Q^CG) + \sum_{z^VB, z^CG} Q(z^VB) Q(z^CG) \log P(z^VB, z^CG, x | d)    (15)

Analogous to Section 3, we apply Jensen's inequality in order to bring the average over Q^VB inside the convex function \log P(z^VB, z^CG, x | d), resulting in an expression which we shall denote as \log P(Q^VB, z^CG, x | d) for obvious reasons. The log-probability only depends on counts, so by bringing the average inside we find that \log P(Q^VB, z^CG, x | d) depends on the quantities

E[N_{wkj}]_{Q^VB} = \hat{N}^{VB}_{wj} Q^{VB}_{k|wj} + \sum_{i \in S^CG} Z_{ik} X_{iw} D_{ij}    (16)

where \hat{N}^{VB}_{wj} are the observed counts for word-type w and document j restricted to the set S^VB. In terms of this we further need

E[N_{wk}]_{Q^VB} = \sum_j \hat{N}^{VB}_{wj} Q^{VB}_{k|wj} + \sum_{i \in S^CG} Z_{ik} X_{iw}    (17)
E[N_{kj}]_{Q^VB} = \sum_w \hat{N}^{VB}_{wj} Q^{VB}_{k|wj} + \sum_{i \in S^CG} Z_{ik} D_{ij}    (18)
E[N_{k}]_{Q^VB} = \sum_{wj} \hat{N}^{VB}_{wj} Q^{VB}_{k|wj} + \sum_{i \in S^CG} Z_{ik}    (19)

These expressions elegantly split the counts into a part described by a non-random mean field plus a sum over the remaining random variables that represent the fluctuations. Thus, after applying Jensen's inequality, we end up with the following lower bound on the evidence,

B = H(Q^VB) + H(Q^CG) + \sum_{z^CG} Q(z^CG) \log P(Q^VB, z^CG, x | d) \leq E    (20)
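The count split of Eqns. 16-19 can be made concrete with the following sketch (again not from the paper; array names are hypothetical): a deterministic mean-field contribution from S^VB plus the contribution of the currently sampled assignments of the tokens in S^CG.

import numpy as np

def hybrid_counts(N_hat_vb, Q_vb, z_cg, words_cg, docs_cg):
    """Expected counts of Eqns. 17-19: mean-field part plus sampled fluctuations.
    N_hat_vb[w, j]: observed counts restricted to S^VB   (N^hat^VB_wj)
    Q_vb[w, j, k]:  variational responsibilities Q^VB_{k|wj}
    z_cg[i], words_cg[i], docs_cg[i]: current topic / word / doc of tokens in S^CG
    """
    # deterministic mean-field contribution from S^VB
    N_bar_wkj = N_hat_vb[:, :, None] * Q_vb
    N_bar_wk = N_bar_wkj.sum(axis=1)                 # (W, K)
    N_bar_kj = N_bar_wkj.sum(axis=0).T               # (K, J)
    # stochastic contribution from the sampled assignments in S^CG
    np.add.at(N_bar_wk, (words_cg, z_cg), 1.0)
    np.add.at(N_bar_kj, (z_cg, docs_cg), 1.0)
    N_bar_k = N_bar_wk.sum(axis=0)                   # (K,)
    return N_bar_wk, N_bar_kj, N_bar_k

These smoothed counts play the role of \bar{N}^s in the updates that follow.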
Now assume we have drawn a sample from Q^CG, which we denote by z^CG_s, and let \bar{N}^s_{wkj} denote the value of E[N_{wkj}]_{Q^VB} in Eqn. 16 evaluated at z^CG_s. Given this sample we bound again through linearization (see Eqn. 10), which results in the following update for Q^VB,

Q^{VB}_{k|wj} \propto \frac{\exp(\Psi(\bar{N}^s_{wk} + \beta))}{\exp(\Psi(\bar{N}^s_{k} + W\beta))} \exp(\Psi(\bar{N}^s_{kj} + \alpha_k))    (21)

If we decide to use more than one sample, we replace the expressions \Psi(\cdot) by their sample averages \langle \Psi(\cdot) \rangle, where the brackets \langle \cdot \rangle denote averaging over the drawn samples.

The update for Q^CG is sample-based. We first compute the variational update for Q^CG,

Q^CG \propto P(Q^VB, z^CG, x | d)    (22)

and subsequently draw samples from it. On closer inspection we see that this distribution is identical to the collapsed distribution p(x, z | d), but over fewer data-cases, namely those in the set S^CG, and with new effective hyperparameters given by

\alpha'_{kj} = \alpha_k + \sum_w \hat{N}^{VB}_{wj} Q^{VB}_{k|wj}    (23)
\beta'_{wk} = \beta + \sum_j \hat{N}^{VB}_{wj} Q^{VB}_{k|wj}    (24)

Hence, we can apply standard collapsed Gibbs sampling to draw from Q^CG,

p(z^{CG}_i = k | z^{CG}_{\neg i}, x, d) \propto \frac{\bar{N}^{\neg i, s}_{wk} + \beta}{\bar{N}^{\neg i, s}_{k} + W\beta} (\bar{N}^{\neg i, s}_{kj} + \alpha_k)    (25)

These updates converge in expectation and stochastically maximize the bound on the evidence. In theory one should draw infinitely many samples from Q^CG to guarantee convergence. In practice, however, we have obtained very good results by drawing only a single sample before proceeding to the VB update. It is also possible to infer the hyperparameters \{\alpha_k, \beta\}, either by sampling (Teh et al., 2004) or by maximization (Minka, 2000); we refer to those papers for further details.

5.1 Extension to Collapsed Variational LDA

In (Teh et al., 2006) an improved variational approximation was proposed that operates in the same collapsed space as collapsed Gibbs sampling. This algorithm uses a factorized distribution Q^CVB = \prod_i Q(z_i), but does not move it inside the log-probability as we did for SVB. Instead, it evaluates the necessary averages in the updates by assuming that the counts are approximately normally distributed and using a second-order Taylor expansion around the mean. It was shown that including this second-order information improves the standard VB approximation considerably. This algorithm is also straightforwardly hybridized with collapsed Gibbs sampling by simply replacing the SVB updates with CVB updates in the hybrid SVB/CGS algorithm, using the same definitions for the counts as in Eqn. 16. We call this algorithm CVB/CGS.
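Putting the pieces together, the following sketch (a minimal illustration under the same hypothetical array names as above, not the authors' implementation) shows one hybrid iteration: a Gibbs sweep over S^CG using the smoothed counts as in Eqn. 25, followed by the SVB update of Eqn. 21 for the pairs in S^VB.

import numpy as np
from scipy.special import digamma

def hybrid_iteration(z_cg, words_cg, docs_cg, N_bar_wk, N_bar_kj, N_bar_k,
                     alpha, beta, rng):
    """One hybrid sweep: Gibbs over S^CG (Eq. 25), then the SVB update (Eq. 21).
    N_bar_* are the smoothed counts of Eqns. 17-19 built from the current Q^VB
    (e.g. via hybrid_counts above); they are updated in place for sampled tokens.
    """
    W, K = N_bar_wk.shape
    # collapsed Gibbs sampling over the data-cases in S^CG
    for i in range(len(z_cg)):
        w, j, k_old = words_cg[i], docs_cg[i], z_cg[i]
        N_bar_wk[w, k_old] -= 1; N_bar_kj[k_old, j] -= 1; N_bar_k[k_old] -= 1
        p = (N_bar_wk[w] + beta) / (N_bar_k + W * beta) * (N_bar_kj[:, j] + alpha)
        k_new = rng.choice(K, p=p / p.sum())
        N_bar_wk[w, k_new] += 1; N_bar_kj[k_new, j] += 1; N_bar_k[k_new] += 1
        z_cg[i] = k_new
    # SVB update for all (w, j) pairs in S^VB (Eq. 21), computed in one batch:
    # Q^VB_{k|wj} propto exp(psi(N^s_wk+beta)) exp(psi(N^s_kj+alpha_k)) / exp(psi(N^s_k+W beta))
    log_q = (digamma(N_bar_wk + beta)[:, None, :]
             + digamma(N_bar_kj + alpha[:, None]).T[None, :, :]
             - digamma(N_bar_k + W * beta)[None, None, :])
    Q_new = np.exp(log_q - log_q.max(axis=2, keepdims=True))
    Q_new /= Q_new.sum(axis=2, keepdims=True)
    return Q_new, z_cg

After the SVB update, the caller would rebuild the smoothed counts from Q_new and the current z_cg before the next iteration; with S^CG chosen as the singleton counts, the Gibbs loop touches only a small fraction of the tokens.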
6 Extension to Bayesian Networks

The extension to collapsed Bayesian networks is relatively straightforward. First, let us generalize SVB to Bayes nets. A derivation that includes variational distributions for the CPTs can be found in (Beal, 2003); we follow an alternative derivation that facilitates the transition to the hybrid SVB/CGS algorithm. We first collapse the state space by marginalizing out the CPTs. This results in an expression for the evidence that consists of a product of factors, where each factor is a ratio of gamma functions and the factors follow the structure of the original Bayes net. For instance, consider a hidden variable z that can assume state values k, with two parents u, v taking values l, m respectively; see Fig. 2. The factor associated with this family in the joint collapsed probability distribution of the BN is given by

F(\{z_i, u_i, v_i\}) = \prod_{lm} \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \cdot \frac{\prod_k \Gamma(N_{klm} + \alpha_k)}{\Gamma(N_{lm} + \sum_k \alpha_k)}    (26)

Figure 2: A family inside a BN consisting of two parents and one child.

The complete joint (collapsed) probability distribution is given by a product of such factors, one for each family in the BN. This implies that the collapsed distribution inherits the graphical structure of the original BN, and in particular its treewidth, implying that collapsed inference using SVB is equally expensive as ordinary inference (for given CPTs) in the original BN.

Next, we move the average over the variational distribution Q(\{z_i\}) inside the gamma factors. This again produces a bound on the evidence. We then linearize the log-gamma factors around the current value of Q. Ignoring constant factors, this results in the following terms for the family above,

G(Q, Q^0) = \text{terms for other families} + \sum_{klm} [\Psi(\bar{N}^0_{klm} + \alpha_k) - \Psi(\bar{N}^0_{lm} + \sum_k \alpha_k)] \bar{N}_{klm}    (27)

We recall again that \bar{N}_{klm} = \sum_{w..} \hat{N}_{w..} Q_{klm|w..}, where w.. represents all observed labels and Q_{klm|w..} is the posterior marginal distribution over the variables z, u, v given the observations. All of these factors are local in the cliques of the original Bayes net. Hence, the update for Q becomes proportional to a product of local factors of the form

F_{klm} = \frac{\exp(\Psi(\bar{N}_{klm} + \alpha_k))}{\exp(\Psi(\bar{N}_{lm} + \sum_k \alpha_k))}    (28)

As a result we can recompute new values for the local posterior marginals by running belief propagation over the junction tree associated with the original BN. Since Q_{klm|w..} depends on w.., this has to be done for every combination of observed labels that occurs at least once in the data (i.e. \hat{N}_{w..} > 0). It is interesting that the algorithm can thus be interpreted as an iterative belief propagation algorithm on a temporary graphical model whose potentials change from one iteration to the next. It bears a strong resemblance to iterative proportional fitting, in which scaling updates that enforce the constraints alternate with message passing. In this interpretation, one could view the relations \sum_{klm} \bar{N}_{klm|w..} = \hat{N}_{w..} (which are equivalent to normalization of the Q_{klm|w..}) as the constraints.

An alternative to VB is collapsed Gibbs sampling. Here one updates all hidden variables for a single data-case at a time. One first removes the data-case from the pool and computes the expression for p(z_i | z_{\neg i}, x, d) over all hidden variables z in the Bayes net. This expression also factorizes according to the structure of the BN, but the factors are now given by

F_{klm} = \frac{N^{\neg i}_{klm} + \alpha_k}{N^{\neg i}_{lm} + \sum_k \alpha_k}    (29)

One can draw samples by starting at the leaves of the junction tree and computing distributions for the current node conditioned on upstream variables while marginalizing over all downstream variables. Given these variables, we can then run an ancestral sampling pass outwards, back to the leaf nodes. This algorithm is an extension of the forward-filtering-backward-sampling (FFBS) algorithm proposed in (Scott, 2002) for HMMs.

The derivation of the hybrid algorithm goes along the same lines as for LDA. We split all data-cases into two subsets, S^VB and S^CG. We use Jensen's inequality to move the variational distribution Q^VB inside the log-gamma functions, and we derive updates for Q^VB through the linearization trick. The final algorithm thus rotates over all data-cases, running either BP on the associated junction tree if the data-case is in S^VB (updating the Q^VB for all families of the BN) or the FFBS algorithm if the data-case is in S^CG. The expressions for the counts are always the analogs of those in Eqn. 16.
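To illustrate the FFBS idea on the simplest case, the sketch below gives forward-filtering-backward-sampling for a discrete chain (an HMM-like junction tree). This is an illustrative simplification, not the general junction-tree sampler described above; all names (pi, A, phi) are hypothetical inputs standing in for the local factors of the collapsed model.

import numpy as np

def ffbs_chain(pi, A, phi, rng):
    """Forward-filtering-backward-sampling for a discrete chain.
    pi : (K,)   initial state distribution
    A  : (K, K) transition matrix, A[k, k'] = p(z_{t+1}=k' | z_t=k)
    phi: (T, K) local evidence for each time step (e.g. emission likelihoods)
    Returns one joint sample z_0..z_{T-1} from the posterior over the chain.
    """
    T, K = phi.shape
    alpha = np.zeros((T, K))
    alpha[0] = pi * phi[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                      # forward filtering pass
        alpha[t] = phi[t] * (alpha[t - 1] @ A)
        alpha[t] /= alpha[t].sum()
    z = np.zeros(T, dtype=int)
    z[T - 1] = rng.choice(K, p=alpha[T - 1])   # sample the last state
    for t in range(T - 2, -1, -1):             # backward sampling pass
        p = alpha[t] * A[:, z[t + 1]]
        z[t] = rng.choice(K, p=p / p.sum())
    return z

In the hybrid BN algorithm the same two-pass structure runs over the junction tree of each data-case in S^CG, with the factors of Eqn. 29 playing the role of the potentials.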
7 Experiments

We report results on two datasets: 1) "KOS", harvested from the left-leaning blog site "www.dailykos.com" [1], and 2) "NIPS", a corpus of 2,484 scientific papers from the proceedings of NIPS [2]. KOS has J = 3430 documents, a vocabulary size of W = 6909, a total of N = 467714 words, and M = 360664 unique word-document pairs. NIPS has J = 1740, W = 12419, N = 2166029 and M = 836644. We used K = 10 topics for KOS and K = 40 for NIPS, and set \alpha_k = \beta = 0.1 for both datasets. In all sets of experiments we withheld a random 10% of the words in the corpus for testing, while training on the remaining 90%.

[1] Downloadable from http://yarra.ics.uci.edu/kos/. Thanks to Dave Newman for pointing us to this site.
[2] Originally from http://books.nips.cc and preprocessed by Sam Roweis and Dave Newman.

We compared a variety of algorithms: collapsed Gibbs sampling (CGS), standard VB (SVB), collapsed VB (CVB), as well as two hybrid algorithms: a hybrid of standard VB and CGS (SVB/CGS) as described in Section 5, and a hybrid of collapsed VB and CGS (CVB/CGS) as described in Section 5.1. For both hybrid algorithms we only sampled data-cases for which \hat{N}_{wj} = 1. All algorithms were trained for 300 iterations, testing after every iteration. In the figure captions we report the number of runs over which we averaged the results. The algorithms were evaluated using the standard measure of individual word perplexity on the withheld test set. For the pure variational algorithms this is based on

p(\{x^{test}_i\} | \{d^{test}_i\}) = \prod_i \sum_k \frac{\alpha_k + \bar{N}_{k d^{test}_i}}{\sum_k \alpha_k + \bar{N}_{d^{test}_i}} \cdot \frac{\beta + \bar{N}_{x^{test}_i k}}{W\beta + \bar{N}_k}    (30)

For CGS and the hybrid algorithms, we perform an online average over the samples after every iteration of sampling, discarding an initial burn-in phase of 10 iterations,

p(\{x^{test}_i\} | \{d^{test}_i\}) = \prod_i \frac{1}{S} \sum_{s=1}^{S} \sum_k \frac{\alpha_k + \bar{N}^s_{k d^{test}_i}}{\sum_k \alpha_k + \bar{N}^s_{d^{test}_i}} \cdot \frac{\beta + \bar{N}^s_{x^{test}_i k}}{W\beta + \bar{N}^s_k}    (31)
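A minimal sketch of this evaluation (not from the paper, with hypothetical names), using the usual definition of per-word perplexity as the exponent of the negative average test log-likelihood and the single-sample form of Eqn. 30:

import numpy as np

def test_perplexity(test_words, test_docs, N_bar_wk, N_bar_kj, alpha, beta):
    """Per-word test perplexity from smoothed training counts (cf. Eq. 30)."""
    W, K = N_bar_wk.shape
    N_bar_k = N_bar_wk.sum(axis=0)                           # (K,)
    N_bar_j = N_bar_kj.sum(axis=0)                           # (J,)
    # predictive CPT estimates: theta_hat[k, j] and phi_hat[w, k]
    theta_hat = (alpha[:, None] + N_bar_kj) / (alpha.sum() + N_bar_j)[None, :]
    phi_hat = (beta + N_bar_wk) / (W * beta + N_bar_k)[None, :]
    # log p(x_i = w | d_i = j) = log sum_k phi_hat[w, k] * theta_hat[k, j]
    log_p = np.log((phi_hat[test_words] * theta_hat[:, test_docs].T).sum(axis=1))
    return np.exp(-log_p.mean())

For the sampling-based algorithms the same quantity would be averaged online over the per-iteration counts, as in Eqn. 31, before taking the exponent.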
The results for KOS and NIPS are shown in Figures 3, 4, 5 and 6. The variational algorithms converged faster than CGS or the hybrid algorithms, but converged to suboptimal points. Collapsed algorithms performed better than their standard counterparts. The hybrid algorithms significantly improved upon the corresponding pure variational algorithms, with the performance of CVB/CGS essentially on par with CGS.

Figure 3: Perplexities of the algorithms as a function of the number of iterations for KOS. Results averaged over 20 runs. The lines for CVB/CGS and CGS are on top of each other.
Figure 4: Final perplexities of the algorithms at iteration 300 for KOS. Results averaged over 20 runs.
Figure 5: Perplexities of the algorithms as a function of the number of iterations for NIPS. Results averaged over 17 runs.
Figure 6: Final perplexities of the algorithms at iteration 300 for NIPS. Results averaged over 17 runs.

To study how these results depend on the vocabulary size, we ordered all word-types according to their total number of occurrences and retained only the top 3000 most frequent words for KOS and the top 6000 most frequent words for NIPS. The results are shown in Figures 7 and 8; similar results were obtained with reduced vocabulary sizes of 4000 for KOS and of 4000 and 8000 for NIPS. We conclude that for both datasets the perplexities drop significantly, implying that prediction has become easier, but that the relative performance of the hybrid algorithms does not change significantly.

Figure 7: Final perplexities of the algorithms at iteration 300 for KOS with a reduced vocabulary size of 3000 word types. Results averaged over 14 runs.
Figure 8: Final perplexities of the algorithms at iteration 300 for NIPS with a reduced vocabulary size of 6000 word types. Results averaged over 4 runs.

To understand how much the algorithms learn from the singleton counts, \hat{N}_{wj} = 1, we first removed them from the training set but not from the test set, and subsequently removed them from both training and test sets. The results for KOS are shown in Figures 9 and 10 (for the full vocabulary). In this case the difference between the hybrid and VB algorithms is only due to the fact that we still perform an online average for the hybrid algorithms, even though no data-cases are being sampled. We observe that the impact on the absolute values of the perplexities is very large indeed. The results for NIPS were qualitatively similar but much more moderate, because KOS contains many more very small counts than NIPS. We thus conclude that singleton counts play a very important, and sometimes even dominant, role for text data in terms of perplexity. Whether this conclusion holds true for actual applications of LDA, such as indexing a corpus or retrieving documents similar to a test document, remains to be seen.

Figure 9: Final perplexities of the algorithms at iteration 300 for KOS with all singleton counts removed from the training set only. Results averaged over 20 runs.
Figure 10: Final perplexities of the algorithms at iteration 300 for KOS with all singleton counts removed from both training and test sets. Results averaged over 12 runs.

For text data, where small counts are so dominant, the VB algorithms are not significantly faster than CGS. This is mainly due to the fact that the VB algorithms must compute \exp(\Psi(\cdot)), which is more expensive than the simple count ratios needed for CGS. (In fact, one could probably speed up this computation by noticing that \exp(\Psi(x)) is almost linear for arguments larger than about 5, but we have not pursued this further.) For datasets with relatively more large counts we expect the VB and hybrid algorithms to be significantly faster than CGS, because they require only a single update per nonzero word-document entry rather than \hat{N}_{wj} CGS updates for that entry.
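The near-linearity mentioned in the parenthetical remark above can be checked numerically (this check is not part of the paper) against the standard large-argument approximation \exp(\Psi(x)) \approx x - 0.5:

import numpy as np
from scipy.special import digamma

# exp(psi(x)) approaches x - 0.5 as x grows, so for counts above roughly 5
# the exponentiated digamma could be replaced by a cheap linear expression.
x = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 100.0])
exact = np.exp(digamma(x))
approx = x - 0.5
for xi, e, a in zip(x, exact, approx):
    print(f"x = {xi:6.1f}   exp(psi(x)) = {e:9.4f}   x - 0.5 = {a:9.4f}")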
8 Discussion and Related Work

In the context of topic models and Bayesian networks, we have presented a novel hybrid inference algorithm that combines variational inference with Gibbs sampling in a collapsed state space where the parameters have been marginalized out. We split the data-cases into two sets, S^CG and S^VB; data-cases in S^CG are handled with Gibbs sampling while data-cases in S^VB are handled with variational updates. These updates interact in a consistent manner in that they stochastically optimize a single bound on the evidence. In this paper we have restricted attention to discrete models, but extensions to the exponential family seem feasible.

The algorithm has the same flavor as stochastic EM algorithms, where the E-step is replaced with a sample from the posterior distribution. Similarly to SEM (Celeux & Diebolt, 1985), the sequence of updates for Q^VB_t and z_t forms a Markov chain with a unique equilibrium distribution, so convergence is guaranteed. How different this approximate equilibrium distribution is from the true equilibrium distribution remains to be studied.

The algorithm is also reminiscent of cutset sampling (Bidyuk & Dechter, 2007), where a subset of the nodes of a Bayesian network is sampled while the remainder is handled using belief propagation. This suggests an interesting extension of the proposed algorithm in which one specifies a division of both data-cases and nodes into subsets S^CG and S^VB. The nodes in S^VB should form a forest of low-treewidth junction trees for the algorithm to remain tractable; alternatively, if the treewidth is too large, one can use loopy belief propagation on the set S^VB. Our current choice of S^CG and S^VB was quite naive and motivated by our interest in understanding why variational algorithms perform poorly for LDA. It seems preferable to develop more principled, and perhaps adaptive (online), methods to make this division.

Still other hybridizations of variational and MCMC inference exist (de Freitas et al., 2001; Carbonetto & de Freitas, 2006), where variational distributions are used to guide and improve sampling. However, these algorithms solve a different problem and do not operate in collapsed state spaces. To conclude, we believe that hybrid algorithms combining the two major classes of inference schemes, variational and sampling, are a fruitful road towards trading off bias, variance, computational efficiency and statistical accuracy.

Acknowledgements

Max Welling was supported by NSF grants IIS-0535278 and IIS-0447903 and by an NWO (Dutch Science Foundation) "Bezoekersbeurs". The research reported here is part of the Interactive Collaborative Information Systems (ICIS) project, supported by the Dutch Ministry of Economic Affairs, grant BSIK03024.

References

Beal, M. (2003). Variational algorithms for approximate Bayesian inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, London, UK.

Bidyuk, B., & Dechter, R. (2007). Cutset sampling for Bayesian networks. Journal of Artificial Intelligence Research, 28, 1-48.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Buntine, W. (2002). Variational extensions to EM and multinomial PCA. Proceedings of ECML 2002, vol. 2430 of Lecture Notes in Computer Science. Helsinki, Finland: Springer.

Carbonetto, P., & de Freitas, N. (2006). Conditional mean field. Advances in Neural Information Processing Systems (NIPS).

Casella, G., & Robert, C. (1996). Rao-Blackwellisation of sampling schemes. Biometrika, 83(1), 81-94.

Celeux, G., & Diebolt, J. (1985). The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2, 73-82.

de Freitas, N., Hojen-Sorensen, P. A. d. F. R., & Russell, S. J. (2001). Variational MCMC. Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI), 120-127.

Griffiths, T., & Steyvers, M. (2002). A probabilistic approach to semantic representation. Proceedings of the 24th Annual Conference of the Cognitive Science Society.

Heckerman, D. (1999). A tutorial on learning with Bayesian networks. In M. I. Jordan (Ed.), Learning in Graphical Models, 301-354.

Minka, T. (2000). Estimating a Dirichlet distribution. Technical report.

Scott, S. L. (2002). Bayesian methods for hidden Markov models: recursive computing in the 21st century. Journal of the American Statistical Association, 97, 337-351.

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2004). Hierarchical Dirichlet processes. Advances in Neural Information Processing Systems (NIPS).

Teh, Y. W., Newman, D., & Welling, M. (2006). A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. Advances in Neural Information Processing Systems (NIPS).

Teh, Y. W., Kurihara, K., & Welling, M. (2008). Collapsed variational inference for HDP. Advances in Neural Information Processing Systems (NIPS).

A Lemma 1

Lemma 1: The following function is convex as a function of x,

z(x) = \sum_k \log \Gamma(x_k) - \log \Gamma(\sum_k x_k)    (32)

Proof of Lemma 1: We can write z(x) as a log-partition function as follows,

Z(x) = \exp(z(x)) = \int dp \; \delta(\sum_k p_k - 1) \prod_k p_k^{x_k - 1} = \frac{\prod_k \Gamma(x_k)}{\Gamma(\sum_k x_k)}    (33)

It follows that the Hessian of z(x) is equal to the covariance of \{\log p_k\} under the corresponding Dirichlet distribution and hence positive definite.
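As a numerical sanity check on Lemma 1 (not part of the paper), one can use the closed form of the Hessian, \partial^2 z / \partial x_j \partial x_k = \delta_{jk} \Psi'(x_j) - \Psi'(\sum_k x_k) with \Psi' the trigamma function, which is also the covariance of \{\log p_k\} under a Dirichlet(x) distribution, and verify that it is positive definite for a test point:

import numpy as np
from scipy.special import polygamma

# Hessian of z(x) = sum_k log Gamma(x_k) - log Gamma(sum_k x_k):
# H[j, k] = delta_jk * trigamma(x_j) - trigamma(sum(x)).
x = np.array([0.7, 1.3, 2.5, 4.0])
H = np.diag(polygamma(1, x)) - polygamma(1, x.sum())
eigvals = np.linalg.eigvalsh(H)
print("eigenvalues:", eigvals)   # all positive => z is convex at this point
assert np.all(eigvals > 0)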