The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling

Sinead Williamson, saw56@cam.ac.uk. Department of Engineering, University of Cambridge, Trumpington Street, Cambridge, UK.
Chong Wang, chongw@cs.princeton.edu. Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ 08540, USA.
Katherine A. Heller, heller@gatsby.ucl.ac.uk. Department of Engineering, University of Cambridge, Trumpington Street, Cambridge, UK.
David M. Blei, blei@cs.princeton.edu. Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ 08540, USA.

Abstract

The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric mixed membership model in which each data point exhibits a collection of shared components in its own proportions. Though powerful, the HDP assumes that the probability of a component being exhibited by a data point is positively correlated with its proportion within that data point. This assumption can be undesirable: in topic modeling, for example, a topic (component) may be rare across the corpus yet dominant within the documents (data points) where it occurs. We develop the IBP compound Dirichlet process (ICD), a Bayesian nonparametric prior that decouples across-data prevalence and within-data proportion in a mixed membership model. The ICD combines properties from the HDP and the Indian buffet process (IBP), a Bayesian nonparametric prior on binary matrices. The ICD assigns a subset of the shared mixture components to each data point. This subset, the data point's "focus", is determined independently of how much each of its components contributes. We develop an ICD mixture model for text, the focused topic model (FTM), and show superior performance over the HDP-based topic model.

1. Introduction

Finite mixture models are widely used for clustering data (McLachlan & Peel, 2000). When fit to data, the components of a mixture model reflect similarity patterns, and each data point is probabilistically assigned to one of the components. When data are grouped, i.e., each data point is itself a collection of observations, mixed membership models are appropriate. Mixed membership models are a hierarchical variant of finite mixtures for grouped data in which each data point exhibits multiple components. The components are shared across all data, and each data point exhibits them in different proportions. Mixed membership models are an effective tool for capturing complex data heterogeneity (Erosheva et al., 2004).

Mixed membership models, like finite mixtures, require an a priori choice of the number of components. To address this issue, Teh et al. (2006) developed the hierarchical Dirichlet process (HDP), a Bayesian nonparametric mixed membership model. The HDP allows for a potentially infinite number of components a priori so that, when conditioned on data, its posterior places a distribution over how many are exhibited. The HDP thus provides more flexible mixed membership modeling, avoiding costly model comparisons to determine an appropriate number of components. However, the HDP makes a hidden assumption: a component's overall prevalence across the data is positively correlated with the component's average proportion within each data point. The reason is that the HDP centers the random component proportions for each data point around the same global proportions. This assumption may not be sensible.
Consider modeling a text corpus with an HDP. This is called "topic modeling" because the posterior components (called "topics") tend to reflect the semantic themes of the documents (Blei et al., 2003). The HDP assumption is that a frequent topic will, on average, occur frequently within each document. However, there is no reason to correlate the number of articles on a topic, such as baseball, with how much that topic contributes to any particular article. Baseball may be a rare topic, but articles about baseball often devote themselves exclusively to it.

In this paper, we build a Bayesian nonparametric mixed membership model that relaxes this assumption. Our model decorrelates prevalence and proportion, allowing rarely seen components to occur with high proportion and frequently seen components to occur with low proportion. We develop the IBP compound Dirichlet process (ICD), a Bayesian nonparametric mixed membership model that decorrelates across-data prevalence and within-data proportion. The ICD uses a random binary matrix drawn from the Indian buffet process (IBP, Griffiths & Ghahramani, 2005) to select which components are used in each data point (across-data prevalence), and an infinite series of gamma random variables to model how much they are used (within-data proportions). We use the ICD in the focused topic model (FTM), a generative model of document collections.

The central challenge in using the ICD is posterior inference. Sampling the IBP-distributed binary matrix directly leads to slow convergence, but integrating it out exactly is intractable due to the infinite combinatorial space of latent variables. We present an approximation to this integral, based on a technique used in Wang & Blei (2009), and use it to develop an efficient collapsed Gibbs sampler.

We compare the FTM to the HDP topic model on three text corpora. We see that the FTM reduces the correlation between across-data prevalence and within-data proportion, which allows for a more compact representation of the data than the HDP provides. As a consequence, the FTM obtains a better fit to language and achieves substantially better perplexity on held-out data.
2. IBP Compound Dirichlet Process

In this section we present the IBP compound Dirichlet process (ICD), a Bayesian nonparametric prior which addresses the limitations of the hierarchical Dirichlet process (HDP). We develop the focused topic model (FTM), an application of the ICD to document analysis. We assume the reader is familiar with the Dirichlet process (DP); for a review, see Ghosal (2010).

2.1. Hierarchical Dirichlet Processes

The hierarchical Dirichlet process (HDP, Teh et al., 2006) is a prior appropriate for Bayesian nonparametric mixed membership modeling. In an HDP, each data point is associated with a draw from a Dirichlet process (DP), which determines how much each member of a shared set of mixture components contributes to that data point. The base measure of this data-level DP is itself drawn from a DP, which ensures that there is a single discrete set of components shared across the data. More precisely, the generative process for the per-data distribution G_m is

  G_0 ∼ DP(γ, H),   G_m ∼ DP(α, G_0) for each m.

Each distribution G_m is a sample from a DP with concentration parameter α and base probability measure G_0. This base measure G_0 is itself a draw from a DP, with concentration parameter γ and base measure H. The base measure G_0 is thus discrete and, consequently, the per-data distributions G_m are also discrete with common support determined by the locations of the atoms of G_0. Each atom represents a component and is described by a location, a weight in G_m, and a weight in G_0. The location is identical in both G_0 and G_m; it gives the parameters associated with the component, e.g., a Gaussian mean or a distribution over terms. The weight in G_m gives the proportion for that component in the mth data point.

The weight of a component in G_m is drawn from a distribution centered around the corresponding weight in G_0. Thus, the weight for any given component is drawn from the same distribution across all the data, and that distribution controls both how prevalent the component is and its proportion within each data point. For example, if a component has low weight in G_0 then it will also have low weight in most G_m: that component is unlikely to contribute to data points and, when it does, its contribution will be very small. As mentioned in the introduction, this is not necessarily a desirable modeling assumption. Rather than control these two properties via a single variable, as in the HDP, we wish to model them separately. We develop a model where an infrequently occurring component can still have high proportion when it does occur, and vice versa.

2.2. Indian Buffet Process

Our model uses the Indian buffet process (IBP, Griffiths & Ghahramani, 2005) to control component occurrence separately from component proportion. The IBP defines a distribution over binary matrices with an infinite number of columns, only a finite number of which contain non-zero entries. It can be derived by taking the limit as K → ∞ of a distribution over finite M × K binary matrices B, with elements b_mk distributed according to

  π_k ∼ Beta(α/K, 1),   b_mk ∼ Bernoulli(π_k) for each m,

where the mth row of B is b_m, the kth cell of b_m is b_mk, and π_k is the probability of observing a non-zero value in column k. As K tends to infinity, we can obtain a strictly decreasing ordering of the latent probabilities π_k by starting with a "stick" of unit length and recursively breaking it at a point μ_k ∼ Beta(α, 1) along its length, discarding the excess (Teh et al., 2007). For k = 1, 2, . . . :

  μ_k ∼ Beta(α, 1),   π_k = ∏_{j=1}^{k} μ_j,   b_mk ∼ Bernoulli(π_k) for each m.   (1)

In our model, the rows of the IBP matrix represent data points, the columns represent components, and the cells indicate which components contribute to which data points.
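As a concrete illustration of Eq. 1, the following Python/NumPy sketch draws a truncated slice of an IBP-distributed binary matrix via the stick-breaking construction. It is not the authors' code; the truncation level K, the function name, and the use of NumPy are illustrative assumptions.

import numpy as np

def sample_ibp_stick_breaking(M, alpha, K=200, rng=None):
    # Draw an M x K slice of an IBP-distributed binary matrix via Eq. 1.
    # K truncates the infinite matrix: pi_k decreases geometrically in
    # expectation, so columns beyond a moderate K are almost surely all zero.
    rng = np.random.default_rng(rng)
    mu = rng.beta(alpha, 1.0, size=K)   # stick-breaking proportions mu_k ~ Beta(alpha, 1)
    pi = np.cumprod(mu)                 # pi_k = prod_{j<=k} mu_j, strictly decreasing
    B = rng.random((M, K)) < pi         # b_mk ~ Bernoulli(pi_k), independently for each row m
    return B.astype(int), pi

For example, sample_ibp_stick_breaking(M=10, alpha=2.0) returns a 10 x 200 binary matrix in which, in expectation, roughly alpha columns per row are non-zero and the non-zero columns are concentrated at small k.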
2.3. IBP compound Dirichlet process

We now develop a prior over a set of discrete probability distributions that decorrelates which components occur and in what proportion. Rather than assigning positive probability mass to all components for every data point, as in the HDP, our model assigns positive probability to only a subset of components, selected independently of their masses.

The IBP provides a method for selecting subsets from a countably infinite set of components. Thus, one way of achieving our goal is to introduce the IBP directly into the HDP, using the mth row of the IBP to determine a subset of the infinite set of atoms present in the top-level DP sample G_0. This defines an (unnormalized) measure that can be used as the base measure for a data-specific DP. This can be seen as an infinite spike and slab model. Spike and slab models describe a mixture model between a continuous distribution (the "slab") and the measure degenerate at zero; a "spike" distribution determines which variables are drawn from the slab and which are zero. (In its original form the slab was a uniform distribution. However, the concept and terminology have also been employed in models where the slab is not the uniform distribution, see for example Ishwaran & Rao (2005), and it is in this more general sense that we use the term.) In the model above, the spikes are provided by the IBP, and the slab is provided by the top-level DP.

However, better choices for the top-level "slab" distribution are available. Draws from the DP "slab" are constrained to sum to one, which restricts the distribution over component proportions and introduces dependencies between atom masses that can lead to difficulties in developing inference schemes. While we wish to ensure that the base measure of the lower-level DP is still normalizable (i.e., is drawn from a convergent process), we do not need to rely on the slab to enforce this constraint. Since the IBP selects a finite number of components for each data point, it is sufficient merely to ensure that the sum of any finite subset of top-level atoms is finite. Thus, rather than drawing the atoms of the slab from a DP, we sample their masses as independent gamma random variables. This eliminates the restrictions on component proportions imposed by a DP and, since the resulting component proportions are independent, makes inference much easier. The model assumes the following generative process:

1. for k = 1, 2, . . . ,
   (a) Sample the stick length π_k according to Eq. 1.
   (b) Sample the relative mass φ_k ∼ Gamma(γ, 1).
   (c) Sample the atom location ψ_k ∼ H.
2. for m = 1, . . . , M,
   (a) Sample a binary vector b_m according to Eq. 1.
   (b) Sample the lower-level DP,
       G_m ∼ DP( ∑_k b_mk φ_k ,  ∑_k b_mk φ_k δ_{ψ_k} / ∑_k b_mk φ_k ).

In sampling the lower-level DP, masses are assigned to the atoms ψ_k independently of their locations. Since the number of locations selected by the binary vector b_m is finite almost surely, these masses can be sampled from a Dirichlet distribution defined over the selected φ_k:

  θ_m ∼ Dirichlet(b_m · φ),   G_m = ∑_k θ_mk δ_{ψ_k},

where b_m · φ is the Hadamard (elementwise) product of b_m and φ. If we marginalize out the sparse binary matrix B and the gamma random variables φ_k, the atom masses are distributed according to a mixture of Dirichlet distributions governed by the IBP:

  p(θ_m | α, γ) = ∑_B ∫ dφ  p(θ_m | B, φ) p(B | α) p(φ | γ),   (2)

where

  B ∼ IBP(α),   φ_k ∼ Gamma(γ, 1),   θ_m ∼ Dirichlet(b_m · φ).

We call this model the IBP compound Dirichlet process (ICD), since the IBP provides the mixing measure for a mixture of Dirichlet distributions. Like the HDP, this model is a form of dependent Dirichlet process (MacEachern, 2000). The ICD achieves our goal of decoupling how often the components occur and in what proportion: the IBP draw B selects a subset of atoms for each distribution, and the gamma random variables determine the relative masses associated with these atoms.
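The decoupling property of the ICD can be seen directly in a forward simulation. The sketch below (Python/NumPy, not from the paper; the truncation level and names are illustrative) draws B, φ, and the per-data-point proportions θ_m, from which one can check that a column's occurrence frequency (governed by π_k) and its average proportion when it occurs (governed by φ_k) vary independently.

import numpy as np

def sample_icd(M, alpha, gamma, K=200, rng=None):
    # Truncated draw from the ICD prior:
    #   B ~ IBP(alpha) via Eq. 1, phi_k ~ Gamma(gamma, 1), theta_m ~ Dirichlet(b_m * phi).
    rng = np.random.default_rng(rng)
    pi = np.cumprod(rng.beta(alpha, 1.0, size=K))
    B = (rng.random((M, K)) < pi).astype(int)
    phi = rng.gamma(gamma, 1.0, size=K)
    theta = np.zeros((M, K))
    for m in range(M):
        active = np.flatnonzero(B[m])
        if active.size == 0:                          # a row may select no components
            continue
        theta[m, active] = rng.dirichlet(phi[active]) # Dirichlet over the selected atoms only
    return B, phi, theta

Comparing B.mean(axis=0) (how often each component occurs) with the mean of theta over the rows where that component is present illustrates the decoupling: unlike in the HDP, a rarely occurring component can still receive a large proportion.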
2.4. Focused Topic Models

Suppose H parametrizes distributions over words. Then the ICD defines a generative topic model, in which it is used to generate a set of sparse distributions over an infinite number of components, called "topics." Each topic is drawn from a Dirichlet distribution over words. In order to specify a fully generative model, we sample the number of words for each document from a negative binomial distribution, n_·^(m) ∼ NB(∑_k b_mk φ_k, 1/2). (Notation: n_k^(m) is the number of words assigned to the kth topic of the mth document, and a dot denotes summation over an index, i.e., n_·^(m) = ∑_k n_k^(m).)

The generative model for M documents is:

1. for k = 1, 2, . . . ,
   (a) Sample the stick length π_k according to Eq. 1.
   (b) Sample the relative mass φ_k ∼ Gamma(γ, 1).
   (c) Draw the topic distribution over words, β_k ∼ Dirichlet(η).
2. for m = 1, . . . , M,
   (a) Sample a binary vector b_m according to Eq. 1.
   (b) Draw the total number of words, n_·^(m) ∼ NB(∑_k b_mk φ_k, 1/2).
   (c) Sample the distribution over topics, θ_m ∼ Dirichlet(b_m · φ).
   (d) For each word w_mi, i = 1, . . . , n_·^(m),
       i. Draw the topic index z_mi ∼ Discrete(θ_m).
       ii. Draw the word w_mi ∼ Discrete(β_{z_mi}).

Figure 1. Graphical model for the focused topic model.

We call this the focused topic model (FTM) because the infinite binary matrix B serves to focus the distribution over topics onto a finite subset (see Figure 1). The number of topics within a single document is almost surely finite, though the total number of topics is unbounded. The topic distribution for the mth document, θ_m, is drawn from a Dirichlet distribution over the topics selected by b_m. The Dirichlet distribution models uncertainty about topic proportions while maintaining the restriction to a sparse set of topics.

The ICD models the distribution over the global topic proportion parameters φ separately from the distribution over the binary matrix B. This captures the idea that a topic may appear infrequently in a corpus but make up a high proportion of those documents in which it occurs and, conversely, that a topic may appear frequently in a corpus but only with low proportion.
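The generative process above can be simulated directly. The following sketch (Python/NumPy) forward-samples a small synthetic corpus under stated assumptions: a truncation level K, a vocabulary size V, a symmetric Dirichlet parameter eta for the topics, and NumPy's negative_binomial parameterization, in which NB(s, 1/2) has mean s.

import numpy as np

def sample_ftm_corpus(M, V, alpha, gamma, eta=0.1, K=200, rng=None):
    # Forward-simulate documents from the FTM generative process of Section 2.4.
    rng = np.random.default_rng(rng)
    pi = np.cumprod(rng.beta(alpha, 1.0, size=K))     # Eq. 1 stick lengths
    phi = rng.gamma(gamma, 1.0, size=K)               # relative masses phi_k
    beta = rng.dirichlet(np.full(V, eta), size=K)     # topic-word distributions beta_k
    docs = []
    for m in range(M):
        b = rng.random(K) < pi                        # which topics this document focuses on
        active = np.flatnonzero(b)
        if active.size == 0:
            docs.append(np.array([], dtype=int))
            continue
        s = float(phi[active].sum())
        n_m = rng.negative_binomial(s, 0.5)           # document length, NB(sum_k b_mk phi_k, 1/2)
        theta = rng.dirichlet(phi[active])            # proportions over the focused topics
        z = rng.choice(active, size=n_m, p=theta)     # topic index for each word
        words = np.array([rng.choice(V, p=beta[k]) for k in z], dtype=int)
        docs.append(words)
    return docs, beta

Each returned document is an array of word indices; documents draw on only the topics selected by their binary vectors, regardless of how rare those topics are in the rest of the corpus.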
3. Related Models

Titsias (2007) introduced the infinite gamma-Poisson process, a distribution over unbounded matrices of non-negative integers, and used it as the basis for a topic model of images. In this model, the distribution over features for the mth image is given by a Dirichlet distribution over the non-negative elements of the mth row of the infinite gamma-Poisson process matrix, with parameters proportional to the values at these elements. While this results in a sparse matrix of distributions, the number of zero entries in any column of the matrix is correlated with the values of the non-zero entries: columns whose entries have large values will not typically be sparse. Therefore, this model will not decouple across-data prevalence and within-data proportions of topics. In the ICD, the number of zero entries is controlled by a separate process, the IBP, from the values of the non-zero entries, which are controlled by the gamma random variables.

The sparse topic model (SparseTM, Wang & Blei, 2009) uses a finite spike and slab model to ensure that each topic is represented by a sparse distribution over words. The spikes are generated by Bernoulli draws with a single topic-wide parameter. The topic distribution is then drawn from a symmetric Dirichlet distribution defined over these spikes. The ICD also uses a spike and slab approach, but allows an unbounded number of "spikes" (due to the IBP) and a more globally informative "slab" (due to the shared gamma random variables). We extend the SparseTM's approximation of the expectation of a finite mixture of Dirichlet distributions to approximate the more complicated mixture of Dirichlet distributions given in Eq. 2.

Recent work by Fox et al. (2009) uses draws from an IBP to select subsets of an infinite set of states, to model multiple dynamic systems with shared states. (A state in the dynamic system is like a component in a mixed membership model.) The probability of transitioning from the ith state to the jth state in the mth dynamic system is drawn from a Dirichlet distribution whose parameters are proportional to the elements b_mj of the IBP draw, plus an additional fixed mass on the self-transition i = j. This model does not allow sharing of information about the within-data probability of a state between data points, which is modeled in the ICD via the gamma-distributed φ_k. An alternative inference scheme is also used, in which the IBP matrix is sampled rather than integrated out.

A number of other models have been proposed to address the rigid topic correlation structure assumed in the LDA and HDP topic models, including the correlated topic model (CTM, Blei & Lafferty, 2005) and the pachinko allocation model (PAM, Li & McCallum, 2006). Our aim is different: the FTM reduces undesirable correlations between the prevalence of a topic across the corpus and its proportion within any particular document, rather than adding new correlations between topics. Correlations among topics could be integrated into the FTM in future work.

4. Posterior Inference

We use Gibbs sampling for posterior inference over the latent variables. The algorithm cyclically samples the value of a single variable from its conditional distribution given the remaining variables. To improve mixing time, we use a collapsed Gibbs sampler, integrating out the topic-word distributions β, the topic mixture distributions θ, and the sparsity pattern B. We use an approximate method, described in the appendix, to integrate out the infinite-dimensional sparsity pattern B. We sample only the global topic proportion variables φ, the global topic sparsity probability variables π, and the topic assignments z.

4.1. Sampling z

The conditional distribution of the topic assignment of the ith word in the mth document depends on the posterior distribution of the topic proportions θ_m for that document:

  p(z_mi = k | z_¬mi, w_mi, w_¬mi, Ξ) ∝ p(w_mi | z_mi = k, z_¬mi, w_¬mi) p(z_mi = k | z_¬mi, Ξ)
                                      ∝ (n_{k,¬i}^{(w_mi)} + η) ∫ dθ_m p(z_mi = k | θ_m) p(θ_m | z_¬mi, Ξ),

where Ξ = {π_·, φ_·, n_·^{(m)}, α, γ}, and n_k^{(w)} is the number of times word w has been assigned to topic k in the vector of assignments z. We use π_· and φ_· to represent those elements of π and φ associated with topics that are currently represented in the corpus; the remaining elements are associated with unused topics and their values are therefore unknown.

Conditioned on the sparse binary vector b_m and the gamma random variables φ, the topic mixture distribution θ_m is Dirichlet distributed: the sparse vector b_m determines the subset of topics over which the Dirichlet distribution is defined, and the gamma random variables determine the values of the Dirichlet parameters at these points. If we integrate out the sparse binary vector b_m, then rather than sampling θ_m from a single Dirichlet distribution we must sample it from an infinite mixture of Dirichlet distributions, with the IBP determining the mixing proportions:

  p(θ_m | z_¬mi, Ξ) ∝ ∑_{b_m} ∫ dφ  Dirichlet(θ_m | (n_{¬i}^{(m)} + φ) · b_m)  p(b_m, φ, n_·^{(m)} | π_·, φ_·, α, γ),   (3)

where n_{¬i}^{(m)} is the vector of topic assignment counts for document m excluding word w_mi. We cannot integrate out the sparse binary vector b_m exactly due to its combinatorial nature. Fortunately, we only ever need the expected values of θ_m given z_¬mi and Ξ, since ∫ dθ_m p(z_mi = k | θ_m) p(θ_m | z_¬mi, Ξ) = E[θ_mk | z_¬mi, Ξ] (from Eq. 3). This expectation can be efficiently approximated via the procedure detailed in the appendix.
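A sketch of the resulting collapsed update for a single z_mi follows (Python/NumPy). It is a minimal illustration, not the authors' implementation: it assumes the approximate expectations E[θ_mk | z_¬mi, Ξ] have already been computed as in the appendix (with a final entry for a previously unseen topic), and it uses the standard collapsed word likelihood (n_{k,¬i}^{(w_mi)} + η)/(n_{k,¬i}^{(·)} + Vη), whose normalizer the proportionality above leaves implicit.

import numpy as np

def resample_topic_assignment(n_wk_excl, n_k_excl, expected_theta, eta, V, rng=None):
    # One collapsed Gibbs update for z_mi (Section 4.1).
    #   n_wk_excl:      per-topic counts of the current word type, excluding z_mi
    #   n_k_excl:       per-topic total word counts, excluding z_mi
    #   expected_theta: approximate E[theta_mk | z_-mi, Xi] per topic, with a trailing
    #                   entry for assignment to a previously unseen topic (appendix, case 3)
    rng = np.random.default_rng(rng)
    word_lik = (n_wk_excl + eta) / (n_k_excl + V * eta)   # smoothed word likelihood; for the
    # unseen-topic entry both counts are zero, so this reduces to the prior predictive 1/V
    weights = word_lik * expected_theta                   # proportional to p(z_mi = k | rest)
    weights = weights / weights.sum()
    return rng.choice(len(weights), p=weights)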
4.2. Sampling π and φ

To sample π and φ, we re-instantiate the binary matrix B as an auxiliary variable and iteratively sample B, π, and φ. We categorize the columns of B as "active" if n_k^{(·)} > 0 and "inactive" otherwise.

The total number of words in the mth document assigned to the kth topic is distributed as NB(b_mk φ_k, 1/2). The joint probability of φ_k and the total number of words assigned to the kth topic is therefore

  p(φ_k, n_k^{(·)} | b_k, γ) = p(φ_k | γ) ∏_{m=1}^{M} p(n_k^{(m)} | b_mk, φ_k)
                             = [φ_k^{γ−1} e^{−φ_k} / Γ(γ)] ∏_{m: b_mk = 1} Γ(φ_k + n_k^{(m)}) / [Γ(φ_k) n_k^{(m)}! 2^{φ_k + n_k^{(m)}}].   (4)

This is log-differentiable with respect to φ_k and γ, so we use Hybrid Monte Carlo (MacKay, 2002) to sample from the posteriors of φ_k and γ.

To sample the π_k, we follow an approach similar to the semi-ordered stick-breaking scheme of Teh et al. (2007). The "active" features are distributed according to

  π_k | B ∼ Beta( ∑_{m=1}^{M} b_mk ,  1 + M − ∑_{m=1}^{M} b_mk ),   (5)

and the "inactive" features are strictly ordered as suggested by Eq. 1. (Note that the definition of "active" and "inactive" features given here differs slightly from that in Teh et al. (2007): we consider a feature with n_k^{(·)} = 0 to be inactive, and therefore subject to strict ordering, regardless of whether ∑_m b_mk > 0.) Since the binary matrix B is discarded after this step, and we only use the active π_k to sample the topic allocations, we have no interest in the inactive features and only need to sample the active elements of B and π. The elements of B are sampled according to the following probabilities:

  p(b_mk = 1 | π_k, φ_k, n_k^{(m)}) = 1                                            if n_k^{(m)} > 0,
  p(b_mk = 0 | π_k, φ_k, n_k^{(m)}) = 2^{φ_k}(1 − π_k) / [π_k + 2^{φ_k}(1 − π_k)]   if n_k^{(m)} = 0,
  p(b_mk = 1 | π_k, φ_k, n_k^{(m)}) = π_k / [π_k + 2^{φ_k}(1 − π_k)]                if n_k^{(m)} = 0.   (6)

With equations 3, 4, 5 and 6 we have specified the full Gibbs sampler for the FTM. (Matlab code is available from the authors.) From the states of this sampler, we can compute topics, topic proportions, and sparsity patterns.
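The two auxiliary-variable steps of Section 4.2 are easy to express in code. The sketch below (Python with NumPy and SciPy, hypothetical names, not the authors' code) resamples the active entries of B column by column using Eq. 6, and evaluates the log of Eq. 4 up to terms constant in φ_k, which is the quantity a Hybrid Monte Carlo or slice sampler would need; the HMC step itself is not shown.

import numpy as np
from scipy.special import gammaln

def resample_b_column(n_mk, pi_k, phi_k, rng=None):
    # Resample b_mk for one active topic k across all documents m (Eq. 6).
    rng = np.random.default_rng(rng)
    b = np.ones_like(n_mk, dtype=int)                    # n_k^(m) > 0 forces b_mk = 1
    zero = (n_mk == 0)
    p_on = pi_k / (pi_k + 2.0 ** phi_k * (1.0 - pi_k))   # posterior probability that b_mk = 1
    b[zero] = (rng.random(int(zero.sum())) < p_on).astype(int)
    return b

def log_post_phi_k(phi_k, n_mk, b_mk, gamma):
    # Log of Eq. 4 as a function of phi_k, dropping the n_k^(m)! terms,
    # which do not depend on phi_k.
    if phi_k <= 0:
        return -np.inf
    lp = (gamma - 1.0) * np.log(phi_k) - phi_k - gammaln(gamma)   # Gamma(gamma, 1) prior
    n_on = n_mk[b_mk == 1]
    lp += np.sum(gammaln(phi_k + n_on) - gammaln(phi_k)
                 - (phi_k + n_on) * np.log(2.0))                  # NB(phi_k, 1/2) likelihood terms
    return lp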
5. Empirical study

We compared the performance of the FTM to the HDP topic model on three datasets:

· PNAS: a collection of 1766 abstracts from the Proceedings of the National Academy of Sciences (PNAS) from between 1991 and 2001. The vocabulary contains 2452 words.
· 20 Newsgroups: a collection of 1000 randomly selected articles from the 20 newsgroups dataset (http://people.csail.mit.edu/jrennie/20Newsgroups/). The vocabulary contains 1407 words.
· Reuters-21578: a collection of 2000 randomly selected documents from the Reuters-21578 dataset (http://kdd.ics.uci.edu/databases/reuters21578/). The vocabulary contains 1472 words.

For each dataset, the vocabulary excluded stop-words and words occurring in fewer than 5 documents. In both the FTM and HDP topic models, we fixed the topic distribution hyperparameter η to 0.1. Following Teh et al. (2006), we used priors of γ ∼ Gamma(5, 0.1) and α ∼ Gamma(0.1, 0.1) in the HDP. In the FTM, we used a Gamma(5, 0.1) prior for γ and fixed α = 5.

First, we examined test-set perplexity, a measure of how well the models generalize to new data. Figure 2 (top) shows the test-set perplexities obtained on each dataset for the two models, using 5-fold cross-validation. To obtain these measurements, we ran both Gibbs samplers for 1000 iterations, discarding the first 500. In each case, the FTM achieves better (lower) perplexity on the held-out data.

We developed the FTM to decorrelate the probability of a topic being active within a document from its proportion within the documents attributed to it. To check whether this is observed in the posterior, we compared two statistics for each topic found. The topic presence frequency of a topic is the fraction of documents in the corpus that contain at least one incidence of that topic. The topic proportion of a topic is the fraction of the words in the corpus attributed to that topic. The correlation between topic presence frequencies and topic proportions is shown in Figure 2 (bottom). In each case we see lower correlation for the FTM. In fact, the dataset that exhibits the greatest improvement in held-out perplexity under the FTM also exhibits the greatest decrease in correlation between frequency and proportion.

Figure 2. Experimental comparison between FTM (dark blue) and HDP (pale blue) on three datasets. Each point represents the result on one fold, computed with the other folds as training data; dashed lines connect results from the same fold. Top: test-set perplexities (lower is better). Bottom: correlation between topic presence frequency and topic proportion. The FTM reduces the correlation between them.

In our study, we observed that the FTM posterior prefers a more compact representation. It contains fewer topics overall than the HDP, but more topics per document. While the HDP uses a larger number of topics to model the corpus than the FTM, most of these topics appear in only a handful of documents. In addition, individual words in the vocabulary are, on average, associated with more topics under the HDP, meaning that FTM topics are generally more distinct. These findings are illustrated in Figure 3. (These results are for 20 Newsgroups; the other datasets exhibited similar properties.)

Figure 3. (a) Histogram of the number of topics a word appears in for the FTM (dark blue, left) and the HDP (pale blue, right) on 20 Newsgroups data. In the FTM, words are generally associated with fewer topics. (b) Histogram of the number of documents a topic appears in for the FTM (dark blue, left) and HDP (pale blue, right) on 20 Newsgroups data. In both models, topics appearing in more than 200 documents have been excluded to focus on the low-frequency topics. The HDP has many more topics that appear in only a very few documents.

Finally, we note that the computational complexity of both models grows with the number of topics represented. When the number of topics is equal, a single iteration of the FTM algorithm is more costly than a single iteration of the HDP. However, the more compact representation of the FTM yields similar runtime: in the data we analyzed, the FTM analysis was as fast as the HDP analysis.
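The decorrelation statistic reported in Figure 2 (bottom) can be computed directly from the topic-assignment counts of a posterior sample. The sketch below implements the two definitions above literally; the exact estimator used by the authors (for example, how samples were aggregated across iterations) is not specified, so this is only an illustration.

import numpy as np

def prevalence_proportion_correlation(counts):
    # counts[m, k] = number of words in document m assigned to topic k.
    # Returns the per-topic presence frequency, the per-topic proportion of
    # corpus words, and the correlation between the two (Figure 2, bottom).
    counts = np.asarray(counts, dtype=float)
    used = counts.sum(axis=0) > 0                             # ignore empty topics
    presence_freq = (counts[:, used] > 0).mean(axis=0)        # fraction of documents containing the topic
    topic_prop = counts[:, used].sum(axis=0) / counts.sum()   # fraction of corpus words in the topic
    corr = np.corrcoef(presence_freq, topic_prop)[0, 1]
    return presence_freq, topic_prop, corr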
6. Discussion

We developed the IBP compound Dirichlet process (ICD), a Bayesian nonparametric prior over discrete, dependent distributions, which is an alternative to the hierarchical Dirichlet process (HDP). The ICD decouples across-data prevalence and within-data proportion. We used the ICD to construct the focused topic model (FTM), a generative model for collections of documents, and demonstrated that the FTM provides a better fit to text than the HDP topic model. The representations obtained by the FTM exhibit lower correlation between across-data prevalence and within-data proportion than those obtained by the HDP.

We have concentrated on this correlation in HDPs. The correlation does not occur in finite mixed membership models whose mixture proportions are drawn from a fixed, finite-dimensional Dirichlet distribution, since those draws are i.i.d. across data points. However, restriction to a fixed Dirichlet distribution limits the flexibility of mixed membership models, and if the Dirichlet parameter is unknown, a similar correlation bias arises as with the HDP. Although easily derived, there is currently no parametric counterpart of the ICD.

The idea of removing correlation between global and local probabilities is potentially applicable to a range of models. For example, in a state transition sequence, a certain state may be inaccessible from most other states, but occur with high probability following a small subset of states. Such a relationship will be poorly captured by the HDP-based hidden Markov model (Teh et al., 2006), as it tends to correlate the global occurrence of a state with the probability of transitioning to it from another state. A model based on the ICD could better capture such relationships. Exploring which HDP applications benefit most from this decoupling is an avenue for further research.

Acknowledgments. We thank the anonymous reviewers for valuable comments. David M. Blei is supported by ONR 175-6343 and NSF CAREER 0745520.

References

Blei, D. M. and Lafferty, J. Correlated topic models. In NIPS, volume 18, 2005.
Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
Erosheva, E., Fienberg, S., and Lafferty, J. Mixed-membership models of scientific publications. PNAS, 101(Suppl 1):5220–5227, April 2004.
Fox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. Sharing features among dynamical systems with beta processes. In NIPS, volume 22, 2009.
Ghosal, S. Dirichlet process, related priors and posterior asymptotics. In Hjort, N. L., Holmes, C., Müller, P., and Walker, S. G. (eds.), Bayesian Nonparametrics, Cambridge Series in Statistical and Probabilistic Mathematics, pp. 35–79. Cambridge University Press, 2010.
Griffiths, T. L. and Ghahramani, Z. Infinite latent feature models and the Indian buffet process. In NIPS, volume 18, 2005.
Ishwaran, H. and Rao, J. S. Spike and slab variable selection: Frequentist and Bayesian strategies. Annals of Statistics, 33:730–773, 2005.
Li, W. and McCallum, A. Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML, volume 23, 2006.
MacEachern, S. N. Dependent Dirichlet processes. Technical report, Department of Statistics, Ohio State University, 2000.
MacKay, D. J. C. Information Theory, Inference & Learning Algorithms. Cambridge University Press, 2002.
McLachlan, G. and Peel, D. Finite Mixture Models. John Wiley & Sons, 2000.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. Hierarchical Dirichlet processes. JASA, 101(476):1566–1581, 2006.
Teh, Y. W., Görür, D., and Ghahramani, Z. Stick-breaking construction for the Indian buffet process. In AISTATS, volume 11, 2007.
Titsias, M. K. The infinite gamma-Poisson feature model. In NIPS, volume 21, 2007.
Wang, C. and Blei, D. M. Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. In NIPS, volume 22, 2009.

A. Approximating the expectation of θ_m

Recall that π_· and φ_· denote those elements of π and φ associated with topics that are currently represented in the corpus; the remaining elements, associated with unused topics, are unknown. As described in Section 4.1, in order to sample the topic allocations z we wish to approximate the expectation of θ_m under the posterior distribution in Equation 3. This is an extension of the technique used to approximate the expectation of the distribution over words in Wang & Blei (2009). The expectation of the kth element of θ_m is

  E[θ_mk | z_¬mi, Ξ] ∝ ∑_{b_m : b_mk = 1} ∫ dφ  (n_{k,¬i}^{(m)} + φ_k)  p(b_m, φ, n_·^{(m)} | π_·, φ_·, α, γ)
                     ∝ ∑_{b_m : b_mk = 1} ∫ dφ  (n_{k,¬i}^{(m)} + φ_k)  p(b_m | π_·, α) p(φ | φ_·, γ) / [ 2^{∑_j b_mj φ_j} ( n_{·,¬i}^{(m)} + ∑_j b_mj φ_j ) ].

We divide ∑_j b_mj φ_j into a "known" term X, corresponding to those topics with which words in the mth document are associated, and an "unknown" term Y, corresponding to those topics with which no words in the mth document are associated:

  ∑_j b_mj φ_j = ∑_{j : n_{j,¬i}^{(m)} > 0} φ_j + ∑_{j : n_{j,¬i}^{(m)} = 0} b_mj φ_j ≜ X + Y.

Let g_{X,k}(Y) ≜ (n_{k,¬i}^{(m)} + b_mk φ_k) / [ 2^{X+Y} ( n_{·,¬i}^{(m)} + X + Y ) ]. Then we can write the expectation of θ_mk in terms of the expectation of g_{X,k}(Y):

  E[θ_mk | Ξ, n_{¬i}^{(m)}] ∝ E[ g_{X,k}(Y) | π_·, φ_· ].   (7)

Consider approximating this expectation in three scenarios:

1. n_{k,¬i}^{(m)} > 0: the kth topic is currently represented in the document under consideration;
2. n_{k,¬i}^{(m)} = 0 and n_{k,¬i}^{(·)} > 0: the kth topic is not currently represented in the document under consideration, but is currently represented in the corpus;
3. n_{k,¬i}^{(·)} = 0: the kth topic is not currently represented in the corpus.

In the first case, where n_{k,¬i}^{(m)} > 0, we know that b_mk = 1, so the expectation in equation 7 is

  E[ g_{X,k}(Y) | π_·, φ_·, n^{(m)} ] = (n_{k,¬i}^{(m)} + φ_k) E[ 1 / ( 2^{X+Y} ( n_{·,¬i}^{(m)} + X + Y ) ) ].   (8)

We divide Y into two components, Y = Y′ + Y″, where Y′ = ∑_{j ∈ J_m} b_mj φ_j and Y″ = ∑_{j ∈ J″} b_mj φ_j; here J_m is the set of j such that n_{j,¬i}^{(m)} = 0 and n_{j,¬i}^{(·)} > 0, and J″ is the set of j such that n_{j,¬i}^{(·)} = 0. The expectation and variance of Y conditioned on π_·, φ_·, α and γ are

  E[Y | π_·, φ_·, α, γ] = E[Y′ | π_·, φ_·] + E[Y″ | α, γ],
  V[Y | π_·, φ_·, α, γ] = V[Y′ | π_·, φ_·] + V[Y″ | α, γ],   (9)

where

  E[Y′ | π_·, φ_·] = ∑_{j ∈ J_m} π_j φ_j,   V[Y′ | π_·, φ_·] = ∑_{j ∈ J_m} φ_j^2 π_j (1 − π_j),   (10)

and E[Y″ | α, γ] and V[Y″ | α, γ] follow from the stick-breaking representation of the features that are inactive in the corpus, together with the Gamma(γ, 1) prior on their masses. In summary, Y′ is a summation over a finite number of topics, whose number grows in expectation as α(log(M) − 1), where M is the total number of documents, and Y″ is a summation over a countably infinite number of topics. If the number of documents, and therefore the expected total number of topics, is large, then the central limit theorem suggests that we can approximate X + Y using a Gaussian distribution with mean and variance as described in equations 9 and 10. We can then approximate the expectation in equation 8 using a second-order Taylor expansion.

In the second case, where n_{k,¬i}^{(m)} = 0 and n_{k,¬i}^{(·)} > 0, the expectation in equation 7 becomes

  E[ g_{X,k}(Y) | π_·, φ_· ] = π_k φ_k E[ 1 / ( 2^{X + Y_{¬k} + φ_k} ( n_{·,¬i}^{(m)} + X + Y_{¬k} + φ_k ) ) ],

where Y_{¬k} = Y − b_mk φ_k. We approximate the expectation as described for the first case, noting the slight changes in the expectation and variance of Y_{¬k} compared with Y.

In the third case, where n_{k,¬i}^{(·)} = 0, we consider the probability of assigning the ith word to one of the infinite number of unobserved topics. Let z = u indicate assignment to a previously unseen topic. Equation 7 becomes

  E[ g_{X,u}(Y) | π_·, φ_· ] = E[ Y″ / ( 2^{X + Y′ + Y″} ( n_{·,¬i}^{(m)} + X + Y′ + Y″ ) ) ].   (11)

We approximate the expectation in equation 11 using a multivariate second-order Taylor expansion, noting that Y′ and Y″ are independent conditioned on π_·, φ_·, α and γ. We have evaluated the quality of this approximation by comparing the approximated expectations with those estimated using Monte Carlo simulations, and found that the approximation holds well provided γ is not large. The values of γ learned by the model in the experiments of Section 5 fall within this regime.
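The first-case approximation can be written compactly in code. The sketch below (Python/NumPy, illustrative names, not the authors' code) moment-matches Y with a Gaussian using Eq. 9 and applies a second-order Taylor expansion of f(t) = 2^(−t)/(c + t) around t = X + E[Y], which is the expectation needed in Eq. 8; the moments of Y″ are taken as inputs, since their closed form is not reproduced above, and the second derivative of f is derived analytically under these assumptions.

import numpy as np

LOG2 = np.log(2.0)

def moments_Y(pi_unused, phi_unused, mean_Y2, var_Y2):
    # Eq. 9-10: moments of Y = Y' + Y''. The Y' moments come from the corpus-active
    # topics not used in this document; the Y'' moments are supplied by the caller.
    mean_Y1 = np.sum(pi_unused * phi_unused)
    var_Y1 = np.sum(phi_unused ** 2 * pi_unused * (1.0 - pi_unused))
    return mean_Y1 + mean_Y2, var_Y1 + var_Y2

def taylor_expectation_case1(c, X, mean_Y, var_Y):
    # Approximate E[ 1 / (2^(X+Y) (c + X + Y)) ] with Y ~ Normal(mean_Y, var_Y),
    # where c plays the role of n_{.,-i}^{(m)} in Eq. 8.
    t = X + mean_Y
    g = 1.0 / (c + t)
    f = np.exp(-LOG2 * t) * g
    # Second derivative of f(t) = 2^{-t} / (c + t):
    f2 = np.exp(-LOG2 * t) * (LOG2 ** 2 * g + 2.0 * LOG2 * g ** 2 + 2.0 * g ** 3)
    return f + 0.5 * var_Y * f2

A quick Monte Carlo check, averaging 1/(2^(X+Y)(c+X+Y)) over simulated values of Y and comparing with the output of taylor_expectation_case1, reproduces the kind of validation described above.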