Distributed Inference for Latent Dirichlet Allocation

David Newman, Arthur Asuncion, Padhraic Smyth, Max Welling
Department of Computer Science
University of California, Irvine
{newman,asuncion,smyth,welling}@ics.uci.edu

Abstract

We investigate the problem of learning a widely-used latent-variable model, the Latent Dirichlet Allocation (LDA) or "topic" model, using distributed computation, where each of $P$ processors only sees $1/P$ of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates; it is simple to implement and can be viewed as an approximation to a single-processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across processors; it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors, and speedup experiments of learning topics in a 100-million-word corpus using 16 processors.

1 Introduction

Very large data sets, such as collections of images, text, and related data, are becoming increasingly common, with examples ranging from digitized collections of books by companies such as Google and Amazon, to large collections of images at Web sites such as Flickr, to the recent Netflix customer recommendation data set. These data sets present major opportunities for machine learning, such as the ability to explore much richer and more expressive models, as well as providing new and interesting domains for the application of learning algorithms. However, the scale of these data sets also brings significant challenges for machine learning, particularly in terms of computation time and memory requirements. For example, a text corpus with 1 million documents, each containing 1000 words on average, will require approximately 12 GBytes of memory to store the $10^9$ words, which is beyond the main memory capacity of most single-processor machines. Similarly, if one were to assume that a simple operation (such as computing a probability vector over categories using Bayes rule) takes on the order of $10^{-6}$ seconds per word, then a full pass through $10^9$ words will take 1000 seconds. Thus, algorithms that make multiple passes over a corpus of this size (as occurs in many clustering and classification algorithms) will have run times in days.

An obvious approach for addressing these time and memory issues is to distribute the learning algorithm over multiple processors [1, 2, 3]. In particular, with $P$ processors, it is somewhat trivial to get around the memory problem by distributing $1/P$ of the total data to each processor. However, the computation problem remains non-trivial for a fairly large class of learning algorithms, namely how to combine local processing on each of the $P$ processors to arrive at a useful global solution.

In this general context we investigate distributed learning algorithms for the LDA model [4].
LDA models are arguably among the most successful recent learning algorithms for analyzing count data such as text. However, they can take days to learn for large corpora, and thus distributed learning would be particularly useful for this type of model. The novel contributions of this paper are as follows:

- We introduce two algorithms that perform distributed inference for LDA models: the first is simple to implement but does not necessarily sample from the correct posterior distribution, while the second optimizes the correct posterior quantity but is more complex to implement and slower to run.
- We demonstrate that both distributed algorithms produce models that are statistically indistinguishable (in terms of predictive power) from models obtained on a single processor, and that they can learn these models much faster than a single processor while requiring storage of only $1/P$ of the data on each processor.

2 Latent Dirichlet Allocation

Before introducing our distributed algorithms for LDA, we briefly review the standard LDA model. LDA models each of $D$ documents as a mixture over $K$ latent topics, each of which is a multinomial distribution over a $W$-word vocabulary. For document $j$, we first draw a mixing proportion $\theta_{k|j}$ from a Dirichlet with parameter $\alpha$. For the $i$th word in the document, a topic $z_{ij}$ is drawn with topic $k$ chosen with probability $\theta_{k|j}$; then word $x_{ij}$ is drawn from topic $z_{ij}$, with $x_{ij}$ taking on value $w$ with probability $\phi_{w|z_{ij}}$. Finally, a Dirichlet prior with parameter $\beta$ is placed on the topics $\phi_{w|k}$. Thus, the generative process is given by

$$\theta_{k|j} \sim \mathcal{D}[\alpha], \qquad \phi_{w|k} \sim \mathcal{D}[\beta], \qquad z_{ij} \sim \theta_{k|j}, \qquad x_{ij} \sim \phi_{w|z_{ij}}. \quad (1)$$

Given the observed words $x = \{x_{ij}\}$, the task of Bayesian inference is to compute the posterior distribution over the latent topic indices $z = \{z_{ij}\}$, the mixing proportions $\theta$, and the topics $\phi$. An efficient procedure is to use collapsed Gibbs sampling [5], where $\theta$ and $\phi$ are marginalized out and the latent variables $z$ are sampled. Given the current state of all but one variable $z_{ij}$, the conditional probability of $z_{ij}$ is

$$p(z_{ij} = k \mid z^{\neg ij}, x) \;\propto\; \frac{N^{\neg ij}_{wk} + \beta}{N^{\neg ij}_{k} + W\beta}\,\bigl(N^{\neg ij}_{kj} + \alpha\bigr), \qquad w = x_{ij}, \quad (2)$$

where $N_{wk}$ is the number of times word $w$ is assigned to topic $k$, $N_{kj}$ is the number of times topic $k$ occurs in document $j$, and the superscript $\neg ij$ means that the corresponding data item is excluded from the count values. We use the convention that missing indices are summed out, so that $N_k = \sum_w N_{wk}$ and $N_j = \sum_k N_{kj}$.
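As a concrete illustration of this collapsed Gibbs sampler, the following minimal Python sketch (not from the original paper) implements the update in Equation 2 with dense NumPy count arrays; `docs` is assumed to be a list of documents, each given as a list of integer word ids, and the variable names simply mirror the notation above.

```python
import numpy as np

def gibbs_lda(docs, W, K, alpha, beta, n_iter=500, seed=0):
    """Collapsed Gibbs sampling for LDA (Equation 2). docs: list of lists of word ids."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    N_wk = np.zeros((W, K))          # word-topic counts
    N_kj = np.zeros((K, D))          # topic-document counts
    N_k = np.zeros(K)                # topic counts
    z = []                           # topic assignment for every token

    # Random initialization of topic assignments and counts
    for j, doc in enumerate(docs):
        z_j = rng.integers(K, size=len(doc))
        z.append(z_j)
        for w, k in zip(doc, z_j):
            N_wk[w, k] += 1; N_kj[k, j] += 1; N_k[k] += 1

    for _ in range(n_iter):
        for j, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[j][i]
                # Remove the current assignment (the "not ij" counts)
                N_wk[w, k] -= 1; N_kj[k, j] -= 1; N_k[k] -= 1
                # Conditional p(z_ij = k | rest), Equation 2
                p = (N_wk[w] + beta) / (N_k + W * beta) * (N_kj[:, j] + alpha)
                k = rng.choice(K, p=p / p.sum())
                z[j][i] = k
                N_wk[w, k] += 1; N_kj[k, j] += 1; N_k[k] += 1
    return N_wk, N_kj, z
```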
3 Distributed Inference Algorithms for LDA

We now present two versions of LDA where the data and the parameters are distributed over $P$ distinct processors. We distribute the $D$ documents over the $P$ processors, with $D_p = D/P$ documents on each processor. We partition the data $x$ (the words from the $D$ documents) into $\{x_1, \dots, x_P\}$ and the corresponding topic assignments into $\{z_1, \dots, z_P\}$, where $x_p$ and $z_p$ exist only on processor $p$. Document-specific counts $N_{kj}$ are likewise distributed; however, every processor maintains its own copy of the word-topic and topic counts, $N_{wk}$ and $N_k$. We denote these processor-specific counts by $N_{wkp}$ and $N_{kp}$.

3.1 Approximate Distributed Inference

In our Approximate Distributed LDA model (AD-LDA), we simply implement LDA on each processor, and Gibbs sampling is performed simultaneously and independently on each of the $P$ processors, as if each processor thinks it is the only processor. On processor $p$, given the current state of all but one variable $z_{ijp}$, the topic assignment of the $i$th word in document $j$ on processor $p$ is sampled from

$$p(z_{ijp} = k \mid z^{\neg ij}_p, x_p) \;\propto\; \frac{N^{\neg ij}_{wkp} + \beta}{N^{\neg ij}_{kp} + W\beta}\,\bigl(N^{\neg ij}_{kjp} + \alpha\bigr), \qquad w = x_{ijp}. \quad (3)$$

Note that $N_{kp}$ is not the result of $P$ separate LDA models running on separate data: in particular, $\sum_k N_{kp} = N$, where $N$ is the total number of words across all processors, as opposed to $N_p$, the number of words on processor $p$. After processor $p$ has reassigned $z_p$, we have modified counts $N_{wkp}$, $N_{kjp}$, and $N_{kp}$. To merge back to a single set of counts, after a number of Gibbs sampling steps (e.g., after a single pass through the data on each processor) we perform the global update

$$N_{wk} \;\leftarrow\; N_{wk}^{\mathrm{old}} + \sum_p \bigl(N_{wkp} - N_{wk}^{\mathrm{old}}\bigr),$$

using a reduce-scatter operation, where $N_{wk}^{\mathrm{old}}$ are the counts that all processors started with before the sweep of the Gibbs sampler. The counts $N_k$ are then computed by $N_k = \sum_w N_{wk}$. Note that this global update correctly reflects the topic assignments $z$ (i.e., $N_{wk}$ can also be regenerated directly from $z$).

We can consider this algorithm to be an approximation to the single-processor Gibbs sampler in the following sense: at the start of each iteration, all of the processors have the same set of counts. However, as each processor starts sampling, the global count matrix is changing in a way that is unknown to each individual processor. Thus, in Equation 3, the sampling is not being done according to the true current global counts (or the true posterior distribution), but according to an approximation. We have experimented with "repairing" the reversibility of the sampler by adding a phase that re-traces the Gibbs moves starting at the (global) end-state, but we found that, due to the curse of dimensionality, virtually all such steps ended up being rejected.
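The following sketch (not from the paper's implementation) simulates one AD-LDA iteration in software, in the same spirit as the simulations described in Section 4. The routine `local_gibbs_sweep` is assumed to be a per-processor version of the sampler above that applies Equation 3 in place to its own copies of the counts; the actual distributed implementation would replace the final summation with a reduce-scatter.

```python
import numpy as np

def adlda_iteration(doc_shards, z_shards, N_kj_shards, N_wk, alpha, beta, local_gibbs_sweep):
    """One AD-LDA iteration: independent local sweeps, then the global count merge.

    doc_shards[p], z_shards[p], N_kj_shards[p] live only on processor p;
    N_wk is the shared word-topic count array that all processors start from.
    """
    N_wk_old = N_wk.copy()
    local_counts = []
    for p in range(len(doc_shards)):
        # Each processor sweeps with its own copy of the global counts (Equation 3)
        N_wkp = N_wk_old.copy()
        local_gibbs_sweep(doc_shards[p], z_shards[p], N_kj_shards[p], N_wkp, alpha, beta)
        local_counts.append(N_wkp)
    # Global update: N_wk = N_wk_old + sum_p (N_wkp - N_wk_old),
    # which a distributed implementation performs with a reduce-scatter.
    N_wk = N_wk_old + sum(N_wkp - N_wk_old for N_wkp in local_counts)
    return N_wk
```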
3.2 Hierarchical Distributed Inference

A more principled way to model parallel processes is to build them directly into the probabilistic model. Imagine a parent collection of topics $\phi_{w|k}$. This parent has children $\phi_{w|kp}$, which represent the topic distributions on the various processors. We assume $\phi_{\cdot|kp}$ is sampled from the parent $\phi_{\cdot|k}$ according to a Dirichlet distribution with topic-dependent strength parameter $b_k$. The model that lives on each processor is then simply an LDA model. Hence, the generative process is given by

$$\phi_{\cdot|kp} \sim \mathcal{D}[\,b_k\,\phi_{\cdot|k}\,], \qquad \theta_{k|jp} \sim \mathcal{D}[\alpha], \quad (4)$$
$$z_{ijp} \sim \theta_{k|jp}, \qquad x_{ijp} \sim \phi_{w|z_{ijp}p}, \quad (5)$$

together with priors on the parent topics and the strength parameters. The graphical model corresponding to this Hierarchical Distributed LDA (HD-LDA) is shown on the right of Figure 1, with standard LDA shown on the left for comparison. This model is different from the two other topic hierarchies we found in the literature, namely (1) the deeper version of the hierarchical Dirichlet process mentioned in [6], and (2) Pachinko allocation [7]. The first places a deeper hierarchical prior on $\theta$ (instead of on $\phi$), while the second deals with a document-specific hierarchy of topic assignments. These types of hierarchies do not suit our need to facilitate parallel computation.

Figure 1: (Left) Graphical model for LDA. (Right) Graphical model for HD-LDA. Variables are repeated over the indices of the random variables; square boxes indicate parameters.

As is the case for LDA, inference for HD-LDA is most efficient if we marginalize out $\theta$ and $\phi$, and we derive the conditional probabilities necessary for the Gibbs sampler in this collapsed space. In our experiments we learn MAP estimates for the global variables. Alternatively, one could derive Gibbs sampling equations for these variables using the auxiliary-variable method explained in [6], but we leave exploration of this inference technique for future research. Inference is thus based on integrating out $\theta$ and $\phi$, sampling the topic assignments $z$, and learning the MAP values of the global variables. The entire algorithm can be understood as expectation maximization on a collapsed space, where the M-step corresponds to the MAP updates and the E-step corresponds to sampling. As such, the proposed Monte Carlo EM (MCEM) algorithm is guaranteed to converge in expectation (e.g., [8]). The MAP learning rules are derived using the bounds of [9] and are expressed in terms of the digamma function $\Psi$. Careful selection of hyper-parameters is critical to making HD-LDA work well, and we used our experience with AD-LDA to guide these choices: the hyper-parameters are chosen so that the modes of the resulting priors match the values of $\alpha$ and $\beta$ used in our LDA and AD-LDA experiments.

We can view HD-LDA as a mixture model with $P$ LDA mixture components, where the data have been hard-assigned to their respective clusters (processors). The parameters of the clusters are generated from a shared prior distribution. This view clarifies the procedure we have adopted for testing: first we sample assignment variables for the first half of the test document (analogous to folding in). Given these samples, we compute the likelihood of the test document under the model for each processor. Assuming equal prior weights for each processor, we then compute responsibilities, which are given by the likelihoods normalized over processors. The probability of the remainder of the test document is then given by the responsibility-weighted average over the processors.
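As an illustration of this testing procedure, the sketch below (not from the paper) scores one test document given per-processor topic matrices. The fold-in step that produces each per-processor document mix is assumed to be available as a separate routine and is not shown; all names are placeholders.

```python
import numpy as np

def hdlda_test_loglik(fold_part, test_part, phis, thetas):
    """Responsibility-weighted test log-likelihood for HD-LDA (illustrative).

    phis[p]   : K x W topic matrix from processor p's model
    thetas[p] : K-vector document mix obtained by folding in `fold_part`
                under processor p's model (e.g. via a short Gibbs run, not shown)
    """
    P = len(phis)
    # Log-likelihood of the fold-in half of the document under each processor's model
    fold_loglik = np.array([
        np.sum(np.log(thetas[p] @ phis[p][:, fold_part])) for p in range(P)
    ])
    # Responsibilities: normalized likelihoods, assuming equal prior weight per processor
    resp = np.exp(fold_loglik - fold_loglik.max())
    resp /= resp.sum()
    # Probability of each held-out word, averaged over processors with the responsibilities
    word_probs = np.stack([thetas[p] @ phis[p][:, test_part] for p in range(P)])
    return np.sum(np.log(resp @ word_probs))
```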
4 Experiments

The two distributed algorithms are initialized by first randomly assigning topics to $z_p$, and then, from this assignment, counting topics in documents, $N_{kjp}$, and words in topics, $N_{wkp}$, for each processor. Recall that for AD-LDA the word-topic count arrays are the same on every processor (initially, and after every global update). For each run of each algorithm, a sample was taken after 500 iterations of the Gibbs sampler, well after the typical burn-in period of 200-300 iterations. Multiple processors were simulated in software (by separating the data, running sequentially through each processor, and simulating the global update step), except for the speedup experiments, which were run on a 16-processor computer.

It is not obvious a priori that the AD-LDA algorithm will in general converge to a useful result. Later in this section we describe a set of systematic empirical results with AD-LDA, but we first use an illustrative toy example to provide some insight into how AD-LDA learns a model. The toy example uses a vocabulary of $W = 3$ words and a small number of topics. The left panel of Figure 2 shows the L1 distance between the model's estimate of a particular topic-word distribution and the true distribution, as a function of Gibbs iterations, for both single-processor LDA and AD-LDA with $P = 2$. LDA and AD-LDA have qualitatively the same three-phase learning dynamics (for clarity, the results in this figure are plotted for a single run on a single data set; we observed qualitatively similar results over a large variety of such simulations). The first 4 or so iterations ("early burn-in") correspond to somewhat random movement close to the randomly initialized starting point. In the next phase ("burn-in"), both algorithms rapidly move through parameter space towards the posterior mode. Finally, at equilibrium, both are sampling around the mode.

Figure 2: (Left) L1 distance to the mode for LDA and for P = 2 AD-LDA. (Center) Projection of the topics onto the simplex, showing convergence to the mode. (Right) Same setup as the center panel, but with P = 10 processors.

The center panel of Figure 2 plots the same run in the 2-d planar simplex corresponding to the 3-word topic distribution. This panel shows the paths in parameter space of each model, taking a few small steps near the starting point (top right corner), moving down to the true solution (bottom left), and then sampling near the posterior mode for the rest of the iterations. For each Gibbs iteration, the parameters corresponding to each of the two individual processors, and those parameters after merging, are shown for AD-LDA. We observed that after the initial few iterations, the individual processor steps and the merge step each resulted in a move closer to the mode. The right panel of Figure 2 illustrates the same qualitative behavior as the center panel, but now for 10 processors.

One might worry that the AD-LDA algorithm would get "trapped" close to the initial starting point, e.g., due to repeated label mismatching of the topics across processors. In practice we have consistently observed that the algorithm quickly discards such configurations (due to the stochastic nature of the moves) and "latches" onto a consistent labeling that then rapidly moves it towards the posterior mode.

It is useful to think of AD-LDA as an approximation to stochastic ascent in the space of assignment variables $z$. On a single processor, one can view Gibbs sampling during burn-in as a stochastic algorithm that moves up the likelihood surface. With multiple processors, each processor computes an upward direction in its own subspace, keeping all other directions fixed. The global update step then recombines these directions by vector addition, in the same way one would compute a gradient using finite differences. This is expected to be accurate as long as the surface is locally convex or concave, but it will break down at saddle points.
We conjecture that AD-LDA works reliably because saddle points are (1) unstable and (2) rare, since the posterior for LDA models on high-dimensional count data sets often appears to be highly peaked.

To evaluate AD-LDA and HD-LDA systematically, we measured performance using test-set perplexity, computed as $\mathrm{Perp}(x^{\mathrm{test}}) = \exp\bigl(-\frac{1}{N^{\mathrm{test}}}\log p(x^{\mathrm{test}})\bigr)$. For every test document, half of the words (at random) are put in a fold-in part, and the remaining words are put in a test part. The document mix is learned using the fold-in part, and the log probability is computed using this mix and the words from the test part, ensuring that the test words are never seen before being used. For AD-LDA, the perplexity computation exactly follows that of LDA, since a single set of topic counts $N_{wk}$ is saved when a sample is taken. In contrast, all $P$ copies of the processor-specific topics are required to compute perplexity for HD-LDA, as described in the previous section. Except where stated, perplexities are computed for all algorithms using $S$ samples from the posterior (from 10 different chains) using

$$\log p(x^{\mathrm{test}}) = \sum_{jw} N^{\mathrm{test}}_{jw} \log \frac{1}{S}\sum_{s}\sum_{k} \hat\theta^{\,s}_{k|j}\,\hat\phi^{\,s}_{w|k}, \qquad \hat\theta^{\,s}_{k|j} = \frac{N^{s}_{kj} + \alpha}{N^{s}_{j} + K\alpha}, \qquad \hat\phi^{\,s}_{w|k} = \frac{N^{s}_{wk} + \beta}{N^{s}_{k} + W\beta}, \quad (8)$$

with the analogous expression being used for HD-LDA.

We compared LDA (Gibbs sampling on a single processor) and our two distributed algorithms, AD-LDA and HD-LDA, using three data sets: KOS (from dailykos.com), NIPS (from books.nips.cc) and NYTIMES (from ldc.upenn.edu). Each data set was split into a training set and a test set. The vocabulary size $W$ and the other size parameters for these data sets are shown in Table 1; for each corpus, $N$ is the total number of words. Using the three data sets and the three models, we computed test-set perplexities for a range of numbers of topics $T$ and, for our distributed models, for numbers of processors $P$ ranging from 10 to 1000.

Table 1: Size parameters for the three data sets used in perplexity and speedup experiments.

              D_train     W         N             D_test
  KOS         3,000       6,906     410,000       430
  NIPS        1,500       12,419    1,900,000     184
  NYTIMES     300,000     102,660   100,000,000   34,658

Figure 3: Test perplexity of models versus number of processors P for KOS (left) and NIPS (right), for several values of the number of topics T (T = 8, 16, 32, 64 for KOS; T = 10, 20, 40, 80 for NIPS). P = 1 corresponds to LDA (circles); AD-LDA (crosses) and HD-LDA (squares) are shown at P = 10 and P = 100.

Figure 3 clearly shows that, for a fixed number of topics, the perplexity results are essentially the same whether we use single-processor LDA or either of the two algorithms with data distributed across multiple processors (either 10 or 100). The figure shows the test-set perplexity for KOS (left) and NIPS (right) versus the number of processors $P$: the $P = 1$ perplexity is computed by LDA (circles), and we use our distributed models, AD-LDA (crosses) and HD-LDA (squares), to compute the $P = 10$ and $P = 100$ perplexities. Though not shown, perplexities for AD-LDA remained approximately constant as the number of processors was further increased to $P = 1000$ for KOS and $P = 500$ for NIPS, demonstrating effective distributed learning with only 3 documents on each processor.

It is worth emphasizing that, despite no formal convergence guarantees, the approximate distributed algorithm converged to good solutions in every single one of the more than one thousand experiments we ran using five real-world data sets, plus synthesized data sets designed to be "hard" to learn (i.e., with topics mutually exclusively distributed over processors); page limitations preclude a full description of all these results in this paper.
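The fold-in evaluation protocol described above can be summarized by the short sketch below (illustrative, not the paper's code). The `fold_in` routine, which learns a document mix from the fold-in words (e.g., by a short Gibbs run), is an assumed helper and is not shown.

```python
import numpy as np

def test_perplexity(test_docs, phi, alpha, fold_in, seed=0):
    """Fold in half of each test document, then score the other half.

    phi     : K x W topic matrix (a posterior sample or average)
    fold_in : function(word_ids, phi, alpha) -> K-vector document mix theta_hat
    """
    rng = np.random.default_rng(seed)
    total_loglik, total_words = 0.0, 0
    for doc in test_docs:
        doc = rng.permutation(np.asarray(doc))
        half = len(doc) // 2
        fold_part, test_part = doc[:half], doc[half:]
        theta_hat = fold_in(fold_part, phi, alpha)      # learn the document mix
        word_probs = theta_hat @ phi[:, test_part]      # p(w | theta_hat, phi)
        total_loglik += np.sum(np.log(word_probs))
        total_words += len(test_part)
    return np.exp(-total_loglik / total_words)          # Perp = exp(-log p / N_test)
```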
Figure 4: (Left) Test perplexity versus iteration. (Right) Test perplexity versus number of topics. Curves compare LDA with AD-LDA and HD-LDA at P = 10 and P = 100 (left) and at P = 10 (right), on the NIPS data set.

To properly determine the utility of the distributed algorithms, it is necessary to check whether the parallelized samplers systematically converge more slowly than single-processor sampling. If this were the case, it would mitigate the computational gains of parallelization. In fact, our experiments consistently showed (somewhat surprisingly) that the convergence rate for the distributed algorithms is just as rapid as for the single-processor case. As an example, Figure 4 (left) shows test perplexity versus iteration number of the Gibbs sampler on the NIPS data set. During burn-in, up to iteration 200, the distributed models are actually converging slightly faster than single-processor LDA. Also note that one iteration of AD-LDA (or HD-LDA) on a parallel computer takes only a fraction of the wall-clock time of one iteration of LDA.

Figure 5: (Left) Precision/recall results on the TREC AP and FR collections for TF-IDF, LDA, AD-LDA, and HD-LDA. (Right) Parallel speedup results for AD-LDA on up to 16 processors, compared with perfect speedup.

We also investigated whether the results were sensitive to the number of topics used in the models, e.g., perhaps the distributed algorithms' performance diverges when the number of topics becomes very large. Figure 4 (right) shows the test-set perplexity computed on the NIPS data set using $S$ posterior samples, as a function of the number of topics, for the different algorithms and a fixed number of processors (P = 10); not shown here are the results for the KOS data set, which were quite similar. The perplexities of the different algorithms closely track each other as $T$ varies. Sometimes the distributed algorithms produce slightly lower perplexities than those of single-processor LDA. This lower perplexity may be due to an internal averaging effect: for AD-LDA, parameters are constantly being split and merged, and for HD-LDA, test perplexity is computed using $P$ copies of saved parameters.

Finally, to demonstrate that the low perplexities obtained from the distributed algorithms with $P = 100$ processors are not just due to averaging effects, we split the NIPS corpus into one hundred 15-document collections and ran LDA separately on each of these hundred collections. The test perplexity computed by averaging the 100 separate LDA models was 2117, versus the $P = 100$ test perplexity of 1575 for AD-LDA and HD-LDA. This shows that simple averaging of results from separate processors does not perform nearly as well as distributed coordinated learning.

Our distributed algorithms also perform well under other performance metrics. We performed precision/recall calculations using TREC's AP and FR collections and measured performance using the well-known mean average precision (MAP) metric used in IR research. Figure 5 (left) again shows that AD-LDA and HD-LDA (both using P = 10) perform similarly to LDA.
All three LDA models have significantly higher precision than TF-IDF on the AP and FR collections (significance was computed using a t-test at the 0.05 level). These calculations were run with a fixed number of topics.

The per-processor, per-iteration time and space complexity of LDA and AD-LDA are shown in Table 2. AD-LDA's memory requirement scales well as collections grow, because while $N$ and $D$ can get arbitrarily large (which can be offset by increasing $P$), the vocabulary size $W$ asymptotes. Similarly, the time complexity scales well, since the leading-order $NK$ term is divided by $P$. The $KW$ term accounts for the communication cost of the reduce-scatter operation on the count differences $N_{wkp} - N_{wk}$, which is executed in stages across the processors. Because of this additional term, parallel efficiency will depend on the ratio of the $NK/P$ computation term to the $KW$ communication term, with efficiency increasing as this ratio increases. The space and time complexity of HD-LDA are similar to those of AD-LDA, but HD-LDA has larger constants.

Table 2: Per-processor, per-iteration space and time complexity of LDA and AD-LDA.

              Space                  Time
  LDA         N + K(D + W)           O(NK)
  AD-LDA      N/P + K(D/P + W)       O(NK/P + KW)

Using our large NYTIMES data set, we performed speedup experiments on a 16-processor SMP shared-memory computer using 1, 2, 4, 8 and 16 processors (since we did not have access to a distributed-memory computer). The single-processor LDA run with 1000 iterations on this data set involves on the order of $10^{15}$ floating point operations and takes more than 10 days on a 3 GHz workstation, so it is an ideal computation to speed up. The speedup results, shown in Figure 5 (right), show reasonable parallel efficiency, with a speedup of approximately 8 using P = 16 processors. This speedup reduces our NYTIMES 10-day run (880 sec/iteration on 1 processor) to the order of 1 day (105 sec/iteration on 16 processors). Note, however, that while the implementation on an SMP machine captures some distributed effects (e.g., time to synchronize), it does not accurately reflect the extra time for communication. We do expect that for problems with a large amount of data per processor, parallel efficiency will be high.
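To make the effect of the communication term concrete, the toy calculation below (illustrative only; the cost constants `c_comp` and `c_comm` are assumptions, not measured values from the paper) estimates the speedup implied by modeling one iteration as an $NK/P$ sampling cost plus a $KW$ reduce-scatter cost, as in Table 2.

```python
def predicted_speedup(N, K, W, P, c_comp=1.0, c_comm=1.0):
    """Rough speedup estimate from the complexity terms in Table 2.

    Per-iteration cost is modeled as c_comp*N*K/P for sampling plus
    c_comm*K*W for the reduce-scatter of the count differences; the
    constants are illustrative placeholders, not measured values.
    """
    serial = c_comp * N * K
    parallel = c_comp * N * K / P + c_comm * K * W
    return serial / parallel

# Example with NYTIMES-like sizes (N = 1e8 words, W ~ 1e5 vocabulary):
# the estimate improves as the NK/P term dominates the KW term.
print(predicted_speedup(N=1e8, K=100, W=1e5, P=16))
```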
5 Discussion and Conclusions

Prior work on parallelizing probabilistic learning algorithms has focused largely on EM-optimization algorithms, e.g., parallel updates of expected sufficient statistics for mixture models [2, 1]. In the statistical literature, running multiple MCMC chains in parallel is one approach to parallelization (e.g., the method of parallel tempering), but it requires that each processor store a copy of the full data set. Since MCMC is inherently sequential, parallel sampling using distributed subsets of the data will not in general yield a proper MCMC sampler except in special cases [10]. Mimno and McCallum [11] recently proposed the DCM-LDA model, in which processor-specific sets of topics are learned independently on each processor for local subsets of the data, without any communication between processors, followed by a global clustering of the topics from the different processors. While this method is highly scalable, it does not lead to a single global set of topics that represent individual documents, nor is it defined by a generative process.

We proposed two different approaches to distributing MCMC sampling across different processors for an LDA model. With AD-LDA we sample from an approximation to the posterior density by allowing different processors to concurrently sample latent topic assignments on their local subsets of the data. Despite having no formal convergence guarantees, AD-LDA works very well empirically and is easy to implement. With HD-LDA we adapt the underlying LDA model to map to the distributed computational infrastructure. While this model is more complicated than AD-LDA and slower to run (because of digamma evaluations), it inherits the usual convergence properties of MCEM. Careful selection of hyper-parameters was critical to making HD-LDA work well.

In conclusion, both of our proposed algorithms learn models with predictive performance that is no different from single-processor LDA. On each processor they burn in and converge at the same rate as LDA, yielding significant speedups in practice. The space and time complexity of both models make them scalable to enormous problems, for example, collections with billions to trillions of words. There are several potentially interesting research directions that can be pursued using the algorithms proposed here as a starting point, e.g., using asynchronous local communication (as opposed to the synchronous global communication used in this paper) and more complex schemes that allow data to adaptively move from one processor to another. The distributed scheme of AD-LDA can also be used to parallelize other machine learning algorithms. Using the same principles, we have implemented distributed versions of NMF and PLSA, and initial results suggest that these distributed algorithms also work well in practice.

6 Acknowledgements

This material is based upon work supported by the National Science Foundation: DN and PS were supported by NSF grants SCI-0225642, CNS-0551510, and IIS-0083489; AA was supported by an NSF graduate fellowship; and MW was supported by grants IIS-0535278 and IIS-0447903.

References

[1] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-Reduce for machine learning on multicore. In NIPS 19, pages 281-288. MIT Press, Cambridge, MA, 2007.
[2] W. Kowalczyk and N. Vlassis. Newscast EM. In NIPS 17, pages 713-720. MIT Press, Cambridge, MA, 2005.
[3] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. In 16th International World Wide Web Conference, 2007.
[4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 3:993-1022, 2003.
[5] T. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228-5235, 2004.
[6] Y. W. Teh, M. Jordan, M. Beal, and D. Blei. Sharing clusters among related groups: Hierarchical Dirichlet processes. In NIPS 17, pages 1385-1392. MIT Press, Cambridge, MA, 2005.
[7] W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML, pages 577-584, 2006.
[8] G. Wei and M. Tanner. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85(411):699-704, 1990.
[9] T. Minka. Estimating a Dirichlet distribution. http://research.microsoft.com/~minka/papers/dirichlet/, 2003.
[10] A. Brockwell. Parallel Markov chain Monte Carlo simulation by pre-fetching. Journal of Computational and Graphical Statistics, 15:246-261, 2006.
[11] D. Mimno and A. McCallum. Organizing the OCA: Learning faceted subjects from a library of digital books. In Joint Conference on Digital Libraries, pages 376-385, 2007.